- Research article, February 2024
Zero-shot aerial object detection with visual description regularization
AAAI'24/IAAI'24/EAAI'24: Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence. Article No.: 770, Pages 6926–6934. https://doi.org/10.1609/aaai.v38i7.28518
Existing object detection models are mainly trained on large-scale labeled datasets. However, annotating data for novel aerial object classes is expensive since it is time-consuming and may require expert knowledge. Thus, it is desirable to study label-...
- Research article, February 2024
Diverse and aligned audio-to-video generation via text-to-video model adaptation
AAAI'24/IAAI'24/EAAI'24. Article No.: 738, Pages 6639–6647. https://doi.org/10.1609/aaai.v38i7.28486
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos are required to be aligned both globally and temporally with the input audio: globally, the ...
- Research article, February 2024
Multi-modal prompting for open-vocabulary video visual relationship detection
AAAI'24/IAAI'24/EAAI'24. Article No.: 724, Pages 6513–6521. https://doi.org/10.1609/aaai.v38i7.28472
Open-vocabulary video visual relationship detection aims to extend video visual relationship detection beyond annotated categories by detecting unseen relationships between objects in videos. Recent progresses in open-vocabulary perception, primarily ...
- Research article, February 2024
A multimodal, multi-task adapting framework for video action recognition
- Mengmeng Wang,
- Jiazheng Xing,
- Boyuan Jiang,
- Jun Chen,
- Jianbiao Mei,
- Xingxing Zuo,
- Guang Dai,
- Jingdong Wang,
- Yong Liu
AAAI'24/IAAI'24/EAAI'24. Article No.: 613, Pages 5517–5525. https://doi.org/10.1609/aaai.v38i6.28361
Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with the technology of Parameter-Efficient FineTuning (PEFT), has captured substantial attraction in video action recognition. Nevertheless, prevailing approaches tend ...
- Research article, February 2024
Bias-conflict sample synthesis and adversarial removal debias strategy for temporal sentence grounding in video
AAAI'24/IAAI'24/EAAI'24. Article No.: 504, Pages 4533–4541. https://doi.org/10.1609/aaai.v38i5.28252
Temporal Sentence Grounding in Video (TSGV) is troubled by dataset bias issue, which is caused by the uneven temporal distribution of the target moments for samples with similar semantic components in input videos or query texts. Existing methods resort ...
- Research article, February 2024
Towards balanced alignment: modal-enhanced semantic modeling for video moment retrieval
AAAI'24/IAAI'24/EAAI'24. Article No.: 429, Pages 3855–3863. https://doi.org/10.1609/aaai.v38i4.28177
Video Moment Retrieval (VMR) aims to retrieve temporal segments in untrimmed videos corresponding to a given language query by constructing cross-modal alignment strategies. However, these existing strategies are often sub-optimal since they ignore the ...
- Research article, February 2024
TD2-net: toward denoising and debiasing for dynamic scene graph generation
AAAI'24/IAAI'24/EAAI'24. Article No.: 389, Pages 3495–3503. https://doi.org/10.1609/aaai.v38i4.28137
Dynamic scene graph generation (SGG) focuses on detecting objects in a video and determining their pairwise relationships. Existing dynamic SGG methods usually suffer from several issues, including 1) Contextual noise, as some frames might contain ...
- Research article, February 2024
Exploring domain incremental video highlights detection with the LiveFood benchmark
AAAI'24/IAAI'24/EAAI'24. Article No.: 1132, Pages 10155–10163. https://doi.org/10.1609/aaai.v38i9.28880
Video highlights detection (VHD) is an active research field in computer vision, aiming to locate the most user-appealing clips given raw video inputs. However, most VHD methods are based on the closed world assumption, i.e., a fixed number of highlight ...
- Research article, February 2024
ViSTec: video modeling for sports technique recognition and tactical analysis
AAAI'24/IAAI'24/EAAI'24. Article No.: 944, Pages 8490–8498. https://doi.org/10.1609/aaai.v38i8.28692
The immense popularity of racket sports has fueled substantial demand in tactical analysis with broadcast videos. However, existing manual methods require laborious annotation, and recent attempts leveraging video perception models are limited to low-...
- Research article, February 2024
Spatio-temporal fusion for human action recognition via Joint Trajectory Graph
AAAI'24/IAAI'24/EAAI'24. Article No.: 842, Pages 7579–7587. https://doi.org/10.1609/aaai.v38i7.28590
Graph Convolutional Networks (GCNs) and Transformers have been widely applied to skeleton-based human action recognition, with each offering unique advantages in capturing spatial relationships and long-range dependencies. However, for most GCN methods, ...
- Research article, February 2024
TF-CLIP: learning text-free CLIP for video-based person re-identification
AAAI'24/IAAI'24/EAAI'24. Article No.: 752, Pages 6764–6772. https://doi.org/10.1609/aaai.v38i7.28500
Large-scale language-image pre-trained models (e.g., CLIP) have shown superior performances on many cross-modal retrieval tasks. However, the problem of transferring the knowledge learned from such models to video-based person re-identification (ReID) has ...
- Research article, February 2024
Referred by multi-modality: a unified temporal transformer for video object segmentation
- Shilin Yan,
- Renrui Zhang,
- Ziyu Guo,
- Wenchao Chen,
- Wei Zhang,
- Hongyang Li,
- Yu Qiao,
- Hao Dong,
- Zhongjiang He,
- Peng Gao
AAAI'24/IAAI'24/EAAI'24. Article No.: 717, Pages 6449–6457. https://doi.org/10.1609/aaai.v38i6.28465
Recently, video object segmentation (VOS) referred by multimodal signals, e.g., language and audio, has evoked increasing attention in both industry and academia. It is challenging for exploring the semantic alignment within modalities and the visual ...
- Research article, February 2024
MuLTI: efficient video-and-language understanding with text-guided MultiWay-sampler and multiple choice modeling
AAAI'24/IAAI'24/EAAI'24. Article No.: 700, Pages 6297–6305. https://doi.org/10.1609/aaai.v38i6.28448
Video-and-language understanding has a variety of applications in the industry, such as video question answering, text-video retrieval, and multi-label classification. Existing video-and-language understanding methods generally adopt heavy multi-modal ...
- Research article, February 2024
Dynamic semantic-based spatial graph convolution network for skeleton-based human action recognition
AAAI'24/IAAI'24/EAAI'24. Article No.: 692, Pages 6225–6233. https://doi.org/10.1609/aaai.v38i6.28440
Graph convolutional networks (GCNs) have attracted great attention and achieved remarkable performance in skeleton-based action recognition. However, most of the previous works are designed to refine skeleton topology without considering the types of ...
- Research article, February 2024
Temporal correlation vision transformer for video person re-identification
AAAI'24/IAAI'24/EAAI'24. Article No.: 676, Pages 6083–6091. https://doi.org/10.1609/aaai.v38i6.28424
Video Person Re-Identification (Re-ID) is a task of retrieving persons from multi-camera surveillance systems. Despite the progress made in leveraging spatio-temporal information in videos, occlusion in dense crowds still hinders further progress. To ...
- Research article, February 2024
GMMFormer: gaussian-mixture-model based transformer for efficient partially relevant video retrieval
AAAI'24/IAAI'24/EAAI'24. Article No.: 641, Pages 5767–5775. https://doi.org/10.1609/aaai.v38i6.28389
Given a text query, partially relevant video retrieval (PRVR) seeks to find untrimmed videos containing pertinent moments in a database. For PRVR, clip modeling is essential to capture the partial relationship between texts and videos. Current PRVR ...
- Research article, February 2024
Prompting segmentation with sound is generalizable audio-visual source localizer
AAAI'24/IAAI'24/EAAI'24. Article No.: 630, Pages 5669–5677. https://doi.org/10.1609/aaai.v38i6.28378
Never having seen an object and heard its sound simultaneously, can the model still accurately localize its visual position from the input audio? In this work, we concentrate on the Audio-Visual Localization and Segmentation tasks but under the demanding ...
- Research article, February 2024
CoVR: learning composed video retrieval from web video captions
AAAI'24/IAAI'24/EAAI'24. Article No.: 586, Pages 5270–5279. https://doi.org/10.1609/aaai.v38i6.28334
Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-...
- Research article, February 2024
Open-vocabulary video relation extraction
AAAI'24/IAAI'24/EAAI'24. Article No.: 580, Pages 5215–5223. https://doi.org/10.1609/aaai.v38i6.28328
A comprehensive understanding of videos is inseparable from describing the action with its contextual action-object interactions. However, many current video understanding tasks prioritize general action classification and overlook the actors and ...
- Research article, February 2024
Towards efficient and effective text-to-video retrieval with coarse-to-fine visual representation learning
AAAI'24/IAAI'24/EAAI'24. Article No.: 579, Pages 5207–5214. https://doi.org/10.1609/aaai.v38i6.28327
In recent years, text-to-video retrieval methods based on CLIP have experienced rapid development. The primary direction of evolution is to exploit the much wider gamut of visual and textual cues to achieve alignment. Concretely, those methods with ...