DOI: 10.1145/3581783.3612017

Temporally Efficient Gabor Transformer for Unsupervised Video Object Segmentation

Published: 27 October 2023

Abstract

Spatial-temporal structural details of targets in video (e.g., edges and textures that vary over time) are essential to accurate Unsupervised Video Object Segmentation (UVOS). The vanilla multi-head self-attention in Transformer-based UVOS methods tends to concentrate on general low-frequency information (e.g., illumination and color) while neglecting high-frequency texture details, leading to unsatisfactory segmentation results. To address this issue, this paper presents a Temporally efficient Gabor Transformer (TGFormer) for UVOS. The TGFormer jointly models spatial dependencies and temporal coherence within and across frames, fully capturing the rich structural details needed for accurate UVOS. Concretely, we first propose an effective learnable Gabor filtering Transformer that mines the structural texture details of the object. Then, to avoid storing redundant neighboring historical information, we present an efficient dynamic neighboring-frame selection module that automatically chooses useful temporal information, which simultaneously mitigates the effect of blurry frames and reduces the computation burden. Finally, we build the UVOS model as a fully Transformer-based architecture that aggregates information from the spatial, Gabor, and temporal domains, yielding a strong representation with rich structural details. Extensive experiments on five mainstream UVOS benchmarks (DAVIS2016, FBMS, DAVSOD, ViSal, and MCL) demonstrate the superiority of the presented solution over state-of-the-art methods.
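
The abstract only names the learnable Gabor filtering component; the paper's implementation is not reproduced on this page. As a rough illustration of what a learnable Gabor filter bank over backbone feature maps can look like, here is a minimal PyTorch sketch. The class name, per-channel parameterization, and kernel size are assumptions made for illustration, not the authors' design.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnableGaborFilter(nn.Module):
    """Depthwise filtering of feature maps with a bank of Gabor kernels whose
    orientation, frequency, bandwidth, and phase are learned end-to-end."""

    def __init__(self, channels: int, kernel_size: int = 7):
        super().__init__()
        self.channels = channels
        self.kernel_size = kernel_size
        # One learnable (theta, frequency, sigma, phase) tuple per channel.
        self.theta = nn.Parameter(torch.rand(channels) * math.pi)
        self.freq = nn.Parameter(torch.rand(channels) * math.pi / 2 + 0.1)
        self.sigma = nn.Parameter(torch.full((channels,), kernel_size / 4.0))
        self.psi = nn.Parameter(torch.zeros(channels))

    def _make_kernels(self) -> torch.Tensor:
        k = self.kernel_size
        half = (k - 1) / 2.0
        ys, xs = torch.meshgrid(
            torch.linspace(-half, half, k, device=self.theta.device),
            torch.linspace(-half, half, k, device=self.theta.device),
            indexing="ij",
        )
        xs = xs.unsqueeze(0)  # (1, k, k), broadcast over channels
        ys = ys.unsqueeze(0)
        theta = self.theta.view(-1, 1, 1)
        sigma = self.sigma.view(-1, 1, 1).clamp(min=1e-3)
        freq = self.freq.view(-1, 1, 1)
        psi = self.psi.view(-1, 1, 1)
        # Rotate coordinates, then apply the Gaussian envelope times a cosine carrier.
        x_rot = xs * torch.cos(theta) + ys * torch.sin(theta)
        y_rot = -xs * torch.sin(theta) + ys * torch.cos(theta)
        gabor = torch.exp(-(x_rot ** 2 + y_rot ** 2) / (2 * sigma ** 2)) \
            * torch.cos(freq * x_rot + psi)
        return gabor.unsqueeze(1)  # (channels, 1, k, k) for depthwise conv

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from a backbone/transformer stage.
        kernels = self._make_kernels()
        return F.conv2d(x, kernels, padding=self.kernel_size // 2, groups=self.channels)


if __name__ == "__main__":
    feat = torch.randn(2, 64, 56, 56)          # toy spatial features for one frame
    gabor_branch = LearnableGaborFilter(channels=64)
    out = gabor_branch(feat)
    print(out.shape)                           # torch.Size([2, 64, 56, 56])
```

In a Transformer block, such a branch would typically run alongside (or in place of) the self-attention token mixer so that high-frequency, orientation-selective responses are preserved; how TGFormer actually fuses the spatial, Gabor, and temporal domains is detailed in the paper itself.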

Supplemental Material

MP4 File
This paper presents a Temporally efficient Gabor Transformer (TGFormer) for UVOS. The TGFormer jointly models the spatial dependencies and temporal coherence intra- and inter-frames, which can fully capture the rich structural details for accurate UVOS. Extensive experiments on five mainstream UVOS benchmarks (DAVIS2016, FBMS, DAVSOD, ViSal, and MCL) demonstrate the superiority of the presented solution to state-of-the-art methods.
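
The dynamic neighboring-frame selection is likewise only summarized in the abstract. The sketch below shows one plausible, hypothetical way to realize the general idea it describes: score candidate neighboring frames against the current frame and keep only the top-k, so that blurry or redundant frames are skipped and computation is reduced. The module name, scoring head, and hard top-k selection are illustrative assumptions; the actual TGFormer module may use a different scoring function or selection scheme.

```python
import torch
import torch.nn as nn


class DynamicFrameSelector(nn.Module):
    """Scores candidate neighboring frames against the current frame and keeps
    only the top-k most useful ones."""

    def __init__(self, dim: int, keep: int = 2):
        super().__init__()
        self.keep = keep
        self.score = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(inplace=True),
            nn.Linear(dim, 1),
        )

    def forward(self, cur: torch.Tensor, neighbors: torch.Tensor):
        # cur: (B, D) pooled feature of the current frame
        # neighbors: (B, T, D) pooled features of T candidate neighboring frames
        B, T, D = neighbors.shape
        paired = torch.cat([cur.unsqueeze(1).expand(-1, T, -1), neighbors], dim=-1)
        scores = self.score(paired).squeeze(-1)           # (B, T) usefulness scores
        idx = scores.topk(self.keep, dim=1).indices       # indices of retained frames
        chosen = torch.gather(
            neighbors, 1, idx.unsqueeze(-1).expand(-1, -1, D)
        )                                                 # (B, keep, D)
        return chosen, idx


if __name__ == "__main__":
    cur = torch.randn(2, 256)
    neighbors = torch.randn(2, 5, 256)
    selector = DynamicFrameSelector(dim=256, keep=2)
    chosen, idx = selector(cur, neighbors)
    print(chosen.shape, idx.shape)   # torch.Size([2, 2, 256]) torch.Size([2, 2])
```

Note that a hard top-k is not differentiable with respect to the selection itself; a soft weighting (e.g., score-weighted aggregation or a Gumbel-softmax relaxation) would be a common alternative during training.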


Cited By

  • (2025) FFS-Net: Fourier-based segmentation of colon cancer glands using frequency and spatial edge interaction. Expert Systems with Applications, vol. 262, 125527. DOI: 10.1016/j.eswa.2024.125527. Online publication date: Mar 2025.


    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. gabor filtering
    2. spatio-temporal information selection
    3. unsupervised video object segmentation
    4. video transformer

    Qualifiers

    • Research-article

    Funding Sources

    • NSFC
    • National Key Research and Development Program of China

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Article Metrics

    • Downloads (last 12 months): 193
    • Downloads (last 6 weeks): 18
    Reflects downloads up to 09 Nov 2024

