COMatchNet: Co-Attention Matching Network for Video Object Segmentation

Huang, Lufei; Sun, Fengming; Yuan, Xia

doi:10.1007/978-3-031-02375-0_20


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13188)

Included in the following conference series:

Asian Conference on Pattern Recognition


Abstract

Semi-supervised video object segmentation (semi-VOS) predicts pixel-accurate masks of the target objects in all frames, given the ground-truth mask provided in the first frame. A critical challenge in this task is how to model the dependency between the query frame and other frames. Most methods neglect this inherent relevance or do not make full use of it. In this paper, we propose a novel network called Co-Attention Matching Network (COMatchNet) for semi-VOS. COMatchNet mainly consists of a co-attention module and a matching module. The co-attention module extracts the frame correlation among the query frame, the previous frame, and the first frame. The matching module calculates pixel-level matching scores and finds the regions in the query frame that are most similar to the preceding frames. COMatchNet integrates these two levels of information and generates fine-grained segmentation masks. We conduct extensive experiments on three popular video object segmentation benchmarks: DAVIS 2016, DAVIS 2017, and YouTube-VOS. COMatchNet achieves competitive performance (J&F) of 86.8%, 75.9%, and 81.4% on these benchmarks, respectively.
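The abstract names two ingredients, frame-level co-attention and pixel-level matching, without giving their formulas. The NumPy sketch below illustrates one generic way such operations are commonly written; it is not the authors' implementation. The function names, the dot-product affinity, the softmax normalisation, and the use of cosine similarity with a maximum over reference foreground pixels are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def co_attention(query_feat, ref_feat):
    """Generic co-attention between two frames (assumed formulation).

    query_feat: (Nq, C) query-frame pixel embeddings.
    ref_feat:   (Nr, C) reference-frame pixel embeddings.
    Returns query features attended by the reference, shape (Nq, C),
    and reference features attended by the query, shape (Nr, C).
    """
    affinity = query_feat @ ref_feat.T                    # (Nq, Nr) pairwise affinities
    query_ctx = softmax(affinity, axis=1) @ ref_feat      # each query pixel summarises the reference
    ref_ctx = softmax(affinity, axis=0).T @ query_feat    # each reference pixel summarises the query
    return query_ctx, ref_ctx


def pixel_matching_scores(query_feat, ref_feat, ref_mask):
    """Best cosine similarity of each query pixel to any reference foreground pixel.

    ref_mask: (Nr,) boolean array, True where the reference pixel is on the target.
    Returns an (Nq,) array of scores in [-1, 1].
    """
    if not ref_mask.any():
        # No foreground to match against: report the lowest possible score.
        return np.full(len(query_feat), -1.0)
    q = l2_normalize(query_feat)
    r = l2_normalize(ref_feat[ref_mask])
    return (q @ r.T).max(axis=1)


# Toy usage: 6 reference pixels with 4-dimensional embeddings, 2 of them foreground.
rng = np.random.default_rng(0)
ref = rng.normal(size=(6, 4))
mask = np.array([True, True, False, False, False, False])
query = np.vstack([ref[0], -ref[0]])   # one pixel identical to a foreground pixel, one opposite
scores = pixel_matching_scores(query, ref, mask)
query_ctx, ref_ctx = co_attention(query, ref)
```

In this toy setup the first query pixel should score close to 1 because it coincides with a reference foreground embedding. A full model would combine such scores with the attended features before decoding a mask, which the abstract describes only as integrating the two levels of information.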



Author information

Authors and Affiliations

Nanjing University of Science and Technology, 210094, Nanjing, China
Lufei Huang, Fengming Sun & Xia Yuan


Corresponding author

Correspondence to Xia Yuan.

Editor information

Editors and Affiliations

Korea University, Seoul, Korea (Republic of)
Christian Wallraven
Nanjing University, Nanjing, China
Qingshan Liu
Osaka University, Osaka, Japan
Hajime Nagahara


About this paper

Cite this paper

Huang, L., Sun, F., Yuan, X. (2022). COMatchNet: Co-Attention Matching Network for Video Object Segmentation. In: Wallraven, C., Liu, Q., Nagahara, H. (eds) Pattern Recognition. ACPR 2021. Lecture Notes in Computer Science, vol 13188. Springer, Cham. https://doi.org/10.1007/978-3-031-02375-0_20


DOI: https://doi.org/10.1007/978-3-031-02375-0_20
Published: 11 May 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-02374-3
Online ISBN: 978-3-031-02375-0
eBook Packages: Computer Science, Computer Science (R0)
