Abstract
Semi-supervised video object segmentation aims to segment an object throughout a video given only the annotated mask of the first frame. Recently, memory-based methods have attracted increasing attention because of their significant performance improvements. However, these methods perform pixel-level matching based on similarity alone, without considering the trajectory or the features of the object, which may cause mismatches between object and non-object regions in complex scenarios. To alleviate this problem, we propose spatial and temporal guidance for semi-supervised video object segmentation. The proposed method takes into account the consistency of the object in the spatiotemporal domain and employs global matching to conduct pixel-level matching. We design a spatial guidance module (SGM) to track the trajectory of the object and a temporal guidance module (TGM) to focus on long-term object-level features from the first frame. The proposed spatial and temporal guidance effectively alleviates mismatching and makes the model more robust and efficient. Experiments on the YouTube-VOS and DAVIS benchmarks show that our method outperforms previous state-of-the-art methods while maintaining a fast inference speed.
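To make the notion of similarity-based pixel-level matching concrete, the following is a minimal sketch of a global memory readout, not the method proposed in this paper. The abstract does not specify the similarity function, so dot-product similarity with a softmax is an assumption here, and the function names `softmax` and `global_match` as well as the toy data are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_match(query_keys, memory_keys, memory_values):
    """For each query pixel, read a value from memory by similarity.

    query_keys:    list of Q key vectors (pixels of the current frame)
    memory_keys:   list of M key vectors (pixels of past frames)
    memory_values: list of M value vectors (e.g. encoded mask features)
    Dot-product similarity with softmax is an assumption for illustration.
    """
    dim = len(memory_values[0])
    readout = []
    for q in query_keys:
        # Compare this query pixel with every memory pixel (global matching).
        scores = [sum(a * b for a, b in zip(q, k)) for k in memory_keys]
        weights = softmax(scores)
        # Weighted sum of memory values, one entry per value channel.
        readout.append([
            sum(w * v[d] for w, v in zip(weights, memory_values))
            for d in range(dim)
        ])
    return readout


# Toy memory: one object pixel (value 1.0) and one background pixel (value 0.0).
memory_keys = [[1.0, 0.0], [0.0, 1.0]]
memory_values = [[1.0], [0.0]]
# Three query pixels: object-like, background-like, and ambiguous.
query_keys = [[5.0, 0.0], [0.0, 5.0], [1.0, 1.0]]
readout = global_match(query_keys, memory_keys, memory_values)
```

In this toy setting the third query pixel is equally similar to the object and the background entries, so similarity alone cannot resolve it. This is the kind of ambiguity that additional spatial or temporal cues are meant to reduce.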
Acknowledgements
This work was supported by the National Natural Science Foundation of China (61972059, 61773272, 62102347), the China Postdoctoral Science Foundation (2021M69236), the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172017K18), the Natural Science Foundation of Jiangsu Province (BK20191474, BK20191475, BK20161268), and the Qinglan Project of Jiangsu Province (No. 2020).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, G., Gong, S., Zhong, S., Zhou, L. (2023). Spatial and Temporal Guidance for Semi-supervised Video Object Segmentation. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Lecture Notes in Computer Science, vol 13625. Springer, Cham. https://doi.org/10.1007/978-3-031-30111-7_9
DOI: https://doi.org/10.1007/978-3-031-30111-7_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-30110-0
Online ISBN: 978-3-031-30111-7
eBook Packages: Computer Science (R0)