Track Everything Everywhere Fast and Robustly

Song, Yunzhou; Lei, Jiahui; Wang, Ziyun; Liu, Lingjie; Daniilidis, Kostas

doi:10.1007/978-3-031-72646-0_20

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15061)

Included in the following conference series:

European Conference on Computer Vision

Abstract

We propose a novel test-time optimization approach for efficiently and robustly tracking any pixel at any time in a video. The latest state-of-the-art optimization-based tracking technique, OmniMotion, requires a prohibitively long optimization time, rendering it impractical for downstream applications. OmniMotion is also sensitive to the choice of random seeds, leading to unstable convergence. To improve efficiency and robustness, we introduce a novel invertible deformation network, CaDeX++, which factorizes the function representation into a local spatial-temporal feature grid and enhances the expressivity of the coupling blocks with non-linear functions. While CaDeX++ incorporates a stronger geometric bias within its architectural design, it also takes advantage of the inductive bias provided by vision foundation models. Our system utilizes monocular depth estimation to represent scene geometry and enhances the objective by incorporating DINOv2 long-term semantics to regulate the optimization process. Our experiments demonstrate a substantial improvement in training speed (more than 10 times faster), robustness, and tracking accuracy over the state-of-the-art optimization-based method, OmniMotion.
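
The abstract refers to an invertible deformation network built from coupling blocks. As a rough illustration of why such blocks are invertible by construction, the following is a minimal NumPy sketch of a Real NVP-style affine coupling layer acting on 3D points. It is not the CaDeX++ architecture: the class `AffineCoupling`, the helpers `make_stack`, `to_canonical` and `from_canonical`, the tiny random MLP, and the use of a separate layer stack per frame are all assumptions made for illustration only.

```python
import numpy as np


class AffineCoupling:
    """Real NVP-style affine coupling on 3D points (illustrative, hypothetical)."""

    def __init__(self, keep_dims, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        # Coordinates in `keep` pass through unchanged and condition the rest.
        self.keep = np.array(keep_dims)
        self.move = np.array([d for d in range(3) if d not in keep_dims])
        n_in, n_out = len(self.keep), len(self.move)
        # Tiny fixed random MLP standing in for a learned network.
        self.w1 = rng.normal(scale=0.5, size=(n_in, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(scale=0.5, size=(hidden, 2 * n_out))
        self.b2 = np.zeros(2 * n_out)

    def _scale_shift(self, kept):
        h = np.tanh(kept @ self.w1 + self.b1)
        out = h @ self.w2 + self.b2
        log_s, t = np.split(out, 2, axis=-1)
        return np.tanh(log_s), t  # bounded log-scale keeps exp() well behaved

    def forward(self, x):
        y = x.copy()
        log_s, t = self._scale_shift(x[:, self.keep])
        y[:, self.move] = x[:, self.move] * np.exp(log_s) + t
        return y

    def inverse(self, y):
        # The kept coordinates are unchanged, so scale and shift can be
        # recomputed exactly and the affine map undone in closed form.
        x = y.copy()
        log_s, t = self._scale_shift(y[:, self.keep])
        x[:, self.move] = (y[:, self.move] - t) * np.exp(-log_s)
        return x


def make_stack(seed):
    # Alternate which coordinates are kept so every coordinate gets updated.
    return [
        AffineCoupling([0, 1], seed=seed),
        AffineCoupling([2], seed=seed + 1),
        AffineCoupling([0], seed=seed + 2),
    ]


def to_canonical(stack, x):
    for layer in stack:
        x = layer.forward(x)
    return x


def from_canonical(stack, u):
    for layer in reversed(stack):
        u = layer.inverse(u)
    return u


# Map points from frame i to frame j through a shared canonical space,
# then back again; the round trip should recover the original points.
pts_i = np.random.default_rng(42).uniform(-1.0, 1.0, size=(5, 3))
frame_i, frame_j = make_stack(10), make_stack(20)
pts_j = from_canonical(frame_j, to_canonical(frame_i, pts_i))
pts_back = from_canonical(frame_i, to_canonical(frame_j, pts_j))
print(np.allclose(pts_back, pts_i))
```

Because each layer leaves its conditioning coordinates untouched, the inverse needs no iterative solver, which is the property that makes frame-to-canonical-to-frame mappings cycle-consistent in this sketch.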

Y. Song, J. Lei—Contributed equally to this work.

Acknowledgements

We gratefully acknowledge the financial support through the NSF IIS-RI 2212433 grant, and a gift from AWS AI to Penn Engineering’s ASSET Center for Trustworthy AI.

Author information

Authors and Affiliations

University of Pennsylvania, Philadelphia, USA
Yunzhou Song, Jiahui Lei, Ziyun Wang, Lingjie Liu & Kostas Daniilidis
Archimedes, Athena RC, Marousi, Greece
Kostas Daniilidis

Corresponding author

Correspondence to Jiahui Lei.

Editor information

Editors and Affiliations

University of Birmingham, Birmingham, UK
Aleš Leonardis
University of Trento, Trento, Italy
Elisa Ricci
Technical University of Darmstadt, Darmstadt, Germany
Stefan Roth
Princeton University, Princeton, NJ, USA
Olga Russakovsky
Czech Technical University in Prague, Prague, Czech Republic
Torsten Sattler
École des Ponts ParisTech, Marne-la-Vallée, France
Gül Varol

1 Electronic supplementary material

Supplementary material 1 (zip 93612 KB)

Cite this paper

Song, Y., Lei, J., Wang, Z., Liu, L., Daniilidis, K. (2025). Track Everything Everywhere Fast and Robustly. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15061. Springer, Cham. https://doi.org/10.1007/978-3-031-72646-0_20

DOI: https://doi.org/10.1007/978-3-031-72646-0_20
Published: 28 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72645-3
Online ISBN: 978-3-031-72646-0
eBook Packages: Computer Science, Computer Science (R0)
