BlinkVision: A Benchmark for Optical Flow, Scene Flow and Point Tracking Estimation Using RGB Frames and Events

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Recent advances in event-based vision suggest that event cameras complement traditional cameras by providing continuous observation without frame-rate limitations and with high dynamic range, properties well suited to correspondence tasks such as optical flow and point tracking. However, comprehensive benchmarks for correspondence tasks that cover both event data and images are still lacking. To fill this gap, we propose BlinkVision, a large-scale and diverse benchmark with rich modalities and dense correspondence annotations. BlinkVision has several appealing properties: 1) Rich modalities: it encompasses both event data and RGB images. 2) Rich annotations: it provides dense per-pixel annotations covering optical flow, scene flow, and point tracking. 3) Large vocabulary: it incorporates 410 everyday categories, sharing common classes with widely used 2D and 3D datasets such as LVIS and ShapeNet. 4) Naturalistic: it delivers photorealistic data and covers a variety of naturalistic factors such as camera shake and deformation. BlinkVision enables extensive benchmarking of three types of correspondence tasks (optical flow, point tracking, and scene flow estimation) for both image-based and event-based methods, leading to new observations, practices, and insights for future research. The benchmark website is https://www.blinkvision.net/.

Y. Li, Y. Shen, and Z. Huang contributed equally to this work.
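
The full paper is behind the access wall on this page, so BlinkVision's exact evaluation protocol is not reproduced here. As a rough, non-authoritative illustration of the kind of correspondence metric such optical-flow benchmarks commonly report, the short Python sketch below computes the average end-point error (EPE) between a predicted and a ground-truth flow field under a validity mask; the function name, array shapes, and variable names are illustrative assumptions, not part of the BlinkVision toolkit.

    import numpy as np

    def endpoint_error(flow_pred, flow_gt, valid=None):
        """Average end-point error (EPE), in pixels, between two flow fields.

        flow_pred, flow_gt: (H, W, 2) arrays of per-pixel (dx, dy) displacements.
        valid:              optional (H, W) boolean mask of annotated pixels.
        """
        # Euclidean distance between predicted and ground-truth displacement vectors.
        epe = np.linalg.norm(flow_pred - flow_gt, axis=-1)
        if valid is not None:
            epe = epe[valid]
        return float(epe.mean())

    # Toy usage with random data standing in for a real benchmark sample.
    h, w = 480, 640
    pred = np.random.randn(h, w, 2).astype(np.float32)
    gt = np.random.randn(h, w, 2).astype(np.float32)
    mask = np.ones((h, w), dtype=bool)
    print(f"EPE: {endpoint_error(pred, gt, mask):.3f} px")

Point-tracking evaluations typically report the same per-point distance averaged over visible query points and frames, so an analogous routine applies with (N, 2) trajectories in place of a dense field.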

References

  1. ActorCore. https://actorcore.reallusion.com/. Accessed 17 Nov 2023

  2. Blender. https://www.blender.org/. Accessed 17 Nov 2023

  3. Evermotion Archinteriors Collection. https://evermotion.org/. Accessed 11 Nov 2023

  4. Mixamo. https://www.mixamo.com/. Accessed 17 Nov 2023

  5. Alzugaray, I., Chli, M.: Asynchronous multi-hypothesis tracking of features with event cameras. In: 2019 International Conference on 3D Vision (3DV), pp. 269–278. IEEE (2019)

  6. Alzugaray, I., Chli, M.: HASTE: multi-hypothesis asynchronous speeded-up tracking of events. In: 31st British Machine Vision Virtual Conference (BMVC 2020), p. 744. ETH Zurich, Institute of Robotics and Intelligent Systems (2020)

  7. Barron, J.L., Fleet, D.J., Beauchemin, S.S.: Performance of optical flow techniques. Int. J. Comput. Vis. 12, 43–77 (1994)

  8. Bian, W., Huang, Z., Shi, X., Dong, Y., Li, Y., Li, H.: Context-TAP: tracking any point demands spatial context features. arXiv preprint arXiv:2306.02000 (2023)

  9. Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012 Part VI. LNCS, vol. 7577, pp. 611–625. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33783-3_44

  10. Chang, A.X., et al.: ShapeNet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012 (2015)

  11. Ding, Z., et al.: Spatio-temporal recurrent networks for event-based optical flow estimation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 525–533 (2022)

  12. Doersch, C., et al.: TAP-Vid: a benchmark for tracking any point in a video. In: Advances in Neural Information Processing Systems, vol. 35, pp. 13610–13626 (2022)

  13. Gallego, G., et al.: Event-based vision: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 44(1), 154–180 (2020)

  14. Gehrig, D., Gehrig, M., Hidalgo-Carrió, J., Scaramuzza, D.: Video to events: recycling video datasets for event cameras. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3586–3595 (2020)

  15. Gehrig, M., Aarents, W., Gehrig, D., Scaramuzza, D.: DSEC: a stereo event camera dataset for driving scenarios. IEEE Robot. Autom. Lett. 6(3), 4947–4954 (2021)

  16. Gehrig, M., Millhäusler, M., Gehrig, D., Scaramuzza, D.: E-RAFT: dense optical flow from event cameras. In: Proceedings of the International Conference on 3D Vision, pp. 197–206. IEEE (2021)

  17. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. IEEE (2012)

  18. Greff, K., et al.: Kubric: a scalable dataset generator. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3749–3761 (2022)

  19. Grossberg, M.D., Nayar, S.K.: What is the space of camera response functions? In: 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings, vol. 2, pp. II–602. IEEE (2003)

  20. Gupta, A., Dollar, P., Girshick, R.: LVIS: a dataset for large vocabulary instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5356–5364 (2019)

  21. Harley, A.W., Fang, Z., Fragkiadaki, K.: Particle video revisited: tracking through occlusions using point trajectories. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13682, pp. 59–75. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20047-2_4

  22. Hidalgo-Carrió, J., Gallego, G., Scaramuzza, D.: Event-aided direct sparse odometry. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5781–5790 (2022)

  23. Hu, J., et al.: CG-SLAM: efficient dense RGB-D SLAM in a consistent uncertainty-aware 3D gaussian field. arXiv preprint arXiv:2403.16095 (2024)

  24. Hu, J., Mao, M., Bao, H., Zhang, G., Cui, Z.: CP-SLAM: collaborative neural point-based SLAM system. In: Advances in Neural Information Processing Systems, vol. 36 (2024)

  25. Huang, Z., et al.: NeuralMarker: a framework for learning general marker correspondence. ACM Trans. Graph. (TOG) 41(6), 1–10 (2022)

  26. Huang, Z., et al.: FlowFormer: a transformer architecture for optical flow. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13677, pp. 668–685. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19790-1_40

  27. Huang, Z., et al.: FlowFormer: a transformer architecture and its masked cost volume autoencoding for optical flow. arXiv preprint arXiv:2306.05442 (2023)

  28. Huang, Z., et al.: VS-Net: voting with segmentation for visual localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6101–6111 (2021)

  29. Jiang, S., Campbell, D., Lu, Y., Li, H., Hartley, R.: Learning to estimate hidden motions with global motion aggregation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9772–9781 (2021)

  30. Kay, W., et al.: The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)

  31. Klein, G., Murray, D.: Parallel tracking and mapping for small AR workspaces. In: 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, pp. 225–234. IEEE (2007)

  32. Klenk, S., Chui, J., Demmel, N., Cremers, D.: TUM-VIE: the TUM stereo visual-inertial event dataset. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 8601–8608. IEEE (2021)

  33. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, vol. 25 (2012)

  34. Li, Y., et al.: BlinkFlow: a dataset to push the limits of event-based optical flow estimation. arXiv preprint arXiv:2303.07716 (2023)

  35. Li, Y., et al.: DELTAR: depth estimation from a light-weight ToF sensor and RGB image. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13661, pp. 619–636. Springer, Cham (2022)

  36. Li, Y., et al.: Graph-based asynchronous event processing for rapid object recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 934–943 (2021)

  37. Lin, S., Ma, Y., Guo, Z., Wen, B.: DVS-voltmeter: stochastic process-based event simulator for dynamic vision sensors. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13667, pp. 578–593. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20071-7_34

  38. Liu, H., Lu, T., Xu, Y., Liu, J., Li, W., Chen, L.: CamLiFlow: bidirectional camera-LiDAR fusion for joint optical flow and scene flow estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5791–5801 (2022)

  39. Liu, X., et al.: Multi-modal neural radiance field for monocular dense SLAM with a light-weight ToF sensor. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1–11 (2023)

  40. Liu, Y.L., et al.: Single-image HDR reconstruction by learning to reverse the camera pipeline. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1651–1660 (2020)

  41. Luo, J., Huang, Z., Li, Y., Zhou, X., Zhang, G., Bao, H.: NIID-Net: adapting surface normal knowledge for intrinsic image decomposition in indoor scenes. IEEE Trans. Visual Comput. Graph. 26(12), 3434–3445 (2020)

  42. Mayer, N., et al.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4040–4048 (2016)

  43. Mehl, L., Schmalfuss, J., Jahedi, A., Nalivayko, Y., Bruhn, A.: Spring: a high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4981–4991 (2023)

  44. Messikommer, N., Fang, C., Gehrig, M., Scaramuzza, D.: Data-driven feature tracking for event cameras. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5642–5651 (2023)

  45. Milner, D., Goodale, M.: The Visual Brain in Action, vol. 27. OUP, Oxford (2006)

  46. Mueggler, E., Rebecq, H., Gallego, G., Delbruck, T., Scaramuzza, D.: The event-camera dataset and simulator: event-based data for pose estimation, visual odometry, and SLAM. Int. J. Robot. Res. 36(2), 142–149 (2017)

  47. Ni, J., et al.: PATS: patch area transportation with subdivision for local feature matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 17776–17786 (2023)

  48. Pan, L., Scheerlinck, C., Yu, X., Hartley, R., Liu, M., Dai, Y.: Bringing a blurry frame alive at high frame-rate with an event camera. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6820–6829 (2019)

  49. Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 724–732 (2016)

  50. Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)

  51. Rebecq, H., Gehrig, D., Scaramuzza, D.: ESIM: an open event camera simulator. In: Proceedings of the Conference on Robot Learning, pp. 969–982. PMLR (2018)

  52. Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: an efficient alternative to SIFT or SURF. In: 2011 International Conference on Computer Vision, pp. 2564–2571. IEEE (2011)

  53. Rueckauer, B., Delbruck, T.: Evaluation of event-based algorithms for optical flow with ground-truth from inertial measurement sensor. Front. Neurosci. 10, 176 (2016)

  54. Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104–4113 (2016)

  55. Shi, X., et al.: FlowFormer++: masked cost volume autoencoding for pretraining optical flow estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1599–1610 (2023)

  56. Teed, Z., Deng, J.: RAFT: recurrent all-pairs field transforms for optical flow. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12347, pp. 402–419. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58536-5_24

  57. Teed, Z., Deng, J.: RAFT-3D: scene flow using rigid-motion embeddings. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8375–8384 (2021)

  58. Wan, Z., Dai, Y., Mao, Y.: Learning dense and continuous optical flow from an event camera. IEEE Trans. Image Process. 31, 7237–7251 (2022)

  59. Wan, Z., Mao, Y., Zhang, J., Dai, Y.: RPEFlow: multimodal fusion of RGB-point cloud-event for joint optical flow and scene flow estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10030–10040 (2023)

  60. Xu, J., Liu, S., Vahdat, A., Byeon, W., Wang, X., De Mello, S.: Open-vocabulary panoptic segmentation with text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2955–2966 (2023)

  61. Yang, B., et al.: Hybrid3D: learning 3D hybrid features with point clouds and multi-view images for point cloud registration. Sci. China Inf. Sci. 66(7), 172101 (2023)

  62. Zheng, Y., Harley, A.W., Shen, B., Wetzstein, G., Guibas, L.J.: PointOdyssey: a large-scale synthetic dataset for long-term point tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 19855–19865 (2023)

  63. Zhu, A.Z., Thakur, D., Özaslan, T., Pfrommer, B., Kumar, V., Daniilidis, K.: The multivehicle stereo event camera dataset: an event camera dataset for 3D perception. IEEE Robot. Autom. Lett. 3(3), 2032–2039 (2018)

  64. Zhu, A.Z., Yuan, L., Chaney, K., Daniilidis, K.: EV-FlowNet: self-supervised optical flow estimation for event-based cameras. In: Kress-Gazit, H., Srinivasa, S.S., Howard, T., Atanasov, N. (eds.) Robotics: Science and Systems (2018)

Acknowledgment

This project was funded in part by the National Key R&D Program of China (Project 2022ZD0161100), by the Centre for Perceptual and Interactive Intelligence (CPII) Ltd under the Innovation and Technology Commission (ITC)'s InnoHK, by the Smart Traffic Fund (PSRI/76/2311/PR), and by RGC General Research Fund Project 14204021. Hongsheng Li is a PI of CPII under InnoHK. This work was also partially supported by the NSF of China (No. 61932003).

Author information

Corresponding authors

Correspondence to Guofeng Zhang or Hongsheng Li.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 13492 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Li, Y. et al. (2025). BlinkVision: A Benchmark for Optical Flow, Scene Flow and Point Tracking Estimation Using RGB Frames and Events. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15125. Springer, Cham. https://doi.org/10.1007/978-3-031-72855-6_2

  • DOI: https://doi.org/10.1007/978-3-031-72855-6_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72854-9

  • Online ISBN: 978-3-031-72855-6

  • eBook Packages: Computer Science, Computer Science (R0)
