Abstract
Recent advances in event-based vision suggest that event cameras complement traditional frame cameras: they provide continuous observation free of frame-rate limitations and offer high dynamic range, properties well suited to correspondence tasks such as optical flow and point tracking. However, comprehensive benchmarks for correspondence tasks that cover both event data and images are still lacking. To fill this gap, we propose BlinkVision, a large-scale and diverse benchmark with rich modalities and dense correspondence annotations. BlinkVision has several appealing properties: 1) Rich modalities: it encompasses both event data and RGB images. 2) Rich annotations: it provides dense per-pixel annotations covering optical flow, scene flow, and point tracking. 3) Large vocabulary: it incorporates 410 everyday categories, sharing common classes with widely used 2D and 3D datasets such as LVIS and ShapeNet. 4) Naturalism: it provides photorealistic data and covers a variety of naturalistic factors such as camera shake and deformation. BlinkVision enables extensive benchmarking of three correspondence tasks (optical flow, point tracking, and scene flow estimation) for both image-based and event-based methods, leading to new observations, practices, and insights for future research. The benchmark website is https://www.blinkvision.net/.
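For readers less familiar with how dense correspondence predictions are scored, the sketch below illustrates the standard end-point-error (EPE) metric commonly used for optical flow evaluation. This is a generic NumPy illustration, not BlinkVision's official evaluation tooling; the function name `endpoint_error` and the array layout are assumptions made for the example.

```python
import numpy as np

def endpoint_error(pred_flow, gt_flow, valid_mask=None):
    """Mean end-point error (EPE) between predicted and ground-truth flow.

    pred_flow, gt_flow: (H, W, 2) arrays of per-pixel (dx, dy) displacements.
    valid_mask: optional (H, W) boolean array marking pixels that have
    ground truth (e.g. excluding occluded or out-of-frame pixels).
    """
    # Per-pixel Euclidean distance between predicted and true displacement.
    epe = np.linalg.norm(pred_flow - gt_flow, axis=-1)
    if valid_mask is not None:
        epe = epe[valid_mask]
    return float(epe.mean())

# Toy usage: a 4x4 flow field where the prediction is off by one pixel in x.
gt = np.zeros((4, 4, 2), dtype=np.float32)
pred = gt.copy()
pred[..., 0] += 1.0
print(endpoint_error(pred, gt))  # 1.0
```

Point tracking and scene flow are typically scored with analogous per-point distance metrics, e.g. position accuracy within a pixel threshold for tracking, or 3D end-point error for scene flow.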
Y. Li, Y. Shen and Z. Huang—Contributed equally to this work.
References
ActorCore. https://actorcore.reallusion.com/. Accessed 17 Nov 2023
Blender. https://www.blender.org/. Accessed 17 Nov 2023
Evermotion Archinteriors Collection. https://evermotion.org/. Accessed 11 Nov 2023
Mixamo. https://www.mixamo.com/. Accessed 17 Nov 2023
Alzugaray, I., Chli, M.: Asynchronous multi-hypothesis tracking of features with event cameras. In: 2019 International Conference on 3D Vision (3DV), pp. 269–278. IEEE (2019)
Alzugaray, I., Chli, M.: HASTE: multi-hypothesis asynchronous speeded-up tracking of events. In: 31st British Machine Vision Virtual Conference (BMVC 2020), p. 744. ETH Zurich, Institute of Robotics and Intelligent Systems (2020)
Barron, J.L., Fleet, D.J., Beauchemin, S.S.: Performance of optical flow techniques. Int. J. Comput. Vis. 12, 43–77 (1994)
Bian, W., Huang, Z., Shi, X., Dong, Y., Li, Y., Li, H.: Context-TAP: tracking any point demands spatial context features. arXiv preprint arXiv:2306.02000 (2023)
Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012 Part VI. LNCS, vol. 7577, pp. 611–625. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33783-3_44
Chang, A.X., et al.: ShapeNet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012 (2015)
Ding, Z., et al.: Spatio-temporal recurrent networks for event-based optical flow estimation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 525–533 (2022)
Doersch, C., et al.: TAP-Vid: a benchmark for tracking any point in a video. In: Advances in Neural Information Processing Systems, vol. 35, pp. 13610–13626 (2022)
Gallego, G., et al.: Event-based vision: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 44(1), 154–180 (2020)
Gehrig, D., Gehrig, M., Hidalgo-Carrió, J., Scaramuzza, D.: Video to events: recycling video datasets for event cameras. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3586–3595 (2020)
Gehrig, M., Aarents, W., Gehrig, D., Scaramuzza, D.: DSEC: a stereo event camera dataset for driving scenarios. IEEE Robot. Autom. Lett. 6(3), 4947–4954 (2021)
Gehrig, M., Millhäusler, M., Gehrig, D., Scaramuzza, D.: E-RAFT: dense optical flow from event cameras. In: Proceedings of the International Conference on 3D Vision, pp. 197–206. IEEE (2021)
Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. IEEE (2012)
Greff, K., et al.: Kubric: a scalable dataset generator. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3749–3761 (2022)
Grossberg, M.D., Nayar, S.K.: What is the space of camera response functions? In: 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings, vol. 2, pp. II–602. IEEE (2003)
Gupta, A., Dollar, P., Girshick, R.: LVIS: a dataset for large vocabulary instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5356–5364 (2019)
Harley, A.W., Fang, Z., Fragkiadaki, K.: Particle video revisited: tracking through occlusions using point trajectories. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13682, pp. 59–75. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20047-2_4
Hidalgo-Carrió, J., Gallego, G., Scaramuzza, D.: Event-aided direct sparse odometry. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5781–5790 (2022)
Hu, J., et al.: CG-SLAM: efficient dense RGB-D SLAM in a consistent uncertainty-aware 3D gaussian field. arXiv preprint arXiv:2403.16095 (2024)
Hu, J., Mao, M., Bao, H., Zhang, G., Cui, Z.: CP-SLAM: collaborative neural point-based SLAM system. In: Advances in Neural Information Processing Systems, vol. 36 (2024)
Huang, Z., et al.: NeuralMarker: a framework for learning general marker correspondence. ACM Trans. Graph. (TOG) 41(6), 1–10 (2022)
Huang, Z., et al.: FlowFormer: a transformer architecture for optical flow. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13677, pp. 668–685. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19790-1_40
Huang, Z., et al.: FlowFormer: a transformer architecture and its masked cost volume autoencoding for optical flow. arXiv preprint arXiv:2306.05442 (2023)
Huang, Z., et al.: VS-Net: voting with segmentation for visual localization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6101–6111 (2021)
Jiang, S., Campbell, D., Lu, Y., Li, H., Hartley, R.: Learning to estimate hidden motions with global motion aggregation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9772–9781 (2021)
Kay, W., et al.: The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
Klein, G., Murray, D.: Parallel tracking and mapping for small AR workspaces. In: 2007 6th IEEE and ACM International Symposium on Mixed and Augmented reality, pp. 225–234. IEEE (2007)
Klenk, S., Chui, J., Demmel, N., Cremers, D.: TUM-VIE: the TUM stereo visual-inertial event dataset. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 8601–8608. IEEE (2021)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, vol. 25 (2012)
Li, Y., et al.: BlinkFlow: a dataset to push the limits of event-based optical flow estimation. arXiv preprint arXiv:2303.07716 (2023)
Li, Y., et al.: DELTAR: depth estimation from a light-weight ToF sensor and RGB image. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13661, pp. 619–636. Springer, Cham (2022)
Li, Y., et al.: Graph-based asynchronous event processing for rapid object recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 934–943 (2021)
Lin, S., Ma, Y., Guo, Z., Wen, B.: DVS-voltmeter: stochastic process-based event simulator for dynamic vision sensors. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13667, pp. 578–593. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-20071-7_34
Liu, H., Lu, T., Xu, Y., Liu, J., Li, W., Chen, L.: CamLiFlow: bidirectional camera-LiDAR fusion for joint optical flow and scene flow estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5791–5801 (2022)
Liu, X., et al.: Multi-modal neural radiance field for monocular dense slam with a light-weight ToF sensor. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1–11 (2023)
Liu, Y.L., et al.: Single-image HDR reconstruction by learning to reverse the camera pipeline. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1651–1660 (2020)
Luo, J., Huang, Z., Li, Y., Zhou, X., Zhang, G., Bao, H.: NIID-Net: adapting surface normal knowledge for intrinsic image decomposition in indoor scenes. IEEE Trans. Visual Comput. Graph. 26(12), 3434–3445 (2020)
Mayer, N., et al.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4040–4048 (2016)
Mehl, L., Schmalfuss, J., Jahedi, A., Nalivayko, Y., Bruhn, A.: Spring: a high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4981–4991 (2023)
Messikommer, N., Fang, C., Gehrig, M., Scaramuzza, D.: Data-driven feature tracking for event cameras. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5642–5651 (2023)
Milner, D., Goodale, M.: The Visual Brain in Action, vol. 27. OUP, Oxford (2006)
Mueggler, E., Rebecq, H., Gallego, G., Delbruck, T., Scaramuzza, D.: The event-camera dataset and simulator: event-based data for pose estimation, visual odometry, and SLAM. Int. J. Robot. Res. 36(2), 142–149 (2017)
Ni, J., et al.: PATS: patch area transportation with subdivision for local feature matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 17776–17786 (2023)
Pan, L., Scheerlinck, C., Yu, X., Hartley, R., Liu, M., Dai, Y.: Bringing a blurry frame alive at high frame-rate with an event camera. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6820–6829 (2019)
Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 724–732 (2016)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Rebecq, H., Gehrig, D., Scaramuzza, D.: ESIM: an open event camera simulator. In: Proceedings of the Conference on Robot Learning, pp. 969–982. PMLR (2018)
Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: an efficient alternative to SIFT or SURF. In: 2011 International Conference on Computer Vision, pp. 2564–2571. IEEE (2011)
Rueckauer, B., Delbruck, T.: Evaluation of event-based algorithms for optical flow with ground-truth from inertial measurement sensor. Front. Neurosci. 10, 176 (2016)
Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104–4113 (2016)
Shi, X., et al.: FlowFormer++: masked cost volume autoencoding for pretraining optical flow estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1599–1610 (2023)
Teed, Z., Deng, J.: RAFT: recurrent all-pairs field transforms for optical flow. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12347, pp. 402–419. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58536-5_24
Teed, Z., Deng, J.: RAFT-3D: scene flow using rigid-motion embeddings. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8375–8384 (2021)
Wan, Z., Dai, Y., Mao, Y.: Learning dense and continuous optical flow from an event camera. IEEE Trans. Image Process. 31, 7237–7251 (2022)
Wan, Z., Mao, Y., Zhang, J., Dai, Y.: RPEFlow: multimodal fusion of RGB-PointCloud-Event for joint optical flow and scene flow estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10030–10040 (2023)
Xu, J., Liu, S., Vahdat, A., Byeon, W., Wang, X., De Mello, S.: Open-vocabulary panoptic segmentation with text-to-image diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2955–2966 (2023)
Yang, B., et al.: Hybrid3D: learning 3D hybrid features with point clouds and multi-view images for point cloud registration. Sci. China Inf. Sci. 66(7), 172101 (2023)
Zheng, Y., Harley, A.W., Shen, B., Wetzstein, G., Guibas, L.J.: PointOdyssey: a large-scale synthetic dataset for long-term point tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 19855–19865 (2023)
Zhu, A.Z., Thakur, D., Özaslan, T., Pfrommer, B., Kumar, V., Daniilidis, K.: The multivehicle stereo event camera dataset: an event camera dataset for 3D perception. IEEE Robot. Autom. Lett. 3(3), 2032–2039 (2018)
Zhu, A.Z., Yuan, L., Chaney, K., Daniilidis, K.: EV-FlowNet: self-supervised optical flow estimation for event-based cameras. In: Kress-Gazit, H., Srinivasa, S.S., Howard, T., Atanasov, N. (eds.) Robotics: Science and Systems (2018)
Acknowledgment
This project was funded in part by National Key R&D Program of China Project 2022ZD0161100, by the Centre for Perceptual and Interactive Intelligence (CPII) Ltd under the Innovation and Technology Commission (ITC)’s InnoHK, by Smart Traffic Fund PSRI/76/2311/PR, and by RGC General Research Fund Project 14204021. Hongsheng Li is a PI of CPII under the InnoHK. This work was also partially supported by the NSF of China (No. 61932003).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, Y. et al. (2025). BlinkVision: A Benchmark for Optical Flow, Scene Flow and Point Tracking Estimation Using RGB Frames and Events. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15125. Springer, Cham. https://doi.org/10.1007/978-3-031-72855-6_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72854-9
Online ISBN: 978-3-031-72855-6