research-article

HDA-Net: Horizontal Deformable Attention Network for Stereo Matching

Authors:

Anlong Ming

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

Pages 32 - 40

https://doi.org/10.1145/3474085.3475273

Published: 17 October 2021

Abstract

Stereo matching is a fundamental and challenging task with applications in autonomous driving, dense reconstruction, and other depth-related tasks. Contextual information with discriminative features is crucial for accurate stereo matching in ill-posed regions (textureless areas, occlusions, etc.). In this paper, we propose an efficient horizontal attention module that adaptively captures global correspondence clues. Compared with the popular non-local attention, our horizontal attention is more effective for stereo matching, giving better performance with lower computation and memory consumption. We further introduce a deformable module to refine the contextual information in disparity-discontinuous areas such as object boundaries. A learning-based method is adopted to construct the cost volume by concatenating the features of the two branches. To provide an explicit similarity measure that guides the learning-based volume toward a more reasonable unimodal matching cost distribution, we additionally combine the learning-based volume with an improved zero-centered group-wise correlation volume. Finally, we regularize the 4D joint cost volume with a 3D CNN module and generate the final output by disparity regression. Experimental results show that the proposed HDA-Net achieves state-of-the-art performance on the Scene Flow dataset and competitive performance on the KITTI datasets compared with related networks.
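To make two of the ideas in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that horizontal attention means self-attention restricted to positions in the same image row (in rectified stereo pairs, correspondences lie on the same row), with a plain dot-product affinity, identity projections, and a residual connection. It also assumes that disparity regression takes the commonly used soft-argmin form. The function names `horizontal_attention_row`, `disparity_regression`, and `affinity_entries` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def horizontal_attention_row(row):
    """Self-attention restricted to a single image row (assumed form).

    row: list of W feature vectors, each a list of C floats.
    Every position attends only to positions in the same row, so the
    affinity matrix for this row is W x W.
    """
    out = []
    for q in row:
        # Dot-product affinity between the query and every key in the row.
        weights = softmax([sum(a * b for a, b in zip(q, k)) for k in row])
        # Weighted sum of the row's features (values are the features).
        ctx = [sum(w * v[c] for w, v in zip(weights, row))
               for c in range(len(q))]
        # Residual connection: original feature plus attended context.
        out.append([x + y for x, y in zip(q, ctx)])
    return out


def affinity_entries(h, w):
    """Affinity-matrix sizes: (non-local attention, row-wise attention)."""
    return (h * w) ** 2, h * w * w


def disparity_regression(costs):
    """Soft-argmin (assumed form): expected disparity index under a
    softmax over negated matching costs for one pixel."""
    probs = softmax([-c for c in costs])
    return sum(d * p for d, p in enumerate(probs))
```

Under these assumptions, a feature map of height H and width W needs H·W·W affinity entries for row-wise attention instead of (H·W)² for non-local attention, which is one way to read the abstract's claim of lower computation and memory consumption. For a cost vector such as `[5.0, 0.0, 5.0]`, `disparity_regression` returns approximately 1.0, the index of the single cost minimum.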


Cited By

Zhu Y, Pei S, Liu B, Gao J. (2025). AP-Net: Attention-fused volume and progressive aggregation for accurate stereo matching. Neurocomputing, 612 (128685). Online publication date: Jan-2025. https://doi.org/10.1016/j.neucom.2024.128685

Wang S, Seo J, Jeon H, Lim S, Park S, Lim Y. (2023). Horizontal Attention Based Generation Module for Unsupervised Domain Adaptive Stereo Matching. IEEE Robotics and Automation Letters, 8:10 (6779-6786). Online publication date: Oct-2023. https://doi.org/10.1109/LRA.2023.3313009

Index Terms

HDA-Net: Horizontal Deformable Attention Network for Stereo Matching
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia

October 2021

5796 pages

ISBN:9781450386517

DOI:10.1145/3474085

General Chairs: Heng Tao Shen (University of Electronic Science and Technology of China, China), Yueting Zhuang (Zhejiang University, China), John R. Smith (IBM, USA)

Program Chairs: Yang Yang (University of Electronic Science and Technology of China, China), Pablo Cesar (CWI & TU Delft, The Netherlands), Florian Metze (FACEBOOK, Inc., USA), Balakrishnan Prabhakaran (University of Texas at Dallas, USA)

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

National Natural Science Foundation of China

Conference

MM '21

Sponsor:

SIGMM

MM '21: ACM Multimedia Conference

October 20 - 24, 2021

Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
