
Improving visual question answering for bridge inspection by pre‐training with external data of image–text pairs

Published: 26 January 2024

Abstract

This paper explores the application of visual question answering (VQA) in bridge inspection using recent advancements in multimodal artificial intelligence (AI) systems. VQA involves an AI model providing natural language answers to questions about the content of an input image. However, applying VQA to bridge inspection poses challenges due to the high cost of creating training data that requires expert knowledge. To address this, we propose leveraging existing bridge inspection reports, which already include image–text pairs, as external knowledge to enhance VQA performance. Our approach involves training the model on a large collection of image–text pairs, followed by fine‐tuning it on a limited amount of training data specifically designed for the VQA task. The results demonstrate a significant improvement in VQA accuracy using this approach. These findings highlight the potential of AI models for VQA as valuable tools for assessing the condition of bridges.
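The abstract outlines a two-stage pipeline: pre-training a vision–language model on image–text pairs drawn from existing bridge inspection reports, then fine-tuning it on a small, purpose-built VQA dataset. The PyTorch sketch below is only a minimal, hypothetical illustration of that pattern (a CLIP-style contrastive pre-training step followed by fine-tuning an answer-classification head); the toy encoders, dimensions, and dummy batches are assumptions for illustration and do not reflect the architecture, data, or training details used in the paper.

# Hypothetical two-stage sketch; not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVisionLanguageModel(nn.Module):
    """Toy encoders standing in for a real vision-language backbone."""
    def __init__(self, vocab_size=1000, embed_dim=256, num_answers=50):
        super().__init__()
        # Stand-in for a ViT/CNN image encoder.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, embed_dim))
        # Stand-in for a transformer text encoder (mean-pooled token embeddings).
        self.text_encoder = nn.EmbeddingBag(vocab_size, embed_dim)
        # Answer classifier used only in the VQA fine-tuning stage.
        self.answer_head = nn.Linear(2 * embed_dim, num_answers)

    def encode(self, images, token_ids):
        v = F.normalize(self.image_encoder(images), dim=-1)
        t = F.normalize(self.text_encoder(token_ids), dim=-1)
        return v, t

def contrastive_loss(v, t, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss over the in-batch similarity matrix."""
    logits = (v @ t.T) / temperature
    targets = torch.arange(v.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

model = TinyVisionLanguageModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Stage 1: pre-train on (photo, remark) pairs harvested from inspection reports.
images = torch.randn(8, 3, 224, 224)          # dummy batch of report photos
captions = torch.randint(0, 1000, (8, 16))    # dummy tokenised report text
v, t = model.encode(images, captions)
contrastive_loss(v, t).backward()
optimizer.step()
optimizer.zero_grad()

# Stage 2: fine-tune on a small VQA set of (image, question, answer-label) triples.
questions = torch.randint(0, 1000, (8, 12))   # dummy tokenised questions
answers = torch.randint(0, 50, (8,))          # answer class indices
v, q = model.encode(images, questions)
vqa_logits = model.answer_head(torch.cat([v, q], dim=-1))
F.cross_entropy(vqa_logits, answers).backward()
optimizer.step()
optimizer.zero_grad()

In practice each stage would iterate over real data loaders, and the backbone would be a pre-trained vision–language transformer rather than these toy modules.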




            Published In

            Computer-Aided Civil and Infrastructure Engineering, Volume 39, Issue 3
            1 February 2024
            157 pages
            EISSN: 1467-8667
            DOI: 10.1111/mice.v39.3
            This is an open access article under the terms of the Creative Commons Attribution‐NonCommercial‐NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non‐commercial and no modifications or adaptations are made.

            Publisher

            John Wiley & Sons, Inc.

            United States

            Publication History

            Published: 26 January 2024

            Qualifiers

            • Research-article


            Cited By

            • (2025) A structure‐oriented loss function for automated semantic segmentation of bridge point clouds. Computer-Aided Civil and Infrastructure Engineering, 40(6), 801–816. https://doi.org/10.1111/mice.13422 (online publication date: 13 February 2025)
            • (2025) Earthquake damage detection and level classification method for wooden houses based on convolutional neural networks and onsite photos. Computer-Aided Civil and Infrastructure Engineering, 40(5), 674–694. https://doi.org/10.1111/mice.13224 (online publication date: 4 February 2025)
            • (2024) Self‐training with Bayesian neural networks and spatial priors for unsupervised domain adaptation in crack segmentation. Computer-Aided Civil and Infrastructure Engineering, 39(17), 2642–2661. https://doi.org/10.1111/mice.13315 (online publication date: 29 July 2024)
            • (2024) BridgeCLIP: Automatic bridge inspection by utilizing vision-language model. Pattern Recognition, 61–76. https://doi.org/10.1007/978-3-031-78447-7_5 (online publication date: 1 December 2024)
