
Improving visual question answering for bridge inspection by pre‐training with external data of image–text pairs

Published: 26 January 2024

Abstract

This paper explores the application of visual question answering (VQA) in bridge inspection using recent advancements in multimodal artificial intelligence (AI) systems. VQA involves an AI model providing natural language answers to questions about the content of an input image. However, applying VQA to bridge inspection poses challenges due to the high cost of creating training data that requires expert knowledge. To address this, we propose leveraging existing bridge inspection reports, which already include image–text pairs, as external knowledge to enhance VQA performance. Our approach involves training the model on a large collection of image–text pairs, followed by fine‐tuning it on a limited amount of training data specifically designed for the VQA task. The results demonstrate a significant improvement in VQA accuracy using this approach. These findings highlight the potential of AI models for VQA as valuable tools for assessing the condition of bridges.
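The abstract outlines a two-stage pipeline: pre-training a vision–language model on image–text pairs drawn from existing bridge inspection reports, then fine-tuning it on a small, purpose-built VQA dataset. The PyTorch sketch below is only a minimal, hypothetical illustration of that pattern (a CLIP-style contrastive pre-training step followed by fine-tuning an answer-classification head); the toy encoders, dimensions, and dummy batches are assumptions for illustration and do not reflect the architecture, data, or training details used in the paper.

# Hypothetical two-stage sketch; not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVisionLanguageModel(nn.Module):
    """Toy encoders standing in for a real vision-language backbone."""
    def __init__(self, vocab_size=1000, embed_dim=256, num_answers=50):
        super().__init__()
        # Stand-in for a ViT/CNN image encoder.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, embed_dim))
        # Stand-in for a transformer text encoder (mean-pooled token embeddings).
        self.text_encoder = nn.EmbeddingBag(vocab_size, embed_dim)
        # Answer classifier used only in the VQA fine-tuning stage.
        self.answer_head = nn.Linear(2 * embed_dim, num_answers)

    def encode(self, images, token_ids):
        v = F.normalize(self.image_encoder(images), dim=-1)
        t = F.normalize(self.text_encoder(token_ids), dim=-1)
        return v, t

def contrastive_loss(v, t, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss over the in-batch similarity matrix."""
    logits = (v @ t.T) / temperature
    targets = torch.arange(v.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

model = TinyVisionLanguageModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Stage 1: pre-train on (photo, remark) pairs harvested from inspection reports.
images = torch.randn(8, 3, 224, 224)          # dummy batch of report photos
captions = torch.randint(0, 1000, (8, 16))    # dummy tokenised report text
v, t = model.encode(images, captions)
contrastive_loss(v, t).backward()
optimizer.step()
optimizer.zero_grad()

# Stage 2: fine-tune on a small VQA set of (image, question, answer-label) triples.
questions = torch.randint(0, 1000, (8, 12))   # dummy tokenised questions
answers = torch.randint(0, 50, (8,))          # answer class indices
v, q = model.encode(images, questions)
vqa_logits = model.answer_head(torch.cat([v, q], dim=-1))
F.cross_entropy(vqa_logits, answers).backward()
optimizer.step()
optimizer.zero_grad()

In practice each stage would iterate over real data loaders, and the backbone would be a pre-trained vision–language transformer rather than these toy modules.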




            Published In

            Computer-Aided Civil and Infrastructure Engineering, Volume 39, Issue 3
            1 February 2024
            157 pages
            EISSN: 1467-8667
            DOI: 10.1111/mice.v39.3
            This is an open access article under the terms of the Creative Commons Attribution‐NonCommercial‐NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non‐commercial and no modifications or adaptations are made.

            Publisher

            John Wiley & Sons, Inc.

            United States

            Publication History

            Published: 26 January 2024

            Qualifiers

            • Research-article


            Cited By

            • (2025) A structure‐oriented loss function for automated semantic segmentation of bridge point clouds. Computer-Aided Civil and Infrastructure Engineering, 40(6), 801–816. https://doi.org/10.1111/mice.13422 (online publication date: 13 February 2025)
            • (2025) Earthquake damage detection and level classification method for wooden houses based on convolutional neural networks and onsite photos. Computer-Aided Civil and Infrastructure Engineering, 40(5), 674–694. https://doi.org/10.1111/mice.13224 (online publication date: 4 February 2025)
            • (2024) Self‐training with Bayesian neural networks and spatial priors for unsupervised domain adaptation in crack segmentation. Computer-Aided Civil and Infrastructure Engineering, 39(17), 2642–2661. https://doi.org/10.1111/mice.13315 (online publication date: 29 July 2024)
            • (2024) BridgeCLIP: Automatic bridge inspection by utilizing vision-language model. Pattern Recognition, 61–76. https://doi.org/10.1007/978-3-031-78447-7_5 (online publication date: 1 December 2024)
