research-article

CMAF: Cross-Modal Augmentation via Fusion for Underwater Acoustic Image Recognition

Published: 11 January 2024

    Abstract

    Underwater image recognition is crucial for underwater detection applications, and fish classification has emerged as an active research area in recent years. Existing image classification models are usually trained on data collected in terrestrial environments and are therefore ill-suited to underwater images, whose features are often incomplete and noisy. To address this, we propose a cross-modal augmentation via fusion (CMAF) framework for acoustic-based fish image classification. Our approach splits processing into two branches, a visual modality and a sonar signal modality, where the latter provides complementary characteristic features. We augment the visual modality, design an attention-based fusion module, and adopt a masking-based training strategy with a mask-based focal loss to improve the learning of local features and address the class imbalance problem. The proposed method outperforms state-of-the-art methods. Our source code is available at https://github.com/WilkinsYang/CMAF.
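
The abstract names a mask-based focal loss for handling class imbalance but does not give its formula. As background only, the sketch below implements the standard multi-class focal loss of Lin et al. (2017), FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), in plain Python. It is not the CMAF loss: the masking component is not reproduced, and the function names, the default `gamma=2.0`, and the toy logits are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Softmax over one list of logits, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(batch_logits, targets, gamma=2.0, alpha=None):
    """Mean standard focal loss over a batch (illustrative, not the CMAF variant).

    batch_logits: list of per-sample logit lists
    targets:      list of integer class labels
    gamma:        focusing parameter; gamma=0 reduces to cross-entropy
    alpha:        optional per-class weights (hypothetical values chosen by the user)
    """
    losses = []
    for logits, t in zip(batch_logits, targets):
        # Probability assigned to the true class, clipped to avoid log(0).
        p_t = max(softmax(logits)[t], 1e-12)
        # Confident (easy) samples get a weight close to zero.
        weight = (1.0 - p_t) ** gamma
        if alpha is not None:
            weight *= alpha[t]
        losses.append(-weight * math.log(p_t))
    return sum(losses) / len(losses)


# Toy batch: one confidently correct sample and one uncertain sample.
toy_logits = [[4.0, 0.0, 0.0], [0.5, 0.4, 0.3]]
toy_labels = [0, 0]
ce = focal_loss(toy_logits, toy_labels, gamma=0.0)  # plain cross-entropy
fl = focal_loss(toy_logits, toy_labels, gamma=2.0)  # easy sample down-weighted
```

With `gamma=0.0` the modulating factor is 1 and the function returns ordinary cross-entropy. Raising `gamma` shrinks the contribution of well-classified samples, which is why focal-style losses are commonly used when a few classes dominate the training set.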

    Cited By

    • (2024) Pseudo Label Association and Prototype-Based Invariant Learning for Semi-Supervised NIR-VIS Face Recognition. IEEE Transactions on Image Processing 33 (1448-1463). DOI: 10.1109/TIP.2024.3364530. Online publication date: 1-Jan-2024
    • (2024) Unsupervised NIR-VIS Face Recognition via Homogeneous-to-Heterogeneous Learning and Residual-Invariant Enhancement. IEEE Transactions on Information Forensics and Security 19 (2112-2126). DOI: 10.1109/TIFS.2023.3346176. Online publication date: 1-Jan-2024

      Published In

      ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 20, Issue 5
      May 2024
      650 pages
      ISSN:1551-6857
      EISSN:1551-6865
      DOI:10.1145/3613634
      Editor: Abdulmotaleb El Saddik

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 11 January 2024
      Online AM: 06 December 2023
      Accepted: 27 November 2023
      Revised: 17 September 2023
      Received: 01 April 2023
      Published in TOMM Volume 20, Issue 5

      Author Tags

      1. Neural networks
      2. sonar image
      3. multi-modal fusion
      4. class imbalance

      Qualifiers

      • Research-article

      Bibliometrics

      Article Metrics

      • Downloads (last 12 months): 313
      • Downloads (last 6 weeks): 40
      Reflects downloads up to 27 Jul 2024
