CMAF: Cross-Modal Augmentation via Fusion for Underwater Acoustic Image Recognition

Authors: Li-Hsiang Shen, Hong-Han Shuai, Kai-Ten Feng

ACM Transactions on Multimedia Computing, Communications and Applications, Volume 20, Issue 5

Article No.: 124, Pages 1 - 25

https://doi.org/10.1145/3636427

Published: 11 January 2024

Abstract

Underwater image recognition is crucial for underwater detection applications, and fish classification has become an active research area in recent years. Existing image classification models are usually trained on data collected from terrestrial environments, which makes them unsuitable for underwater images: underwater data are hard to identify because their features are incomplete and noisy. To address this, we propose a cross-modal augmentation via fusion (CMAF) framework for acoustic-based fish image classification. Our approach separates the process into two branches, a visual modality and a sonar signal modality, where the latter provides a complementary characteristic feature. We augment the visual modality, design an attention-based fusion module, and adopt a masking-based training strategy with a mask-based focal loss to improve the learning of local features and address the class imbalance problem. Our proposed method outperforms state-of-the-art methods. Our source code is available at https://github.com/WilkinsYang/CMAF.
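
To make the class-imbalance idea concrete, the following is a minimal sketch of the standard focal loss for a single sample, not the mask-based variant proposed in the paper, whose exact form the abstract does not specify. The function names `softmax` and `focal_loss`, the default `gamma=2.0`, and the optional per-class weights `alpha` are assumptions chosen for illustration.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(logits, target, gamma=2.0, alpha=None):
    """Standard focal loss for one sample: -alpha_t * (1 - p_t)^gamma * log(p_t).

    gamma down-weights well-classified samples; gamma=0 gives cross-entropy.
    alpha is an optional list of per-class weights (hypothetical values).
    """
    p_t = max(softmax(logits)[target], 1e-12)  # clamp to avoid log(0)
    weight = 1.0 if alpha is None else alpha[target]
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct sample contributes far less than a misclassified one.
easy = focal_loss([4.0, 0.0, 0.0], target=0)
hard = focal_loss([0.0, 4.0, 0.0], target=0)
ce_easy = focal_loss([4.0, 0.0, 0.0], target=0, gamma=0.0)
print(easy < ce_easy < hard)
```

Because the modulating factor (1 - p_t)^gamma shrinks the loss of easy samples, frequent and easily recognized classes dominate training less, which is the general motivation for using a focal-style loss on imbalanced data.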

Index Terms

Computing methodologies → Artificial intelligence → Computer vision → Computer vision problems → Object recognition

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 20, Issue 5

May 2024

650 pages

ISSN:1551-6857

EISSN:1551-6865

DOI:10.1145/3613634

Editor:
Abdulmotaleb El Saddik
Mohamed Bin Zayed University of Artificial Intelligence, UAE and University of Ottawa, Canada

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 January 2024

Online AM: 06 December 2023

Accepted: 27 November 2023

Revised: 17 September 2023

Received: 01 April 2023

Published in TOMM Volume 20, Issue 5
