
MarineInst: A Foundation Model for Marine Image Analysis with Instance Visual Description

Conference paper in Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Recent foundation models trained on tremendous amounts of data have shown great promise in a wide range of computer vision tasks and application domains. However, less attention has been paid to the marine realms, which in contrast cover the majority of our blue planet. The scarcity of labeled data is the most significant obstacle, and marine photographs exhibit appearances and content significantly different from general in-air images. Applying existing foundation models to marine visual analysis does not yield satisfactory performance, due not only to the data distribution shift but also to the intrinsic limitations of existing foundation models (e.g., lacking semantics, redundant mask generation, or being restricted to image-level scene understanding). In this work, we pursue both model and data approaches for understanding marine ecosystems. We introduce MarineInst, a foundation model for analyzing the marine realms through instance visual description: it outputs instance masks and captions for marine object instances. To train MarineInst, we acquire MarineInst20M, the largest marine image dataset to date, which contains a wide spectrum of marine images with high-quality semantic instance masks constructed from a mixture of human-annotated instance masks and model-generated instance masks produced by our automatic binary instance filtering procedure. To generate informative and detailed semantic instance captions, we use vision-language models to produce semantically rich captions at various granularities. Our model and dataset support a wide range of marine visual analysis tasks, from image-level scene understanding to regional mask-level instance understanding. More significantly, MarineInst exhibits strong generalization ability and the flexibility to support a wide range of downstream tasks with state-of-the-art performance, as demonstrated in Fig. 1. Project website: https://marineinst.hkustvgd.com.
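To make the two-stage procedure described in the abstract concrete, below is a minimal, hypothetical Python sketch of an instance visual description pipeline: candidate masks are generated, kept or discarded by a binary instance filter, and each surviving instance is captioned by a vision-language model. This is a reading aid, not the authors' released code; `propose_masks`, `is_instance`, and `caption_region` are assumed stand-ins for a SAM-style mask generator, the binary instance filter, and a BLIP-2-style regional captioner, respectively.

```python
# A minimal, hypothetical sketch of instance visual description,
# NOT the authors' released implementation.
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class InstanceDescription:
    mask: np.ndarray  # H x W boolean instance mask
    caption: str      # natural-language description of the masked instance


def describe_instances(
    image: np.ndarray,
    propose_masks: Callable[[np.ndarray], List[np.ndarray]],
    is_instance: Callable[[np.ndarray, np.ndarray], float],
    caption_region: Callable[[np.ndarray, np.ndarray], str],
    instance_threshold: float = 0.5,
) -> List[InstanceDescription]:
    """Mask proposal -> binary instance filtering -> instance captioning."""
    descriptions: List[InstanceDescription] = []
    for mask in propose_masks(image):
        # Binary instance filtering: keep a candidate mask only if it is
        # judged to cover a single semantic object instance (rather than
        # background, water column, or an object part).
        if is_instance(image, mask) < instance_threshold:
            continue
        # Caption the surviving instance with a vision-language model.
        descriptions.append(
            InstanceDescription(mask=mask, caption=caption_region(image, mask))
        )
    return descriptions


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; real use would plug in
    # a mask generator, a trained instance/non-instance classifier, and a
    # regional vision-language captioner.
    image = np.zeros((64, 64, 3), dtype=np.uint8)
    toy_mask = np.zeros((64, 64), dtype=bool)
    toy_mask[16:48, 16:48] = True
    results = describe_instances(
        image,
        propose_masks=lambda im: [toy_mask],
        is_instance=lambda im, m: 0.9,
        caption_region=lambda im, m: "a coral colony on a shallow reef",
    )
    for r in results:
        print(r.caption)
```

Passing the three model components as callables keeps the sketch agnostic to which mask generator, filter, or captioner is actually plugged in.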



Acknowledgment

We thank Yang Wu and Jianbo Shi for their valuable discussions and suggestions. We also thank all the volunteer students who helped with the annotations and evaluations. The work was partially supported by the Innovation and Technology Support Programme of the Innovation and Technology Fund (Ref: ITS/200/20FP), the Marine Conservation Enhancement Fund (MCEF20107 and MCEF23EG01), and an internal grant from HKUST (R9429). Binh-Son Hua is supported by Science Foundation Ireland under the SFI Frontiers for the Future Programme (22/FFP-P/11522).

Author information

Correspondence to Ziqiang Zheng.


Electronic Supplementary Material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 44363 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zheng, Z., Chen, Y., Zeng, H., Vu, T.-A., Hua, B.-S., Yeung, S.-K. (2025). MarineInst: A Foundation Model for Marine Image Analysis with Instance Visual Description. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15060. Springer, Cham. https://doi.org/10.1007/978-3-031-72627-9_14


  • DOI: https://doi.org/10.1007/978-3-031-72627-9_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72626-2

  • Online ISBN: 978-3-031-72627-9

  • eBook Packages: Computer Science; Computer Science (R0)
