
MarineInst: A Foundation Model for Marine Image Analysis with Instance Visual Description

Conference paper in Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Recent foundation models trained on tremendous amounts of data have shown great promise in a wide range of computer vision tasks and application domains. However, less attention has been paid to the marine realms, which in contrast cover the majority of our blue planet. The scarcity of labeled data is the most significant obstacle, and marine photographs exhibit appearances and content significantly different from general in-air images. Applying existing foundation models to marine visual analysis does not yield satisfactory performance, due not only to the data distribution shift but also to the intrinsic limitations of existing foundation models (e.g., lacking semantics, redundant mask generation, or being restricted to image-level scene understanding). In this work, we pursue both model and data approaches for understanding marine ecosystems. We introduce MarineInst, a foundation model for analyzing the marine realms through instance visual description: it outputs instance masks and captions for marine object instances. To train MarineInst, we acquire MarineInst20M, the largest marine image dataset to date, which contains a wide spectrum of marine images with high-quality semantic instance masks constructed from a mixture of human-annotated instance masks and model-generated instance masks produced by our automatic binary instance filtering procedure. To generate informative and detailed semantic instance captions, we use vision-language models to produce semantically rich captions at various granularities. Our model and dataset support a wide range of marine visual analysis tasks, from image-level scene understanding to regional mask-level instance understanding. More significantly, MarineInst exhibits strong generalization ability and the flexibility to support a wide range of downstream tasks with state-of-the-art performance, as demonstrated in Fig. 1. Project website: https://marineinst.hkustvgd.com.
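To make the two-stage procedure described in the abstract concrete, below is a minimal, hypothetical Python sketch of an instance visual description pipeline: candidate masks are generated, kept or discarded by a binary instance filter, and each surviving instance is captioned by a vision-language model. This is a reading aid, not the authors' released code; `propose_masks`, `is_instance`, and `caption_region` are assumed stand-ins for a SAM-style mask generator, the binary instance filter, and a BLIP-2-style regional captioner, respectively.

```python
# A minimal, hypothetical sketch of instance visual description,
# NOT the authors' released implementation.
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class InstanceDescription:
    mask: np.ndarray  # H x W boolean instance mask
    caption: str      # natural-language description of the masked instance


def describe_instances(
    image: np.ndarray,
    propose_masks: Callable[[np.ndarray], List[np.ndarray]],
    is_instance: Callable[[np.ndarray, np.ndarray], float],
    caption_region: Callable[[np.ndarray, np.ndarray], str],
    instance_threshold: float = 0.5,
) -> List[InstanceDescription]:
    """Mask proposal -> binary instance filtering -> instance captioning."""
    descriptions: List[InstanceDescription] = []
    for mask in propose_masks(image):
        # Binary instance filtering: keep a candidate mask only if it is
        # judged to cover a single semantic object instance (rather than
        # background, water column, or an object part).
        if is_instance(image, mask) < instance_threshold:
            continue
        # Caption the surviving instance with a vision-language model.
        descriptions.append(
            InstanceDescription(mask=mask, caption=caption_region(image, mask))
        )
    return descriptions


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; real use would plug in
    # a mask generator, a trained instance/non-instance classifier, and a
    # regional vision-language captioner.
    image = np.zeros((64, 64, 3), dtype=np.uint8)
    toy_mask = np.zeros((64, 64), dtype=bool)
    toy_mask[16:48, 16:48] = True
    results = describe_instances(
        image,
        propose_masks=lambda im: [toy_mask],
        is_instance=lambda im, m: 0.9,
        caption_region=lambda im, m: "a coral colony on a shallow reef",
    )
    for r in results:
        print(r.caption)
```

Passing the three model components as callables keeps the sketch agnostic to which mask generator, filter, or captioner is actually plugged in.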



Acknowledgment

We thank Yang Wu and Jianbo Shi for their valuable discussions and suggestions. We also thank all the volunteer students who helped with the annotations and evaluations. The work was partially supported by the Innovation and Technology Support Programme of the Innovation and Technology Fund (Ref: ITS/200/20FP), the Marine Conservation Enhancement Fund (MCEF20107 and MCEF23EG01), and an internal grant from HKUST (R9429). Binh-Son Hua is supported by Science Foundation Ireland under the SFI Frontiers for the Future Programme (22/FFP-P/11522).

Author information

Correspondence to Ziqiang Zheng.


Electronic Supplementary Material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 44363 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zheng, Z., Chen, Y., Zeng, H., Vu, T.-A., Hua, B.-S., Yeung, S.-K. (2025). MarineInst: A Foundation Model for Marine Image Analysis with Instance Visual Description. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15060. Springer, Cham. https://doi.org/10.1007/978-3-031-72627-9_14


  • DOI: https://doi.org/10.1007/978-3-031-72627-9_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72626-2

  • Online ISBN: 978-3-031-72627-9

  • eBook Packages: Computer Science; Computer Science (R0)
