Abstract
Recent foundation models trained on data at tremendous scale have shown great promise in a wide range of computer vision tasks and application domains. However, far less attention has been paid to the marine realms, which cover the majority of our blue planet. The scarcity of labeled data is the most critical obstacle, and marine photographs differ significantly in appearance and content from general in-air images. Applying existing foundation models to marine visual analysis does not yield satisfactory performance, due not only to the data distribution shift but also to the intrinsic limitations of these models (e.g., lacking semantics, generating redundant masks, or being restricted to image-level scene understanding). In this work, we address marine ecosystem understanding from both the model and the data perspectives. We introduce MarineInst, a foundation model for analyzing the marine realms through instance visual description: it outputs instance masks together with captions for marine object instances. To train MarineInst, we construct MarineInst20M, the largest marine image dataset to date, which contains a wide spectrum of marine images with high-quality semantic instance masks. These masks combine human-annotated instance masks with model-generated masks produced by our automatic binary instance filtering procedure. To generate informative and detailed semantic instance captions, we use vision-language models to produce captions with semantic richness at various granularities. Our model and dataset support a wide range of marine visual analysis tasks, from image-level scene understanding to regional mask-level instance understanding. More significantly, MarineInst exhibits strong generalization and flexibility, supporting a wide range of downstream tasks with state-of-the-art performance, as demonstrated in Fig. 1. Project website: https://marineinst.hkustvgd.com.
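The abstract outlines a three-stage flow: propose candidate instance masks, keep only masks accepted by a binary instance filter, then caption each surviving instance with a vision-language model. Below is a minimal, hypothetical Python sketch of that flow. The helper names (propose_masks, is_instance, caption_region, and the toy_* stand-ins) are placeholders of our own naming, standing in for, e.g., a SAM-style mask generator, the binary instance filter, and a VLM captioner; this is not the paper's actual implementation.

# Hypothetical sketch (not the paper's actual API): MarineInst-style
# instance visual description in three stages: (1) propose candidate
# instance masks, (2) keep only masks accepted by a binary instance
# filter, (3) caption each surviving instance with a vision-language model.
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class InstanceDescription:
    mask: np.ndarray  # boolean HxW mask for one object instance
    caption: str      # natural-language description of that instance


def describe_instances(
    image: np.ndarray,
    propose_masks: Callable[[np.ndarray], List[np.ndarray]],
    is_instance: Callable[[np.ndarray, np.ndarray], bool],
    caption_region: Callable[[np.ndarray, np.ndarray], str],
) -> List[InstanceDescription]:
    # Stage 1: candidate masks; Stage 2: binary instance filtering;
    # Stage 3: per-instance captioning.
    results = []
    for mask in propose_masks(image):
        if not is_instance(image, mask):  # drop background/partial masks
            continue
        results.append(
            InstanceDescription(mask=mask, caption=caption_region(image, mask))
        )
    return results


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end on random data.
    h, w = 64, 64
    image = np.random.rand(h, w, 3)

    def toy_propose(img):
        m1 = np.zeros((h, w), bool)
        m1[10:30, 10:30] = True  # a plausible object-sized region
        m2 = np.zeros((h, w), bool)
        m2[0:2, 0:2] = True      # a tiny fragment the filter should reject
        return [m1, m2]

    toy_filter = lambda img, m: int(m.sum()) > 50
    toy_caption = lambda img, m: "a hypothetical marine organism"

    for inst in describe_instances(image, toy_propose, toy_filter, toy_caption):
        print(inst.mask.sum(), inst.caption)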
Acknowledgment
We thank Yang Wu and Jianbo Shi for their valuable discussions and suggestions. We also thank all the volunteer students who helped with the annotations and evaluations. The work was partially supported by the Innovation and Technology Support Programme of the Innovation and Technology Fund (Ref: ITS/200/20FP), the Marine Conservation Enhancement Fund (MCEF20107 and MCEF23EG01), and an internal grant from HKUST (R9429). Binh-Son Hua is supported by Science Foundation Ireland under the SFI Frontiers for the Future Programme (22/FFP-P/11522).
Cite this paper
Zheng, Z., Chen, Y., Zeng, H., Vu, T.A., Hua, B.S., Yeung, S.K. (2025). MarineInst: A Foundation Model for Marine Image Analysis with Instance Visual Description. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024. LNCS, vol. 15060. Springer, Cham. https://doi.org/10.1007/978-3-031-72627-9_14