Exploring Boundary of GPT-4V on Marine Analysis: A Preliminary Case Study

Ziqiang Zheng¹, Yiwei Chen¹, Jipeng Zhang¹, Tuan-Anh Vu¹, Huimin Zeng², Yue Him Wong Tim³, Sai-Kit Yeung¹
¹The Hong Kong University of Science and Technology,
²University of Science and Technology of China, ³Shenzhen University
{zzhengaw,ychenmb,jzhanggr,tavu}@connect.ust.hk, saikit@ust.hk

Abstract

Large language models (LLMs) have demonstrated a powerful ability to answer various queries as a general-purpose assistant. The continuous multi-modal large language models (MLLM) empower LLMs with the ability to perceive visual signals. The launch of GPT-4 (Generative Pre-trained Transformers) has generated significant interest in the research communities. GPT-4V(ison) has demonstrated significant power in both academia and industry fields, as a focal point in a new artificial intelligence generation. Though significant success was achieved by GPT-4V, exploring MLLMs in domain-specific analysis (e.g., marine analysis) that required domain-specific knowledge and expertise has gained less attention. In this study, we carry out the preliminary and comprehensive case study of utilizing GPT-4V for marine analysis. This report conducts a systematic evaluation of existing GPT-4V, assessing the performance of GPT-4V on marine research and also setting a new standard for future developments in MLLMs. The experimental results of GPT-4V show that the responses generated by GPT-4V are still far away from satisfying the domain-specific requirements of the marine professions. All images and prompts used in this study will be available at https://github.com/hkust-vgd/Marine_GPT-4V_Eval

1 Introduction

Large language models (LLMs) (Raffel et al., 2020; Chiang et al., 2023; Zhang et al., 2022; Touvron et al., 2023a; b; Ouyang et al., 2022; OpenAI, 2023; Brown et al., 2020; Scao et al., 2022) demonstrated an impressive ability to handle a large range of user-tailored tasks. As a general-purpose assistant, ChatGPT/GPT-4 (OpenAI, 2023; Ouyang et al., 2022) could understand human intents and complete various real-world tasks. The development of multi-modal large language models (Li et al., 2023c; Zhu et al., 2023; Zheng et al., 2023c; Peng et al., 2023a; Team et al., 2023; Alayrac et al., 2022) (MLLMs) such as GPT-4V represents an important step towards more sophisticated AI systems with the ability to receive both textual inputs and visual data. The integration of vision in language models has marked a significant milestone. GPT-4V showcased impressive general-purpose visual understanding and reasoning abilities. The advent of GPT-4V has expanded AI applications, aligning with the multi-modal capabilities of the human brain. In detail, GPT-4V extends the abilities of GPT-4 to analyze and interpret images and has attracted significant attention across both academia and industry.

Existing open-source general-purpose MLLMs (Liu et al., 2023; Peng et al., 2023b; Li et al., 2023a) often lack in image-text analysis (Lu et al., 2022) due to limited model size and data scale. It is still unclear how GPT-4V, and MLLMs built on GPT-4, perform various multimodal understanding tasks. Though vision capabilities embodied in GPT-4 have pioneered new avenues for advanced image-text analysis, the challenges (Fu et al., 2023a; Singh et al., 2023) of evaluating how GPT-4V accurately perceives visual signals and measuring the effectiveness of such a system arise. To evaluate whether GPT-4V could achieve robust visual perception and mimic the inherently subjective and associative processes of human perception, recent studies (Yang et al., 2023; Zhang et al., 2023; Fu et al., 2023b; Ge et al., 2023; Bubeck et al., 2023) have been conducted to evaluate the performance of GPT-4V in different areas, such as recommendation (Zhou et al., 2023), medical analysis (Li et al., 2023b), radiological (Busch et al., 2023), mathematic (Gao et al., 2023), and general-purpose visual analysis tasks (Yang et al., 2023; Bubeck et al., 2023). Evaluating the performance of GPT-4V in these areas will provide insights into the flexibility of GPT-4V as the AI assistant. However, there are few attempts (Palnitkar et al., 2023; Zheng et al., 2023c) to utilize GPT-4V for more advanced analysis, which requires advanced and domain-specific knowledge and expertise.

To bridge this gap, we present a preliminary case study investigating the marine analysis based on GPT-4V. We explore whether GPT-4V could serve as an effective visual perception system and a professional expert for sensitive, informative, and accurate knowledge delivery. We construct a series of qualitative test samples spanning multiple purposes in the field of marine analysis and employ these samples to assess the quality of the responses generated by GPT-4V.

We propose to evaluate the performance of GPT-4V on marine analysis from the following aspects: perception, statistics, domain-specific question answering, marine culture understanding, advanced functions and prompt engineering. We pick up images that are not accessible online or private data, combined with manually crafted prompts to build the evaluation samples. Evaluation results on our constructed testing samples prove that GPT-4V has a remarkable OCR, event detection, and framework understanding ability across various conditions, due to its robust visual-text comprehension capabilities and extensive knowledge. However, we have also observed the intrinsic limitations of using GPT-4V for marine analysis. GPT-4V only demonstrates very limited fine-grained marine object recognition ability and is easily misled by meticulously forged filenames (we observe that GPT-4V will read the filenames of uploaded images as context prompts). Besides, GPT-4V cannot perform complicated object counting and detect all the objects within the visual images since it is mainly performing image-level understanding. GPT-4V also failed to accurately capture subtle details in images and respond with the required domain-specific information. We finally demonstrate that GPT-4V cannot conduct advanced marine analysis as a professional analysis tool. We summarize our findings as follows.

•

In this study, we embark on an in-depth analysis of GPT-4V on domain-specific marine analysis. The expert capacity of GPT-4V has been measured for applying the learned domain knowledge and skills to the professional domains. Our study holds significant importance for the marine research community, providing valuable insights and guidance for future exploration of utilizing MLLMs for domain-specific analysis.
•

We demonstrate several limitations of GPT-4V on marine analysis. Despite these limitations, we also aim to include a list of potential abilities of GPT-4V that we have identified as a domain-specific analysis tool. We hope that these explorations and our constructed domain-specific testing samples can offer valuable insights and serve as domain-specific benchmark data for evaluating MLLMs on domains with professional knowledge.
•

We also acknowledge GPT-4V could be easily misled by the wrong prompts (e.g., the filenames of visual images), demonstrating GPT-4V leans towards the text prompts and without looking at the visual elements within the images. The hallucination happens a lot when GPT-4V is asked to answer domain-specific questions.

2 Experiments

2.1 Approach

Data construction. To avoid the testing sample leakage, all the samples involved in this study are from different sources: 1) private data collection contributed by marine biologists (Zheng et al., 2023a); 2) manually cropped frames from YouTube videos; 3) Internet images posted after the release of GPT-4V APIs; 4) framework and flowchart images from research articles and books (Haixin et al., 2023; Ziqiang et al., 2023); and 5) images from public datasets (Beijbom et al., 2015) and our newly created images. To promote the consistency and reliability of our study and increase the robustness of our findings, we make sure that every case has at least 10 testing samples with high diversity.

Prompt design. GPT-4V has been demonstrated to support a diverse range of visual processing based on various signed prompts (Wang et al., 2022; Peng et al., 2023a). This inspires us to design the various prompts. Our prompts in this study are characterized by a rich diversity and complexity of instructions to enable GPT-4V to generate comprehensive and descriptive responses, which are aligned with the user intents.

Evaluation metric. In each testing case, we compute the accuracy of GPT-4V on a wide range of visual tasks. For those object recognition tasks with ground truth labeled by the domain experts, we evaluate whether GPT-4V could yield satisfactory object recognition performance according to the generated labels. For those evaluation metrics with human judgment involved, we mainly design two protocols (Zhang et al., 2023; Ge et al., 2023): pairwise comparison and image-based scoring. For pairwise comparison, we judge whether the two images come from the same identity or the same species. For pairwise scoring, we ask both GPT-4V and human labelers to generate scores on a scale of 1 to 10. The ground truths under the two protocols are both generated by human experts.

2.2 Perception

In this section, our goal is to assess the performance of GPT-4V in various challenging vision tasks. The involved tasks demand a powerful visual perception ability to understand the real world. Our experiments focus on the ability of GPT-4V to sense the visual contents and then perform image-level, object-level and attribute-level comprehension.

Refer to caption — Figure 1: The marine object recognition results under three different settings: left column (with random filename); middle column (with meticulously forged misleading filename); and right column (with meaningful and aligned filename). The texts in red represent the wrong responses and texts in green indicate the correct responses. The prompts are “Recognize the object in this figure”.

We first explore whether GPT-4V could really understand the visual content of the given marine images or just respond without looking at the visual signals. We perform experiments using the same images under three settings: 1) with random filename; 2) with meticulously forged misleading filename; and 3) with meaningful and aligned filename. The experimental results are illustrated in Figure 1. The filenames and the ground truths of the marine objects are also provided as references. As illustrated, we observe that GPT-4V will recognize the marine objects within the given image under the first setting since no side clues are provided. GPT-4V tends to describe all the appeared meaningful objects and usually yields longer responses. Under the second setting, with the misleading filename given, GPT-4V will respond according to the given file name and generate some “false promise” that does not appear in the image. GPT-4V could be easily deceived by the meticulously forged filenames and yield some wrong answers. We guess that GPT-4V would read the filename of the uploaded image and regard such filename as the context prompt when generating the responses. It will easily produce a hallucination if the wrong context prompts do not exist in the image. As for the final setting, when the correct and aligned filenames are given, GPT-4V could generate meaningful and satisfactory responses. However, we cannot claim that GPT-4V could really understand the visual contents of uploaded images since abstracted conception names have already leaked in the filenames. More inference results under the three settings are provided in Figure 2, Figure 3, and Figure 4, respectively.

Considering the conception leakage issue, we rename all the images in all our experiments to meaningless filenames to avoid information leakage and ensure fair testing.

2.2.1 Marine object recognition

Wide spectrum of marine object recognition. We first explore whether GPT-4V could recognize a wide range of marine objects. We pick up 300 different marine images that contain the salient visual elements from one single marine species. In other words, there are 300 different marine species involved in our experiments. These images are manually cropped from the Youtube videos or the MVK dataset (Truong et al., 2023; Zheng et al., 2023a) The ground truth of the appeared marine objects is labeled by domain experts and we manually compared the recognized object names with the ground truth for computing the recognition accuracy. Some marine object recognition results are provided in Figure 5. As illustrated, GPT-4V failed to accurately recognize marine objects that are not relatively common. There is still a very large room to improve the recognition accuracy of GPT-4V on marine object recognition.

Marine object recognition under challenging conditions. We then test whether GPT-4V is capable of depicting the key visual elements under some challenging conditions, including crowded scene, objects with weird appearances, fluffy object, irregular boundary, tiny object, camouflaged object, object detection under occlusion, low visibility, and optical artifacts. All the experimental results are reported in Figure 6 and Figure 7, respectively. For these testing experiments, we make sure there are at least 10 images under each experimental setting. We compute the recognition accuracy under those diverse settings. We observe that GPT-4V has a poor ability to accurately recognize the visual elements under challenging conditions. We guess that such failure of GPT-4V may be subject to the minority training data from the marine field. More training data collected under challenging conditions should be further included to promote the recognition ability of GPT-4V in challenging conditions.

2.2.2 Fine-grained marine object recognition

We test whether GPT-4V could discriminate very similar marine objects (e.g., fine-grained object recognition) and generate different responses based on given visual contents. We report the fine-grained object recognition results of GPT-4V in Figure 8. As demonstrated, GPT-4V failed to tell the differences of close-related marine objects with similar appearances. The fine-grained object recognition ability is required in the marine analysis field since it could enable diversity monitoring and reduce the human labor from the domain experts on species identification. There is still a far away from utilizing GPT-4V for marine species identification.

We then perform the pairwise comparing, formulating a pair of images and asking GPT-4V whether the objects within the two images belong to the same marine species. Figure 9 illustrates the pairwise comparing performance. We formulate 20 pairs and compute the correct rate of GPT-4V on this task. Cross-view fish re-identification. We have also performed experiments to ask the GPT-4V to judge whether the objects within the images captured under different camera views (e.g., frontal, bird and side views) are the same object. Figure 10 demonstrates that GPT-4V has a poor ability to retrieve objects with camera view changes. GPT-4V refused to respond to the matching question even though the two fishes from the two visual images share very different appearances.

2.2.3 Robustness Analysis

In this section, we test the robustness of GPT-4V in recognizing various formats of visual signals, such as the fisheye (Zheng et al., 2023b), 360 ${}^{\circ}$ (Huang et al., 2023), sonar (Xie et al., 2022) and Lidar images. Figure 11 illustrates the recognition results of GPT-4V on 360 ${}^{\circ}$ and fisheye images. GPT-4V could observe the distortion of 360 ${}^{\circ}$ images but cannot explicitly explain why the distortion happens. In most cases, it could accurately recognize the visual elements from the visual images, however, it seems to have hallucination on the components in the submarine images where the visibility is low and images tend to be more murky, showing its limited robustness to fisheye and 360 ${}^{\circ}$ images. What’s more, it is an expert at recognizing how the images are captured through the edge or border of the viewpoint. We report the further object recognition results of GPT-4V on sonar images and Lidar images in Figure 12. GPT-4V can recognize the general shape of the existing objects but cannot effectively detect what kind of stuff they are in sonar images due to the appearance shift. But for Lidar images in which objects’ appearance doesn’t shift a lot, GPT-4T can exactly describe the element in detail, showing a very good understanding of the image.

We then identify whether GPT-4V could effectively recognize object regions with highlighted masks as demonstrated in Figure 13, exploring the referring comprehension ability of GPT-4V. The partial parts of the whole image are highlighted by purple and we ask GPT-4V to identify the highlighted regions. Furthermore, GPT-4V is asked to compute the cover of the highlighted coral regions. GPT-4V could generate the Python codes to compute the cover statistics. However, GPT-4V would self-define the RGB value range of “purple” without explanation. However, such a definition could be wrong and cannot handle visual images with high complexity.

2.2.4 Physical World Knowledge Understanding

We finally explore whether GPT-4V could really understand the physical world knowledge, for example, the spatial, size, color and texture attributes of the existing objects within the images. We explore the capability of GPT-4V to apply common sense knowledge in understanding visual contents within images. We have investigated the models’ ability to comprehend visual information via the application of knowledge, which encompasses commonsense, subject knowledge, multicultural customs, and world knowledge. The results are illustrated in Figure 14. GPT-4V shows its strong capability of understanding the physical world knowledge like spatial, size and texture attributes and it also has great robustness to the wrong knowledge that does not correspond with the image and correct it. Even if we provide it with some really misleading images with close view of a dolphin and a far view of a blue whale, it could still correctly tell the real size of these objects.

2.3 Statistics

In this section, we aim to explore the ability of GPT-4V to perform visual statistics based on the visual contents, such as object counting and summarizing all the appeared objects within images.

2.3.1 Object counting

We perform object counting experiments under five settings: 1) fewer than 10 objects; 2) 10-20 objects; 3) 20-50 objects; 4) 50-100 objects and 5) more than 100 objects. All the qualitative results have been reported in Figure 15. As demonstrated, GPT-4V only demonstrates a limited ability to count the existing objects within the images, especially if the objects are occluded together or the objects are tiny. Meanwhile, since the GPT-4V directly yields the estimation results of objects without explicitly localizing the objects (e.g., bounding box), the estimation results will likely be not accurate. Furthermore, we have also observed that GPT-4V tends to generate an exact number of presented objects within the images when there are few objects visible. In contrast, GPT-4V instead yields a rough number of the object counting results. To avoid potential mistakes, GPT-4V outputs a range (e.g., more than 100) for the estimated objects. In summary, the external object detection tools for localizing the objects should be integrated to promote the object counting ability of MLLM.

2.3.2 Recognizing all the objects

We then explore the ability of GPT-4V to recognize all the existing objects within the given visual images and list the corresponding names of all the recognized objects. Figure 16 demonstrates the recognition results under the crowded and structured palette. The GPT-4V struggles to recognize all the objects within the images and only lists very few common object categories. Furthermore, we observe that GPT-4V could summarize the implicit intention of such visual images and try to summarize the relationships between the recognized objects. However, due to the large number of objects, some less commonly known species, and the low image resolution, GPT-4V shows a very limited performance on recognizing all the objects in one single image while it could still understand some general information of the image, like title, colors and common features of objects. Similar to the object counting task, GPT-4V tends to discard many objects within the images and only tries to recognize some common objects easy to recognize to avoid making mistakes, but this also makes it hard for GPT-4V to recognize all the objects existing in the image.

2.4 Domain-specific Question-Answering

we examine the ability of GPT-4V to apply knowledge in the fields of marine to understand visual images. We observe that GPT-4V possesses the relevant subject knowledge associated with the following cases.

Multiple choice questions. We first explore the ability of GPT-4V to answer the marine multiple-choice questions. We upload the manually written marine questions and corresponding choices to GPT-4V and ask GPT-4V to generate the answers in Figure 17. As demonstrated, GPT-4V has shown a strong optical character recognition (OCR) ability to extract the correct text information from the uploaded images and a satisfactory promise for handling basic marine knowledge. We have manually constructed 100 multiple-choice questions, which come from marine biology, oceanography, and geology. The accuracy of GPT-4V is computed to quantitatively assess the quality of GPT-4V in answering the domain-specific questions.

Domain-specific VQA. We evaluate whether GPT-4V could understand the user intent of the domain experts and the ability of GPT-4V for abstract visual reasoning and scientific problem-solving. Such abilities are required for marine researchers to analyze the data (figures and tables) collected to gain insights into various aspects of marine research fields. Results are reported in Figure 18 and Figure 19, respectively. As demonstrated in Figure 18, GPT-4V could understand most elements of the left scientific figure but make a tiny mistake about the temperature range. Besides, GPT-4V could understand the temporal changes within the scientific figure and conclude the implicit intention. It could accurately describe the coral status of each sub-figure and conclude the progression changes. We have also included more visual scientific examples essential for handling marine biology, engineering, oceanography, and etc.

Furthermore, we feed GPT-4V with scientific figures and tables from the field of marine engineering as reported in Figure 19. GPT-4V could effectively understand the flowchart. GPT-4V could describe the logic inside of the flowchart and respond with more reasoning details. GPT-4V could also understand the tables in detail. When being asked a question that requires intermediate reasoning procedures, GPT-4V could answer correctly with detailed reasoning procedures. However, GPT-4V still has difficulties in providing a precise answer in some cases, which is mainly constrained by the unsatisfactory OCR accuracy in Figure 19.

Multi-round conversation. We finally assess the ability of GPT-4V to support multi-round conversations. Users could ask different questions for comprehensive analysis, as demonstrated in Figure 20. Our study suggests that GPT-4V, could generate corresponding responses aligned with the user intent and cover the detailed information. However, GPT-4V struggles with the marine object recognition. With the wrongly identified marine objects, GPT-4V leads to error accumulation, which suggests that GPT-4V only responds based on the previously generated keywords (as the context prompt) without looking at the visual contents. How to alleviate the hallucination of MLLMs is a valuable and important future research direction.

2.5 Marine Cultural Understanding

We investigate the ability of GPT-4V to recognize logos, landmarks, artist images, and more in Figure 21, Figure 22, and Figure 22.

In Figure 21, GPT-4V could effectively recognize the globally known NOAA logo and yield a detailed description of the appearance of the logo. However, there is still a hallucination with the description of the NOAA logo. We guess the generated responses are from the training corpus of GPT-4V rather than being aligned with the visual elements. As for the novel logos, GPT-4V could describe the appearance of the designed logos. The feature patterns of the logos are comprehensively described and GPT-4V could assess the artistic and literary representations of themes and species.

We then ask GPT-4V to perform marine artist image recognition and description as illustrated in Figure 22. GPT-4V could efficiently describe the visual elements of marine artist images. We present the capacity of GPT-4V to depict the appearance of the cartoon images, paintings, and actual photographs. GPT-4V demonstrates a strong ability to assess the aesthetic quality of visual images and describe the partial parts of each image.

Finally, we report the landmark recognition performance of GPT-4V in Figure 23. GPT-4V can identify the marine vestige and statures. The detailed appearances of recognized ruins are further described in detail, demonstrating the strong ability of GPT-4V to perceive the visual images. However, GPT-4V cannot accurately discriminate the statures with irregular shapes and poses.

2.6 Advanced Functions

In this section, we aim to explore the possibility of utilizing GPT-4V for some advanced and complicated functions in the marine research field, such as coral coverage estimation, benthic composition statistic, multi-modal reasoning, relationship summarization, and etc.

2.6.1 Coral coverage estimation

Coral reefs are among the most biodiverse ecosystems on our planet and provide habitat for countless marine species. Monitoring coral coverage allows researchers to assess the overall health and condition of these ecosystems. In this section, we aim to explore the feasibility of utilizing GPT-4V for coral coverage estimation. Figure 24 represents some preliminary results of coral coverage estimation. GPT-4V avoids directly outputting the coral coverage and instead attempts to generate some computer vision processing codes for coral coverage estimation. The generated coral coverage is far away from the real ground truth. Besides, GPT-4V may lead to the ignorance of the tiny corals or the minority coral types and then result in wrong policy making.

We then examine the ability of GPT-4V to discriminate the coral reef composition from the visual images in Figure 25. GPT-4V could accurately recognize the coral reefs and missed the brain coral reefs. Moreover, we have also explored the ability of GPT-4V to understand the coral bleaching, which is linked to warming seas, can lead to declines in coral coverage. When being asked whether the coral reefs are bleached, GPT-4V has made a wrong judgment. GPT-4V cannot understand the meaning of “bleaching” and describes the degree of coral bleaching due to the lack of a reference color bar.

2.6.2 Benthic Composition

Understanding the benthic composition from the captured visual images could help researchers characterize and classify marine ecosystems based on the types of organisms and substrate present. Different benthic communities support distinct sets of species and play unique ecological roles. We explore the potential of utilizing GPT-4V to generate the benthic analysis data, which could be further used for monitoring the impact of factors like pollution, climate change, and habitat destruction. The results are illustrated in Figure 26. We first ask GPT-4V to generate the benthic composition data (the composition of non-creatures and creatures) from the uploaded visual image and then identify how many types of coral reefs. Furthermore, we examine the ability of GPT-4V for benthic invertebrate identification (e.g., corals, sponges, mollusks, and worms), algae, and even certain fish species.

Our experimental results show that GPT-4V nearly cannot achieve benthic composition statistics without utilizing an external professional analysis tool or being fed corresponding analysis data for final report generation. Even though GPT-4V could generate some very naive computer vision processing codes for analysis, the analyzed outputs are still very far from the requirement of a professional expert. Meanwhile, the whole processing and analysis procedure lacks the reasoning steps and support of the domain-specific evidence.

2.6.3 Relationship Summarization and Event Detection

Relationship summarization. Exploring the relationships between marine creatures allows conservationists to make informed decisions about protecting vulnerable or endangered species. In this section, we assess the ability of GPT-4V to comprehend how different creatures interact and summarize the relationship between them, such as predator-prey relationships, symbiosis, competition, and mutualism. Such summarized marine relationships could gain insights into the behavior, evolution, and adaptation of species. It is worth noting that we mainly focus on the relationship summarization from the perspective of marine biology research. The qualitative results are reported in Figure 27. As demonstrated, GPT-4V has shown a satisfactory ability to understand and describe some well-known relationships between recognized objects, such as the symbiotic relationship between clownfish and the sea anemone. But in contrast, when GPT-4V fails to recognize the marine objects accurately, it will generate totally irrelative responses, and the responses are nearly based on its “imagination”.

Event detection. Through event detection, domain experts could predict and mitigate the impacts of events like climate change and pollution. Some preliminary case studies about event detection are illustrated in Figure 28. We collect more samples about 1) identifying irregular behaviors, such as illegal fishing, vessel collisions, or suspicious activities, which can be crucial for maritime safety and security; 2) monitoring the changes of marine conditions, such as water levels, wave patterns, and coastal erosion; and 3) detecting abnormal events in marine images, which can help identify unusual events such as oil spills, coral bleaching, and marine pollution. Detecting these abnormalities early allows for a rapid response to mitigate environmental damage and protect marine ecosystems. The excitement of unveiling the unknown serves as a powerful motivator for researchers and explorers. From the early exploration as demonstrated in Figure 28, GPT-4V possesses a strong ability to understand the event presented in the visual images.

2.6.4 Framework and Flowchart Understanding

We test whether GPT-4V showcases some detailed reasoning procedures and the ability to understand the inside intention of the designed images, including the framework and flow chart images. GPT-4V is required to explain the whole framework step by step and describe the intermediate step in detail. We provide visual reasoning results of GPT-4V from various fields in Figure 29 (scientific figure understanding), Figure 30 (implicit intention understanding), and Figure 31 (the framework understanding), respectively. Our exploration targets how GPT-4V understands and reasons for the high-level information from the figures as a whole.

As shown in Figure 29, GPT-4V has demonstrated a very strong OCR ability to extract text information from visual images. It could summarize the hierarchical relationship between different parts and extract the key elements of the whole figure. Besides, GPT-4V can understand the structure information and guess the source and usage of the uploaded scientific images.

Furthermore, we observe that GPT-4V could understand the motivation of the illustration figures as demonstrated in Figure 30. It could accurately describe the inside motivation of drawn figures. However, we have also observed the hallucination of GPT-4V. It will generate some information that does not exist within the image based on some extracted keywords (e.g., “DAVIS-2017”). We attribute this phenomenon to the reason that GPT-4V may overfit its training data. How to prevent such hallucinations and alleviate the over-claim of GPT-4V is an important and valuable research direction.

Finally, we explore the ability of GPT-4V to understand and explain the framework or flowchart step by step in Figure 31. GPT-4V could accurately describe each part of the whole framework in detail and summarize the relationship between each part. Also, it demonstrates a satisfactory performance to understand the overall intention of the whole framework.

2.6.5 Aesthetic evaluation

We have also assessed the ability of GPT-4V to do the aesthetic evaluation. We manually constructed 50 marine images with high diversity then we uploaded the visual images to GPT-4V to generate the aesthetic score (scale of 10) based on the visual contents. To quantitatively evaluate the ability of GPT-4V for aesthetic assessment, we ask expert-level human labelers (3 annotators) to give the subjective scores towards the given marine images and we compute the mean value and the standard deviation. Then we first evaluate the alignment between the scores from GPT-4V and human labelers in terms of aesthetic measuring. We provide some qualitative results of GPT-4V in Figure 32. We observe that the scores generated by GPT-4V are highly correlated with human rating. GPT-4V successfully identifies the aesthetic quality of visual elements within the images and provides a comprehensive explanation for its scores. Our results reveal that GPT-4V achieves a promising agreement with humans on aesthetic quality assessment.

2.6.6 Temporal Sequence Understanding

We finally explore the potential ability of GPT-4V for temporal sequence understanding. Given the consecutive image frames sampled from the video sequence (e.g., uniformly sampling 8 frames), we concatenate the sampled frames to one image and then ask GPT-4V to summarize the event that happened in the given video sequence. The temporal sequence understanding requires the MLLMs to fully comprehend the information within the visual sequence. Understanding the event of a marine clip could be very valuable for detecting the abnormal behavior of marine creatures and then preventing the potential disaster. The results are illustrated in Figure 33. As illustrated in Figure, GPT-4V demonstrates the capability to recognize the action in the images and provide a detailed description. It has shown a promising potential to understand scenes from video and visual story generation.

2.7 Prompt Engineering

In this section, we aim to explore the effectiveness of introducing the current prompt engineering techniques designed for general-purpose MLLMs for marine research. We mainly focus on three settings: 1) few-shot prompts; 2) self-consistency and 3) chain-of-thoughts.

Under the first setting, we feed the GPT-4V with few-shot samples with corresponding annotations to guide GPT-4V as a domain expert and help it better understand our questions. Then we ask the GPT-4V for a similar question as shown in Figure 34. We observe that GPT-4V will still make mistakes and generate wrong responses even the few-shot prompts provided. We attribute this failure to the limited visual perception ability of GPT-4V. GPT-4V cannot effectively perform fine-grained object recognition.

To explore the self-consistency of the GPT-4V, we ask the GPT-4V to do the object counting task based on various prompts and we then perform voting to get the final object count result. Through this, we aim to measure the self-consistency of GPT-4V for the same visual input and the robustness of its generated responses. Through voting or feeding GPT-4V with clearer prompts, GPT-4V could generate more reliable and accurate object counting results as demonstrated in Figure 35.

Finally, we refer to the design of the chain-of-thoughts Yang et al. (2023) and add some simple explanations in our input prompts. The GPT-4V is asked to follow our explanation procedure and understand the reasoning inside the recognition. In this way, GPT-4V could describe more about its judgment and illustrate more supporting evidence. The results are reported in Figure 36. We observe that GPT-4V sucks the ability to accurately recognize marine objects even GPT-4V could generate plausible and detailed descriptions about the wrongly recognized object.

To sum up, the current prompt engineering techniques cannot heavily promote the visual recognition ability of GPT-4V on marine images. GPT-4V will still make mistakes for fine-grained marine object recognition and prompt engineering cannot alleviate the hallucination issue, effectively. To address these issues, more training data from the marine field should be included for further promoting the recognition ability of GPT-4V.

3 Discussions and Future Directions

3.1 Discussions

Possible for educational tool? While the performance of the GPT-4V is promising, we ask whether GPT-4V could be viewed as a potential educational tool that may in the future augment, but not replace, the nuanced analysis provided by trained marine professionals. GPT-4V could also play as a pivotal role in fostering a deeper understanding and appreciation for marine life among users of all ages and backgrounds. Through our findings in this study, we conclude that GPT-4V is far from generating valuable insights for domain experts.

Possible for labeling tool? With easy access to GPT-4V, it could actively encourage citizen science participation as a labeling tool, transforming ordinary individuals into valuable contributors to marine research. From our findings, we observe that GPT-4V cannot serve as a labeling tool for a wide spectrum of marine images since GPT-4V still makes many mistakes for challenging images. Moreover, such labeling is also only limited to image-level scene understanding. GPT-4V cannot generate accurate descriptions for the very fine-grained details.

Sample Bias. In our study, the testing samples are manually constructed, inevitably incorporating individual preferences and subjectivity. More importantly, our involved testing samples may not comprehensively represent real-world cases, and potentially over-estimate or down-estimate the challenges of utilizing GPT-4V for marine analysis.

3.2 Future Works

Our findings emphasize the need for continued research to enhance the accuracy and expertise of responses generated by GPT-4V. We hope that this study can inspire more comprehensive and targeted research into utilizing multimodal systems such as GPT-4V for domain-specific research and analysis. By harnessing the capabilities of these models, we can better meet the professional demands of experts, ultimately including the domain experts in the major users of GPT-4V. Furthermore, based on the feedback and further prompts from the domain experts, a fundamental question arises, could GPT-4V revise its responses over time? Such feedback-driven MLLM would further promote the user experience for obtaining more precise responses.

Through our experimental results, we have observed that GPT-4V cannot achieve fine-grained and accurate marine object recognition to satisfy the requirements of the domain experts. More training data from the marine field should be included to promote the visual recognition ability of GPT-4V. Furthermore, we also demonstrate that GPT-4V has shown a very limited ability to handle advanced marine analysis (e.g., counting, coverage estimation, composition statistic, etc) without utilizing an external professional tool. More domain-specific instruction-following data should be constructed to help GPT-4V yield explicit intermediate analysis results.

4 Conclusion

In this paper, our investigation of GPT-4V on marine analysis demonstrates some valuable findings and insights of MLLMs concerning visual understanding, logical reasoning, and expert capacity, indicating that there remains a considerable distance toward strong artificial intelligence as a domain expert.

References

Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
Beijbom et al. (2015) Oscar Beijbom, Peter J Edmunds, Chris Roelfsema, Jennifer Smith, David I Kline, Benjamin P Neal, Matthew J Dunlap, Vincent Moriarty, Tung-Yung Fan, Chih-Jui Tan, et al. Towards automated annotation of benthic survey images: Variability of human experts and operational modes of automation. PloS one, 10(7):e0130312, 2015.
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Bubeck et al. (2023) Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
Busch et al. (2023) Felix Busch, Tianyu Han, Marcus Makowski, Daniel Truhn, Keno Bressem, and Lisa Adams. From text to image: Exploring gpt-4vision’s potential in advanced radiological analysis across subspecialties. arXiv preprint arXiv:2311.14777, 2023.
Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023), 2023.
Fu et al. (2023a) Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023a.
Fu et al. (2023b) Chaoyou Fu, Renrui Zhang, Haojia Lin, Zihan Wang, Timin Gao, Yongdong Luo, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, et al. A challenger to gpt-4v? early explorations of gemini in visual expertise. arXiv preprint arXiv:2312.12436, 2023b.
Gao et al. (2023) Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, et al. G-llava: Solving geometric problem with multi-modal large language model. arXiv preprint arXiv:2312.11370, 2023.
Ge et al. (2023) Wentao Ge, Shunian Chen, Guiming Chen, Junying Chen, Zhihong Chen, Shuo Yan, Chenghao Zhu, Ziyue Lin, Wenya Xie, Xidong Wang, et al. Mllm-bench, evaluating multi-modal llms using gpt-4v. arXiv preprint arXiv:2311.13951, 2023.
Haixin et al. (2023) Liang Haixin, Zheng Ziqiang, Ma Zeyu, and Sai-Kit Yeung. Marinedet: Towards open-marine object detection. arXiv preprint arXiv:2310.01931, 2023.
Huang et al. (2023) Huajian Huang, Yinzhe Xu, Yingshu Chen, and Sai-Kit Yeung. 360vot: A new benchmark dataset for omnidirectional visual object tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 20566–20576, 2023.
Li et al. (2023a) Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023a.
Li et al. (2023b) Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890, 2023b.
Li et al. (2023c) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023c.
Liu et al. (2023) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
Lu et al. (2022) Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
OpenAI (2023) OpenAI. Gpt-4 technical report, 2023.
Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
Palnitkar et al. (2023) Aadi Palnitkar, Rashmi Kapu, Xiaomin Lin, Cheng Liu, Nare Karapetyan, and Yiannis Aloimonos. Chatsim: Underwater simulation with natural language prompting. arXiv preprint arXiv:2308.04029, 2023.
Peng et al. (2023a) Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023a.
Peng et al. (2023b) Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023b.
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
Scao et al. (2022) Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilic, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa Adelani, and et al. BLOOM: A 176b-parameter open-access multilingual language model. CoRR, abs/2211.05100, 2022. doi: 10.48550/arXiv.2211.05100. URL https://doi.org/10.48550/arXiv.2211.05100.
Singh et al. (2023) Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, and Gust Verbruggen. Assessing gpt4-v on structured reasoning tasks. arXiv preprint arXiv:2312.11524, 2023.
Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023a. doi: 10.48550/arXiv.2302.13971. URL https://doi.org/10.48550/arXiv.2302.13971.
Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
Truong et al. (2023) Quang-Trung Truong, Tuan-Anh Vu, Tan-Sang Ha, Jakub Lokoč, Yue-Him Wong, Ajay Joneja, and Sai-Kit Yeung. Marine video kit: a new marine video dataset for content-based analysis and retrieval. In International Conference on Multimedia Modeling, pp. 539–550. Springer, 2023.
Wang et al. (2022) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022.
Xie et al. (2022) Kaibing Xie, Jian Yang, and Kang Qiu. A dataset with multibeam forward-looking sonar for underwater object detection. Scientific Data, 9(1):739, 2022.
Yang et al. (2023) Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v (ision). arXiv preprint arXiv:2309.17421, 9(1), 2023.
Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona T. Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: open pre-trained transformer language models. CoRR, abs/2205.01068, 2022. doi: 10.48550/arXiv.2205.01068. URL https://doi.org/10.48550/arXiv.2205.01068.
Zhang et al. (2023) Xinlu Zhang, Yujie Lu, Weizhi Wang, An Yan, Jun Yan, Lianke Qin, Heng Wang, Xifeng Yan, William Yang Wang, and Linda Ruth Petzold. Gpt-4v (ision) as a generalist evaluator for vision-language tasks. arXiv preprint arXiv:2311.01361, 2023.
Zheng et al. (2023a) Ziqiang Zheng, Tan-Sang Ha, Yingshu Chen, Haixin Liang, Apple Pui-Yi Chui, Yue-Him Wong, and Sai-Kit Yeung. Marine video cloud: A cloud-based video analytics platform for collaborative marine research. In OCEANS 2023-Limerick, pp. 1–6. IEEE, 2023a.
Zheng et al. (2023b) Ziqiang Zheng, Zhichao Xin, Zhibin Yu, and Sai-Kit Yeung. Real-time gan-based image enhancement for robust underwater monocular slam. Frontiers in Marine Science, 2023b.
Zheng et al. (2023c) Ziqiang Zheng, Jipeng Zhang, Tuan-Anh Vu, Shizhe Diao, Yue Him Wong Tim, and Sai-Kit Yeung. Marinegpt: Unlocking secrets of ocean to the public. arXiv preprint arXiv:2310.13596, 2023c.
Zhou et al. (2023) Peilin Zhou, Meng Cao, You-Liang Huang, Qichen Ye, Peiyan Zhang, Junling Liu, Yueqi Xie, Yining Hua, and Jaeboum Kim. Exploring recommendation capabilities of gpt-4v (ision): A preliminary case study. arXiv preprint arXiv:2311.04199, 2023.
Zhu et al. (2023) Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
Ziqiang et al. (2023) Zheng Ziqiang, Xie Yaofeng, Liang Haixin, Yu Zhibin, and Sai-Kit Yeung. Coralvos: Dataset and benchmark for coral video segmentation. arXiv preprint arXiv:2310.01946, 2023.