Image and Video Processing

New submissions
Cross-lists
Replacements

See recent articles

Showing new listings for Thursday, 10 October 2024

Total of 47 entries

Showing up to 2000 entries per page: fewer | more | all

[1] arXiv:2410.05272 [pdf, other]: Title: DVS: Blood cancer detection using novel CNN-based ensemble approach

Md Taimur Ahad, Israt Jahan Payel, Bo Song, Yan Li

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Blood cancer can only be diagnosed properly if it is detected early. Each year, more than 1.24 million new cases of blood cancer are reported worldwide. There are about 6,000 cancers worldwide due to this disease. The importance of cancer detection and classification has prompted researchers to evaluate Deep Convolutional Neural Networks for the purpose of classifying blood cancers. The objective of this research is to conduct an in-depth investigation of the efficacy and suitability of modern Convolutional Neural Network (CNN) architectures for the detection and classification of blood malignancies. The study focuses on investigating the potential of Deep Convolutional Neural Networks (D-CNNs), comprising not only the foundational CNN models but also those improved through transfer learning methods and incorporated into ensemble strategies, to detect diverse forms of blood cancer with a high degree of accuracy. This paper provides a comprehensive investigation into five deep learning architectures derived from CNNs. These models, namely VGG19, ResNet152v2, SEresNet152, ResNet101, and DenseNet201, integrate ensemble learning techniques with transfer learning strategies. A comparison of DenseNet201 (98.08%), VGG19 (96.94%), and SEresNet152 (90.93%) shows that DVS outperforms CNN. With transfer learning, DenseNet201 had 95.00% accuracy, VGG19 had 72.29%, and SEresNet152 had 94.16%. In the study, the ensemble DVS model achieved 98.76% accuracy. Based on our study, the ensemble DVS model is the best for detecting and classifying blood cancers.
[2] arXiv:2410.05341 [pdf, html, other]: Title: NeuroBOLT: Resting-state EEG-to-fMRI Synthesis with Multi-dimensional Feature Mapping

Yamin Li, Ange Lou, Ziyuan Xu, Shengchao Zhang, Shiyu Wang, Dario J. Englot, Soheil Kolouri, Daniel Moyer, Roza G. Bayrak, Catie Chang

Comments: This preprint has been accepted to NeurIPS 2024

Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Functional magnetic resonance imaging (fMRI) is an indispensable tool in modern neuroscience, providing a non-invasive window into whole-brain dynamics at millimeter-scale spatial resolution. However, fMRI is constrained by issues such as high operation costs and immobility. With the rapid advancements in cross-modality synthesis and brain decoding, the use of deep neural networks has emerged as a promising solution for inferring whole-brain, high-resolution fMRI features directly from electroencephalography (EEG), a more widely accessible and portable neuroimaging modality. Nonetheless, the complex projection from neural activity to fMRI hemodynamic responses and the spatial ambiguity of EEG pose substantial challenges both in modeling and interpretability. Relatively few studies to date have developed approaches for EEG-fMRI translation, and although they have made significant strides, the inference of fMRI signals in a given study has been limited to a small set of brain areas and to a single condition (i.e., either resting-state or a specific task). The capability to predict fMRI signals in other brain areas, as well as to generalize across conditions, remain critical gaps in the field. To tackle these challenges, we introduce a novel and generalizable framework: NeuroBOLT, i.e., Neuro-to-BOLD Transformer, which leverages multi-dimensional representation learning from temporal, spatial, and spectral domains to translate raw EEG data to the corresponding fMRI activity signals across the brain. Our experiments demonstrate that NeuroBOLT effectively reconstructs resting-state fMRI signals from primary sensory, high-level cognitive areas, and deep subcortical brain regions, achieving state-of-the-art accuracy and significantly advancing the integration of these two modalities.
[3] arXiv:2410.05621 [pdf, html, other]: Title: RNR-Nav: A Real-World Visual Navigation System Using Renderable Neural Radiance Maps

Minsoo Kim, Obin Kwon, Howoong Jun, Songhwai Oh

Subjects: Image and Video Processing (eess.IV)

We propose a novel visual localization and navigation framework for real-world environments directly integrating observed visual information into the bird-eye-view map. While the renderable neural radiance map (RNR-Map) shows considerable promise in simulated settings, its deployment in real-world scenarios poses undiscovered challenges. RNR-Map utilizes projections of multiple vectors into a single latent code, resulting in information loss under suboptimal conditions. To address such issues, our enhanced RNR-Map for real-world robots, RNR-Map++, incorporates strategies to mitigate information loss, such as a weighted map and positional encoding. For robust real-time localization, we integrate a particle filter into the correlation-based localization framework using RNRMap++ without a rendering procedure. Consequently, we establish a real-world robot system for visual navigation utilizing RNR-Map++, which we call "RNR-Nav." Experimental results demonstrate that the proposed methods significantly enhance rendering quality and localization robustness compared to previous approaches. In real-world navigation tasks, RNR-Nav achieves a success rate of 84.4%, marking a 68.8% enhancement over the methods of the original RNR-Map paper.
[4] arXiv:2410.05882 [pdf, html, other]: Title: Future frame prediction in chest cine MR imaging using the PCA respiratory motion model and dynamically trained recurrent neural networks

Michel Pohl, Mitsuru Uesaka, Hiroyuki Takahashi, Kazuyuki Demachi, Ritu Bhusal Chhatkuli

Comments: 28 pages, 16 figures

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

Lung radiotherapy treatment systems are subject to a latency that leads to uncertainty in the estimated tumor location and high irradiation of healthy tissue. This work addresses future frame prediction in chest dynamic MRI sequences to compensate for that delay using RNNs trained with online learning algorithms. The latter enable networks to mitigate irregular movements, as they update synaptic weights with each new training example. Experiments were conducted using four publicly available 2D thoracic cine-MRI sequences. PCA decomposes the time-varying deformation vector field (DVF), computed with the Lucas-Kanade optical flow algorithm, into static deformation fields and low-dimensional time-dependent weights. We compare various algorithms to forecast the latter: linear regression, least mean squares (LMS), and RNNs trained with real-time recurrent learning (RTRL), unbiased online recurrent optimization, decoupled neural interfaces and sparse 1-step approximation (SnAp-1). That enables estimating the future DVFs and, in turn, the next frames by warping the initial image. Linear regression led to the lowest mean DVF error at a horizon h = 0.32s (the time interval in advance for which the prediction is made), equal to 1.30mm, followed by SnAp-1 and RTRL, whose error increased from 1.37mm to 1.44mm as h increased from 0.62s to 2.20s. Similarly, the structural similarity index measure (SSIM) of LMS decreased from 0.904 to 0.898 as h increased from 0.31s to 1.57s and was the highest among the algorithms compared for the latter horizons. SnAp-1 attained the highest SSIM for h $\geq$ 1.88s, with values of less than 0.898. The predicted images look similar to the original ones, and the highest errors occurred at challenging areas such as the diaphragm boundary at the end-of-inhale phase, where motion variability is more prominent, and regions where out-of-plane motion was more prevalent.
[5] arXiv:2410.06161 [pdf, html, other]: Title: Automated quality assessment using appearance-based simulations and hippocampus segmentation on low-field paediatric brain MR images

Vaanathi Sundaresan, Nicola K Dinsdale

Comments: MICCAI 2024 Low field pediatric brain magnetic resonance Image Segmentation and quality Assurance (LISA) Challenge

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Understanding the structural growth of paediatric brains is a key step in the identification of various neuro-developmental disorders. However, our knowledge is limited by many factors, including the lack of automated image analysis tools, especially in Low and Middle Income Countries from the lack of high field MR images available. Low-field systems are being increasingly explored in these countries, and, therefore, there is a need to develop automated image analysis tools for these images. In this work, as a preliminary step, we consider two tasks: 1) automated quality assurance and 2) hippocampal segmentation, where we compare multiple approaches. For the automated quality assurance task a DenseNet combined with appearance-based transformations for synthesising artefacts produced the best performance, with a weighted accuracy of 82.3%. For the segmentation task, registration of an average atlas performed the best, with a final Dice score of 0.61. Our results show that although the images can provide understanding of large scale pathologies and gross scale anatomical development, there still remain barriers for their use for more granular analyses.
[6] arXiv:2410.06385 [pdf, html, other]: Title: Skin Cancer Machine Learning Model Tone Bias

James Pope, Md Hassanuzzaman, Mingmar Sherpa, Omar Emara, Ayush Joshi, Nirmala Adhikari

Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Background: Many open-source skin cancer image datasets are the result of clinical trials conducted in countries with lighter skin tones. Due to this tone imbalance, machine learning models derived from these datasets can perform well at detecting skin cancer for lighter skin tones. Any tone bias in these models could introduce fairness concerns and reduce public trust in the artificial intelligence health field.
Methods: We examine a subset of images from the International Skin Imaging Collaboration (ISIC) archive that provide tone information. The subset has a significant tone imbalance. These imbalances could explain a model's tone bias. To address this, we train models using the imbalanced dataset and a balanced dataset to compare against. The datasets are used to train a deep convolutional neural network model to classify the images as malignant or benign. We then evaluate the models' disparate impact, based on selection rate, relative to dark or light skin tone.
Results: Using the imbalanced dataset, we found that the model is significantly better at detecting malignant images in lighter tone resulting in a disparate impact of 0.577. Using the balanced dataset, we found that the model is also significantly better at detecting malignant images in lighter versus darker tones with a disparate impact of 0.684. Using the imbalanced or balanced dataset to train the model still results in a disparate impact well below the standard threshold of 0.80 which suggests the model is biased with respect to skin tone.
Conclusion: The results show that typical skin cancer machine learning models can be tone biased. These results provide evidence that diagnosis or tone imbalance is not the cause of the bias. Other techniques will be necessary to identify and address the bias in these models, an area of future investigation.
[7] arXiv:2410.06465 [pdf, other]: Title: On the Solution of Linearized Inverse Scattering Problems in Near-Field Microwave Imaging by Operator Inversion and Matched Filtering

Matthias M. Saurer, Han Na, Marius Brinkmann, Thomas F. Eibert

Comments: Submitted to IEEE Transactions on Microwave Theory and Techniques on 1st October 2024, and currently under revision

Subjects: Image and Video Processing (eess.IV)

Microwave imaging is commonly based on the solution of linearized inverse scattering problems by matched-filtering algorithms, i.e., by applying the adjoint of the forward scattering operator to the observation data. A more rigorous approach is the explicit inversion of the forward scattering operator, which is performed in this work for quasi-monostatic imaging scenarios based on a planar plane-wave representation according to the Weyl-identity and hierarchical acceleration algorithms. The inversion is achieved by a regularized iterative linear system of equations solver, where irregular observations as well as full probe correction are supported. In the spatial image generation low-pass filtering can be considered in order to reduce imaging artifacts. A corresponding spectral back-projection algorithm and a spatial back-projection algorithm together with improved focusing operators are also introduced and the resulting image generation algorithms are analyzed and compared for a variety of examples, comprising both simulated and measured observation data.
[8] arXiv:2410.06478 [pdf, html, other]: Title: MaskBlur: Spatial and Angular Data Augmentation for Light Field Image Super-Resolution

Wentao Chao, Fuqing Duan, Yulan Guo, Guanghui Wang

Comments: accepted by IEEE Transactions on Multimedia

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Data augmentation (DA) is an effective approach for enhancing model performance with limited data, such as light field (LF) image super-resolution (SR). LF images inherently possess rich spatial and angular information. Nonetheless, there is a scarcity of DA methodologies explicitly tailored for LF images, and existing works tend to concentrate solely on either the spatial or angular domain. This paper proposes a novel spatial and angular DA strategy named MaskBlur for LF image SR by concurrently addressing spatial and angular aspects. MaskBlur consists of spatial blur and angular dropout two components. Spatial blur is governed by a spatial mask, which controls where pixels are blurred, i.e., pasting pixels between the low-resolution and high-resolution domains. The angular mask is responsible for angular dropout, i.e., selecting which views to perform the spatial blur operation. By doing so, MaskBlur enables the model to treat pixels differently in the spatial and angular domains when super-resolving LF images rather than blindly treating all pixels equally. Extensive experiments demonstrate the efficacy of MaskBlur in significantly enhancing the performance of existing SR methods. We further extend MaskBlur to other LF image tasks such as denoising, deblurring, low-light enhancement, and real-world SR. Code is publicly available at \url{this https URL}.
[9] arXiv:2410.06483 [pdf, html, other]: Title: Deep Learning Ensemble for Predicting Diabetic Macular Edema Onset Using Ultra-Wide Field Color Fundus Image

Pengyao Qin, Arun J. Thirunavukarasu, Le Zhang

Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Diabetic macular edema (DME) is a severe complication of diabetes, characterized by thickening of the central portion of the retina due to accumulation of fluid. DME is a significant and common cause of visual impairment in diabetic patients. Center-involved DME (ci-DME) is the highest risk form of disease as fluid extends close to the fovea which is responsible for sharp central vision. Earlier diagnosis or prediction of ci-DME may improve treatment outcomes. Here, we propose an ensemble method to predict ci-DME onset within a year using ultra-wide-field color fundus photography (UWF-CFP) images provided by the DIAMOND Challenge. We adopted a variety of baseline state-of-the-art classification networks including ResNet, DenseNet, EfficientNet, and VGG with the aim of enhancing model robustness. The best performing models were Densenet 121, Resnet 152 and EfficientNet b7, and these were assembled into a definitive predictive model. The final ensemble model demonstrates a strong performance with an Area Under Curve (AUC) of 0.7017, an F1 score of 0.6512, and an Expected Calibration Error (ECE) of 0.2057 when deployed on a synthetic dataset. The performance of this ensemble model is comparable to previous studies despite training and testing in a more realistic setting, indicating the potential of UWF-CFP combined with a deep learning classification system to facilitate earlier diagnosis, better treatment decisions, and improved prognostication in ci-DME.
[10] arXiv:2410.06542 [pdf, html, other]: Title: MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging

Noel C. F. Codella, Ying Jin, Shrey Jain, Yu Gu, Ho Hin Lee, Asma Ben Abacha, Alberto Santamaria-Pang, Will Guyman, Naiteek Sangani, Sheng Zhang, Hoifung Poon, Stephanie Hyland, Shruthi Bannur, Javier Alvarez-Valle, Xue Li, John Garrett, Alan McMillan, Gaurav Rajguru, Madhu Maddi, Nilesh Vijayrania, Rehaan Bhimai, Nick Mecklenburg, Rupal Jain, Daniel Holstein, Naveen Gaur, Vijay Aski, Jenq-Neng Hwang, Thomas Lin, Ivan Tarapov, Matthew Lungren, Mu Wei

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

In this work, we present MedImageInsight, an open-source medical imaging embedding model. MedImageInsight is trained on medical images with associated text and labels across a diverse collection of domains, including X-Ray, CT, MRI, dermoscopy, OCT, fundus photography, ultrasound, histopathology, and mammography. Rigorous evaluations demonstrate MedImageInsight's ability to achieve state-of-the-art (SOTA) or human expert level performance across classification, image-image search, and fine-tuning tasks. Specifically, on public datasets, MedImageInsight achieves SOTA in CT 3D medical image retrieval, as well as SOTA in disease classification and search for chest X-ray, dermatology, and OCT imaging. Furthermore, MedImageInsight achieves human expert performance in bone age estimation (on both public and partner data), as well as AUC above 0.9 in most other domains. When paired with a text decoder, MedImageInsight achieves near SOTA level single image report findings generation with less than 10\% the parameters of other models. Compared to fine-tuning GPT-4o with only MIMIC-CXR data for the same task, MedImageInsight outperforms in clinical metrics, but underperforms on lexical metrics where GPT-4o sets a new SOTA. Importantly for regulatory purposes, MedImageInsight can generate ROC curves, adjust sensitivity and specificity based on clinical need, and provide evidence-based decision support through image-image search (which can also enable retrieval augmented generation). In an independent clinical evaluation of image-image search in chest X-ray, MedImageInsight outperformed every other publicly available foundation model evaluated by large margins (over 6 points AUC), and significantly outperformed other models in terms of AI fairness (across age and gender). We hope releasing MedImageInsight will help enhance collective progress in medical imaging AI research and development.
[11] arXiv:2410.06624 [pdf, html, other]: Title: Optimized Magnetic Resonance Fingerprinting Using Ziv-Zakai Bound

Chaoguang Gong, Yue Hu, Peng Li, Lixian Zou, Congcong Liu, Yihang Zhou, Yanjie Zhu, Dong Liang, Haifeng Wang

Subjects: Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM); Applications (stat.AP)

Magnetic Resonance Fingerprinting (MRF) has emerged as a promising quantitative imaging technique within the field of Magnetic Resonance Imaging (MRI), offers comprehensive insights into tissue properties by simultaneously acquiring multiple tissue parameter maps in a single acquisition. Sequence optimization is crucial for improving the accuracy and efficiency of MRF. In this work, a novel framework for MRF sequence optimization is proposed based on the Ziv-Zakai bound (ZZB). Unlike the Cramér-Rao bound (CRB), which aims to enhance the quality of a single fingerprint signal with deterministic parameters, ZZB provides insights into evaluating the minimum mismatch probability for pairs of fingerprint signals within the specified parameter range in MRF. Specifically, the explicit ZZB is derived to establish a lower bound for the discrimination error in the fingerprint signal matching process within MRF. This bound illuminates the intrinsic limitations of MRF sequences, thereby fostering a deeper understanding of existing sequence performance. Subsequently, an optimal experiment design problem based on ZZB was formulated to ascertain the optimal scheme of acquisition parameters, maximizing discrimination power of MRF between different tissue types. Preliminary numerical experiments show that the optimized ZZB scheme outperforms both the conventional and CRB schemes in terms of the reconstruction accuracy of multiple parameter maps.
[12] arXiv:2410.06723 [pdf, html, other]: Title: Evaluating Computational Pathology Foundation Models for Prostate Cancer Grading under Distribution Shifts

Fredrik K. Gustafsson, Mattias Rantalainen

Comments: Preprint, work in progress

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Foundation models have recently become a popular research direction within computational pathology. They are intended to be general-purpose feature extractors, promising to achieve good performance on a range of downstream tasks. Real-world pathology image data does however exhibit considerable variability. Foundation models should be robust to these variations and other distribution shifts which might be encountered in practice. We evaluate two computational pathology foundation models: UNI (trained on more than 100,000 whole-slide images) and CONCH (trained on more than 1.1 million image-caption pairs), by utilizing them as feature extractors within prostate cancer grading models. We find that while UNI and CONCH perform well relative to baselines, the absolute performance can still be far from satisfactory in certain settings. The fact that foundation models have been trained on large and varied datasets does not guarantee that downstream models always will be robust to common distribution shifts.
[13] arXiv:2410.06757 [pdf, other]: Title: Diff-FMT: Diffusion Models for Fluorescence Molecular Tomography

Qianqian Xue, Peng Zhang, Xingyu Liu, Wenjian Wang, Guanglei Zhang

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Fluorescence molecular tomography (FMT) is a real-time, noninvasive optical imaging technology that plays a significant role in biomedical research. Nevertheless, the ill-posedness of the inverse problem poses huge challenges in FMT reconstructions. Previous various deep learning algorithms have been extensively explored to address the critical issues, but they remain faces the challenge of high data dependency with poor image quality. In this paper, we, for the first time, propose a FMT reconstruction method based on a denoising diffusion probabilistic model (DDPM), termed Diff-FMT, which is capable of obtaining high-quality reconstructed images from noisy images. Specifically, we utilize the noise addition mechanism of DDPM to generate diverse training samples. Through the step-by-step probability sampling mechanism in the inverse process, we achieve fine-grained reconstruction of the image, avoiding issues such as loss of image detail that can occur with end-to-end deep-learning methods. Additionally, we introduce the fluorescence signals as conditional information in the model training to sample a reconstructed image that is highly consistent with the input fluorescence signals from the noisy images. Numerous experimental results show that Diff-FMT can achieve high-resolution reconstruction images without relying on large-scale datasets compared with other cutting-edge algorithms.
[14] arXiv:2410.06781 [pdf, html, other]: Title: Transesophageal Echocardiography Generation using Anatomical Models

Emmanuel Oladokun, Musa Abdulkareem, Jurica Šprem, Vicente Grau

Comments: MICCAI2023; DALI Workshop

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Through automation, deep learning (DL) can enhance the analysis of transesophageal echocardiography (TEE) images. However, DL methods require large amounts of high-quality data to produce accurate results, which is difficult to satisfy. Data augmentation is commonly used to tackle this issue. In this work, we develop a pipeline to generate synthetic TEE images and corresponding semantic labels. The proposed data generation pipeline expands on an existing pipeline that generates synthetic transthoracic echocardiography images by transforming slices from anatomical models into synthetic images. We also demonstrate that such images can improve DL network performance through a left-ventricle semantic segmentation task. For the pipeline's unpaired image-to-image (I2I) translation section, we explore two generative methods: CycleGAN and contrastive unpaired translation. Next, we evaluate the synthetic images quantitatively using the Fréchet Inception Distance (FID) Score and qualitatively through a human perception quiz involving expert cardiologists and the average researcher.
In this study, we achieve a dice score improvement of up to 10% when we augment datasets with our synthetic images. Furthermore, we compare established methods of assessing unpaired I2I translation and observe a disagreement when evaluating the synthetic images. Finally, we see which metric better predicts the generated data's efficacy when used for data augmentation.
[15] arXiv:2410.06825 [pdf, other]: Title: K-SAM: A Prompting Method Using Pretrained U-Net to Improve Zero Shot Performance of SAM on Lung Segmentation in CXR Images

Mohamed Deriche, Mohammad Marufur

Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

In clinical procedures, precise localization of the target area is an essential step for clinical diagnosis and screening. For many diagnostic applications, lung segmentation of chest X-ray images is an essential first step that significantly reduces the image size to speed up the subsequent analysis. One of the primary difficulties with this task is segmenting the lung regions covered by dense abnormalities also known as opacities due to diseases like pneumonia and tuberculosis. SAM has astonishing generalization capabilities for category agnostic segmentation. In this study we propose an algorithm to improve zero shot performance of SAM on lung region segmentation task by automatic prompt selection. Two separate UNet models were trained, one for predicting lung segments and another for heart segment. Though these predictions lack fine details around the edges, they provide positive and negative points as prompt for SAM. Using proposed prompting method zero shot performance of SAM is evaluated on two benchmark datasets. ViT-l version of the model achieved slightly better performance compared to other two versions, ViTh and ViTb. It yields an average Dice score of 95.5 percent and 94.9 percent on hold out data for two datasets respectively. Though, for most of the images, SAM did outstanding segmentation, its prediction was way off for some of the images. After careful inspection it is found that all of these images either had extreme abnormality or distorted shape. Unlike most of the research performed so far on lung segmentation from CXR images using SAM, this study proposes a fully automated prompt selection process only from the input image. Our finding indicates that using pretrained models for prompt selection can utilize SAM impressive generalization capability to its full extent.
[16] arXiv:2410.06892 [pdf, html, other]: Title: Selecting the Best Sequential Transfer Path for Medical Image Segmentation with Limited Labeled Data

Jingyun Yang, Jingge Wang, Guoqing Zhang, Yang Li

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

The medical image processing field often encounters the critical issue of scarce annotated data. Transfer learning has emerged as a solution, yet how to select an adequate source task and effectively transfer the knowledge to the target task remains challenging. To address this, we propose a novel sequential transfer scheme with a task affinity metric tailored for medical images. Considering the characteristics of medical image segmentation tasks, we analyze the image and label similarity between tasks and compute the task affinity scores, which assess the relatedness among tasks. Based on this, we select appropriate source tasks and develop an effective sequential transfer strategy by incorporating intermediate source tasks to gradually narrow the domain discrepancy and minimize the transfer cost. Thereby we identify the best sequential transfer path for the given target task. Extensive experiments on three MRI medical datasets, FeTS 2022, iSeg-2019, and WMH, demonstrate the efficacy of our method in finding the best source sequence. Compared with directly transferring from a single source task, the sequential transfer results underline a significant improvement in target task performance, achieving an average of 2.58% gain in terms of segmentation Dice score, notably, 6.00% for FeTS 2022. Code is available at the git repository.
[17] arXiv:2410.06974 [pdf, html, other]: Title: Diagnosis of Malignant Lymphoma Cancer Using Hybrid Optimized Techniques Based on Dense Neural Networks

Salah A. Aly, Ali Bakhiet, Mazen Balat

Comments: 6 pages, 5 figures, 4 tables, IEEE ICCA

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Lymphoma diagnosis, particularly distinguishing between subtypes, is critical for effective treatment but remains challenging due to the subtle morphological differences in histopathological images. This study presents a novel hybrid deep learning framework that combines DenseNet201 for feature extraction with a Dense Neural Network (DNN) for classification, optimized using the Harris Hawks Optimization (HHO) algorithm. The model was trained on a dataset of 15,000 biopsy images, spanning three lymphoma subtypes: Chronic Lymphocytic Leukemia (CLL), Follicular Lymphoma (FL), and Mantle Cell Lymphoma (MCL). Our approach achieved a testing accuracy of 99.33\%, demonstrating significant improvements in both accuracy and model interpretability. Comprehensive evaluation using precision, recall, F1-score, and ROC-AUC underscores the model's robustness and potential for clinical adoption. This framework offers a scalable solution for improving diagnostic accuracy and efficiency in oncology.
[18] arXiv:2410.06997 [pdf, html, other]: Title: A Diffusion-based Xray2MRI Model: Generating Pseudo-MRI Volumes From one Single X-ray

Zhe Wang, Rachid Jennane, Aladine Chetouani, Mohamed Jarraya

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Knee osteoarthritis (KOA) is a prevalent musculoskeletal disorder, and X-rays are commonly used for its diagnosis due to their cost-effectiveness. Magnetic Resonance Imaging (MRI), on the other hand, offers detailed soft tissue visualization and has become a valuable supplementary diagnostic tool for KOA. Unfortunately, the high cost and limited accessibility of MRI hinder its widespread use, leaving many patients with KOA reliant solely on X-ray imaging. In this study, we introduce a novel diffusion-based Xray2MRI model capable of generating pseudo-MRI volumes from one single X-ray image. In addition to using X-rays as conditional input, our model integrates target depth, KOA probability distribution, and image intensity distribution modules to guide the synthesis process, ensuring that the generated corresponding slices accurately correspond to the anatomical structures. Experimental results demonstrate that by integrating information from X-rays with additional input data, our proposed approach is capable of generating pseudo-MRI sequences that approximate real MRI scans. Moreover, by increasing the inference times, the model achieves effective interpolation, further improving the continuity and smoothness of the generated MRI sequences, representing one promising initial attempt for cost-effective medical imaging solutions.
[19] arXiv:2410.07043 [pdf, html, other]: Title: Z-upscaling: Optical Flow Guided Frame Interpolation for Isotropic Reconstruction of 3D EM Volumes

Fisseha A. Ferede, Ali Khalighifar, Jaison John, Krishnan Venkataraman, Khaled Khairy

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

We propose a novel optical flow based approach to enhance the axial resolution of anisotropic 3D EM volumes to achieve isotropic 3D reconstruction. Assuming spatial continuity of 3D biological structures in well aligned EM volumes, we reasoned that optical flow estimation techniques, often applied for temporal resolution enhancement in videos, can be utilized. Pixel level motion is estimated between neighboring 2D slices along z, using spatial gradient flow estimates to interpolate and generate new 2D slices resulting in isotropic voxels. We leverage recent state-of-the-art learning methods for video frame interpolation and transfer learning techniques, and demonstrate the success of our approach on publicly available ultrastructure EM volumes.
[20] arXiv:2410.07111 [pdf, other]: Title: Utility of Multimodal Large Language Models in Analyzing Chest X-ray with Incomplete Contextual Information

Choonghan Kim, Seonhee Cho, Joo Heung Yoon

Subjects: Image and Video Processing (eess.IV); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Background: Large language models (LLMs) are gaining use in clinical settings, but their performance can suffer with incomplete radiology reports. We tested whether multimodal LLMs (using text and images) could improve accuracy and understanding in chest radiography reports, making them more effective for clinical decision support.
Purpose: To assess the robustness of LLMs in generating accurate impressions from chest radiography reports using both incomplete data and multimodal data. Material and Methods: We used 300 radiology image-report pairs from the MIMIC-CXR database. Three LLMs (OpenFlamingo, MedFlamingo, IDEFICS) were tested in both text-only and multimodal formats. Impressions were first generated from the full text, then tested by removing 20%, 50%, and 80% of the text. The impact of adding images was evaluated using chest x-rays, and model performance was compared using three metrics with statistical analysis.
Results: The text-only models (OpenFlamingo, MedFlamingo, IDEFICS) had similar performance (ROUGE-L: 0.39 vs. 0.21 vs. 0.21; F1RadGraph: 0.34 vs. 0.17 vs. 0.17; F1CheXbert: 0.53 vs. 0.40 vs. 0.40), with OpenFlamingo performing best on complete text (p<0.001). Performance declined with incomplete data across all models. However, adding images significantly boosted the performance of MedFlamingo and IDEFICS (p<0.001), equaling or surpassing OpenFlamingo, even with incomplete text. Conclusion: LLMs may produce low-quality outputs with incomplete radiology data, but multimodal LLMs can improve reliability and support clinical decision-making.
Keywords: Large language model; multimodal; semantic analysis; Chest Radiography; Clinical Decision Support;
[21] arXiv:2410.07148 [pdf, html, other]: Title: Lateral Ventricle Shape Modeling using Peripheral Area Projection for Longitudinal Analysis

Wonjung Park, Suhyun Ahn, Jinah Park

Comments: Annual Conference on Medical Image Understanding and Analysis (2024)

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

The deformation of the lateral ventricle (LV) shape is widely studied to identify specific morphometric changes associated with diseases. Since LV enlargement is considered a relative change due to brain atrophy, local longitudinal LV deformation can indicate deformation in adjacent brain areas. However, conventional methods for LV shape analysis focus on modeling the solely segmented LV mask. In this work, we propose a novel deep learning-based approach using peripheral area projection, which is the first attempt to analyze LV considering its surrounding areas. Our approach matches the baseline LV mesh by deforming the shape of follow-up LVs, while optimizing the corresponding points of the same adjacent brain area between the baseline and follow-up LVs. Furthermore, we quantitatively evaluated the deformation of the left LV in normal (n=10) and demented subjects (n=10), and we found that each surrounding area (thalamus, caudate, hippocampus, amygdala, and right LV) projected onto the surface of LV shows noticeable differences between normal and demented subjects.

[22] arXiv:2410.05342 (cross-list from q-bio.NC) [pdf, html, other]: Title: Multi-Stage Graph Learning for fMRI Analysis to Diagnose Neuro-Developmental Disorders

Wenjing Gao, Yuanyuan Yang, Jianrui Wei, Xuntao Yin, Xinhan Di

Comments: Accepted by CVPR 2024 CV4Science Workshop (8 pages, 4 figures, 2 tables)

Subjects: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

The insufficient supervision limit the performance of the deep supervised models for brain disease diagnosis. It is important to develop a learning framework that can capture more information in limited data and insufficient supervision. To address these issues at some extend, we propose a multi-stage graph learning framework which incorporates 1) pretrain stage : self-supervised graph learning on insufficient supervision of the fmri data 2) fine-tune stage : supervised graph learning for brain disorder diagnosis. Experiment results on three datasets, Autism Brain Imaging Data Exchange ABIDE I, ABIDE II and ADHD with AAL1,demonstrating the superiority and generalizability of the proposed framework compared to the state of art of models.(ranging from 0.7330 to 0.9321,0.7209 to 0.9021,0.6338 to 0.6699)
[23] arXiv:2410.05403 (cross-list from cs.CV) [pdf, other]: Title: Deep learning-based Visual Measurement Extraction within an Adaptive Digital Twin Framework from Limited Data Using Transfer Learning

Mehrdad Shafiei Dizaji

Comments: 37, 14

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Digital Twins technology is revolutionizing decision-making in scientific research by integrating models and simulations with real-time data. Unlike traditional Structural Health Monitoring methods, which rely on computationally intensive Digital Image Correlation and have limitations in real-time data integration, this research proposes a novel approach using Artificial Intelligence. Specifically, Convolutional Neural Networks are employed to analyze structural behaviors in real-time by correlating Digital Image Correlation speckle pattern images with deformation fields. Initially focusing on two-dimensional speckle patterns, the research extends to three-dimensional applications using stereo-paired images for comprehensive deformation analysis. This method overcomes computational challenges by utilizing a mix of synthetically generated and authentic speckle pattern images for training the Convolutional Neural Networks. The models are designed to be robust and versatile, offering a promising alternative to traditional measurement techniques and paving the way for advanced applications in three-dimensional modeling. This advancement signifies a shift towards more efficient and dynamic structural health monitoring by leveraging the power of Artificial Intelligence for real-time simulation and analysis.
[24] arXiv:2410.05410 (cross-list from cs.CV) [pdf, html, other]: Title: Enhanced Super-Resolution Training via Mimicked Alignment for Real-World Scenes

Omar Elezabi, Zongwei Wu, Radu Timofte

Comments: Accepted by ACCV 2024

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Image super-resolution methods have made significant strides with deep learning techniques and ample training data. However, they face challenges due to inherent misalignment between low-resolution (LR) and high-resolution (HR) pairs in real-world datasets. In this study, we propose a novel plug-and-play module designed to mitigate these misalignment issues by aligning LR inputs with HR images during training. Specifically, our approach involves mimicking a novel LR sample that aligns with HR while preserving the degradation characteristics of the original LR samples. This module seamlessly integrates with any SR model, enhancing robustness against misalignment. Importantly, it can be easily removed during inference, therefore without introducing any parameters on the conventional SR models. We comprehensively evaluate our method on synthetic and real-world datasets, demonstrating its effectiveness across a spectrum of SR models, including traditional CNNs and state-of-the-art Transformers. The source codes will be publicly made available at this https URL .
[25] arXiv:2410.05443 (cross-list from cs.CV) [pdf, html, other]: Title: A Deep Learning-Based Approach for Mangrove Monitoring

Lucas José Velôso de Souza, Ingrid Valverde Reis Zreik, Adrien Salem-Sermanet, Nacéra Seghouani, Lionel Pourchier

Comments: 12 pages, accepted to the MACLEAN workshop of ECML/PKDD 2024

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Mangroves are dynamic coastal ecosystems that are crucial to environmental health, economic stability, and climate resilience. The monitoring and preservation of mangroves are of global importance, with remote sensing technologies playing a pivotal role in these efforts. The integration of cutting-edge artificial intelligence with satellite data opens new avenues for ecological monitoring, potentially revolutionizing conservation strategies at a time when the protection of natural resources is more crucial than ever. The objective of this work is to provide a comprehensive evaluation of recent deep-learning models on the task of mangrove segmentation. We first introduce and make available a novel open-source dataset, MagSet-2, incorporating mangrove annotations from the Global Mangrove Watch and satellite images from Sentinel-2, from mangrove positions all over the world. We then benchmark three architectural groups, namely convolutional, transformer, and mamba models, using the created dataset. The experimental outcomes further validate the deep learning community's interest in the Mamba model, which surpasses other architectures in all metrics.
[26] arXiv:2410.05474 (cross-list from cs.CV) [pdf, html, other]: Title: R-Bench: Are your Large Multimodal Model Robust to Real-world Corruptions?

Chunyi Li, Jianbo Zhang, Zicheng Zhang, Haoning Wu, Yuan Tian, Wei Sun, Guo Lu, Xiaohong Liu, Xiongkuo Min, Weisi Lin, Guangtao Zhai

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)

The outstanding performance of Large Multimodal Models (LMMs) has made them widely applied in vision-related tasks. However, various corruptions in the real world mean that images will not be as ideal as in simulations, presenting significant challenges for the practical application of LMMs. To address this issue, we introduce R-Bench, a benchmark focused on the **Real-world Robustness of LMMs**. Specifically, we: (a) model the complete link from user capture to LMMs reception, comprising 33 corruption dimensions, including 7 steps according to the corruption sequence, and 7 groups based on low-level attributes; (b) collect reference/distorted image dataset before/after corruption, including 2,970 question-answer pairs with human labeling; (c) propose comprehensive evaluation for absolute/relative robustness and benchmark 20 mainstream LMMs. Results show that while LMMs can correctly handle the original reference images, their performance is not stable when faced with distorted images, and there is a significant gap in robustness compared to the human visual system. We hope that R-Bench will inspire improving the robustness of LMMs, **extending them from experimental simulations to the real-world application**. Check this https URL for details.
[27] arXiv:2410.05607 (cross-list from physics.optics) [pdf, html, other]: Title: Single picture single photon single pixel 3D imaging through unknown thick scattering medium

Long Pan, Yunan Wang, Yijie Lou, Xiaohua Feng

Subjects: Optics (physics.optics); Image and Video Processing (eess.IV)

Imaging through thick scattering media presents significant challenges, particularly for three-dimensional (3D) applications. This manuscript demonstrates a novel scheme for single-image-enabled 3D imaging through such media, treating the scattering medium as a lens. This approach captures a comprehensive image containing information about objects hidden at various depths. By leveraging depth from focus and the reduced thickness of the scattering medium for single-pixel imaging, the proposed method ensures robust 3D imaging capabilities. We develop both traditional metric-based and deep learning-based methods to extract depth information for each pixel, allowing us to explore the locations of both positive and negative objects, whether shallow or deep. Remarkably, this scheme enables the simultaneous 3D reconstruction of targets concealed within the scattering medium. Specifically, we successfully reconstructed targets buried at depths of 5 mm and 30 mm within a total medium thickness of 60 mm. Additionally, we can effectively distinguish targets at three different depths. Notably, this scheme requires no prior knowledge of the scattering medium, no invasive procedures, reference measurements, or calibration.
[28] arXiv:2410.06068 (cross-list from cs.HC) [pdf, html, other]: Title: Resolution limit of the eye: how many pixels can we see?

Maliha Ashraf, Alexandre Chapiro, Rafał K. Mantiuk

Comments: Main document: 12 pages, 4 figures, 1 table. Supplementary: 14 pages, 12 figures, 4 tables

Subjects: Human-Computer Interaction (cs.HC); Graphics (cs.GR); Multimedia (cs.MM); Image and Video Processing (eess.IV)

As large engineering efforts go towards improving the resolution of mobile, AR and VR displays, it is important to know the maximum resolution at which further improvements bring no noticeable benefit. This limit is often referred to as the "retinal resolution", although the limiting factor may not necessarily be attributed to the retina. To determine the ultimate resolution at which an image appears sharp to our eyes with no perceivable blur, we created an experimental setup with a sliding display, which allows for continuous control of the resolution. The lack of such control was the main limitation of the previous studies. We measure achromatic (black-white) and chromatic (red-green and yellow-violet) resolution limits for foveal vision, and at two eccentricities (10 and 20 deg). Our results demonstrate that the resolution limit is higher than what was previously believed, reaching 94 pixels-per-degree (ppd) for foveal achromatic vision, 89 ppd for red-green patterns, and 53 ppd for yellow-violet patterns. We also observe a much larger drop in the resolution limit for chromatic patterns (red-green and yellow-violet) than for achromatic. Our results set the north star for display development, with implications for future imaging, rendering and video coding technologies.
[29] arXiv:2410.06129 (cross-list from physics.med-ph) [pdf, html, other]: Title: Pinv-Recon: Generalized MR Image Reconstruction via Pseudoinversion of the Encoding Matrix

Kylie Yeung, Fergus V Gleeson, Rolf F Schulte, Anthony McIntyre, Sebastien Serres, Peter Morris, Dorothee Auer, Damian J Tyler, James T Grist, Florian Wiesinger

Comments: 26 pages, 8 figures, 2 tables (+ Supplementary Material). Submitted to Magnetic Resonance in Medicine

Subjects: Medical Physics (physics.med-ph); Image and Video Processing (eess.IV)

Purpose: To present a novel generalized MR image reconstruction based on pseudoinversion of the encoding matrix (Pinv-Recon) as a simple yet powerful method, and demonstrate its computational feasibility for diverse MR imaging applications. Methods: MR image encoding constitutes a linear mapping of the unknown image to the measured k-space data mediated via an encoding matrix ($ data = Encode \times image$). Pinv-Recon addresses MR image reconstruction as a linear inverse problem ($image = Encode^{-1} \times data$), explicitly calculating the Moore-Penrose pseudoinverse of the encoding matrix using truncated singular value decomposition (tSVD). Using a discretized, algebraic notation, we demonstrate constructing a generalized encoding matrix by stacking relevant encoding mechanisms (e.g., gradient encoding, coil sensitivity encoding, chemical shift inversion) and encoding distortions (e.g., off-center positioning, B$_0$ inhomogeneity, spatiotemporal gradient imperfections, transient relaxation effects). Iterative reconstructions using the explicit generalized encoding matrix, and the computation of the spatial-response-function (SRF) and noise amplification, were demonstrated. Results: We evaluated the computation times and memory requirements (time ~ (size of the encoding matrix)$^{1.4}$). Using the Shepp-Logan phantom, we demonstrated the versatility of the method for various intertwined MR image encoding and distortion mechanisms, achieving better MSE, PSNR and SSIM metrics than conventional methods. A diversity of datasets, including the ISMRM CG-SENSE challenge, were used to validate Pinv-Recon. Conclusion: Although pseudo-inversion of large encoding matrices was once deemed computationally intractable, recent advances make Pinv-Recon feasible. It has great promise for both research and clinical applications, and for educational use.
[30] arXiv:2410.06149 (cross-list from cs.CV) [pdf, other]: Title: Toward Scalable Image Feature Compression: A Content-Adaptive and Diffusion-Based Approach

Sha Guo, Zhuo Chen, Yang Zhao, Ning Zhang, Xiaotong Li, Lingyu Duan

Journal-ref: in Proceedings of the 31st ACM International Conference on Multimedia, pp. 1431-1442, 2023

Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)

Traditional image codecs emphasize signal fidelity and human perception, often at the expense of machine vision tasks. Deep learning methods have demonstrated promising coding performance by utilizing rich semantic embeddings optimized for both human and machine vision. However, these compact embeddings struggle to capture fine details such as contours and textures, resulting in imperfect reconstructions. Furthermore, existing learning-based codecs lack scalability. To address these limitations, this paper introduces a content-adaptive diffusion model for scalable image compression. The proposed method encodes fine textures through a diffusion process, enhancing perceptual quality while preserving essential features for machine vision tasks. The approach employs a Markov palette diffusion model combined with widely used feature extractors and image generators, enabling efficient data compression. By leveraging collaborative texture-semantic feature extraction and pseudo-label generation, the method accurately captures texture information. A content-adaptive Markov palette diffusion model is then applied to represent both low-level textures and high-level semantic content in a scalable manner. This framework offers flexible control over compression ratios by selecting intermediate diffusion states, eliminating the need for retraining deep learning models at different operating points. Extensive experiments demonstrate the effectiveness of the proposed framework in both image reconstruction and downstream machine vision tasks such as object detection, segmentation, and facial landmark detection, achieving superior perceptual quality compared to state-of-the-art methods.
[31] arXiv:2410.06180 (cross-list from cs.IR) [pdf, html, other]: Title: CBIDR: A novel method for information retrieval combining image and data by means of TOPSIS applied to medical diagnosis

Humberto Giuri, Renato A. Krohling

Comments: 28 pages

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Content-Based Image Retrieval (CBIR) have shown promising results in the field of medical diagnosis, which aims to provide support to medical professionals (doctor or pathologist). However, the ultimate decision regarding the diagnosis is made by the medical professional, drawing upon their accumulated experience. In this context, we believe that artificial intelligence can play a pivotal role in addressing the challenges in medical diagnosis not by making the final decision but by assisting in the diagnosis process with the most relevant information. The CBIR methods use similarity metrics to compare feature vectors generated from images using Convolutional Neural Networks (CNNs). In addition to the information contained in medical images, clinical data about the patient is often available and is also relevant in the final decision-making process by medical professionals. In this paper, we propose a novel method named CBIDR, which leverage both medical images and clinical data of patient, combining them through the ranking algorithm TOPSIS. The goal is to aid medical professionals in their final diagnosis by retrieving images and clinical data of patient that are most similar to query data from the database. As a case study, we illustrate our CBIDR for diagnostic of oral cancer including histopathological images and clinical data of patient. Experimental results in terms of accuracy achieved 97.44% in Top-1 and 100% in Top-5 showing the effectiveness of the proposed approach.
[32] arXiv:2410.06553 (cross-list from cs.LG) [pdf, html, other]: Title: DCP: Learning Accelerator Dataflow for Neural Network via Propagation

Peng Xu, Wenqi Shao, Mingyu Ding, Ping Luo

Subjects: Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Deep neural network (DNN) hardware (HW) accelerators have achieved great success in improving DNNs' performance and efficiency. One key reason is dataflow in executing a DNN layer, including on-chip data partitioning, computation parallelism, and scheduling policy, which have large impacts on latency and energy consumption. Unlike prior works that required considerable efforts from HW engineers to design suitable dataflows for different DNNs, this work proposes an efficient data-centric approach, named Dataflow Code Propagation (DCP), to automatically find the optimal dataflow for DNN layers in seconds without human effort. It has several attractive benefits that prior arts do not have. (i) We translate the HW dataflow configuration into a code representation in a unified dataflow coding space, which can be optimized by backpropagating gradients given a DNN layer or network. (ii) DCP learns a neural predictor to efficiently update the dataflow codes towards the desired gradient directions to minimize various optimization objectives e.g., latency and energy. (iii) It can be easily generalized to unseen HW configurations in a zero-shot or few-shot learning manner. For example, without using additional training data, DCP surpasses the GAMMA method that performs a full search using thousands of samples. Extensive experiments on several representative models such as MobileNet, ResNet, and ViT show that DCP outperforms its counterparts in various settings.
[33] arXiv:2410.06682 (cross-list from cs.CV) [pdf, html, other]: Title: Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization

Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zujun Ma, Chao Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Image and Video Processing (eess.IV)

Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimization (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimized using DPO. To further improve training, we introduce a novel multi-round DPO (mrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initializing the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilize the process. To address potential catastrophic forgetting of non-captioning abilities due to mrDPO, we propose rebirth tuning, which finetunes the pre-DPO LLM by using the captions generated by the mrDPO-trained model as supervised labels. Experiments show that mrDPO significantly enhances video-SALMONN 2's captioning accuracy, reducing global and local error rates by 40\% and 20\%, respectively, while decreasing the repetition rate by 35\%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining competitive performance to the state-of-the-art on widely used video question-answering benchmark among models of similar size. Upon acceptance, we will release the code, model checkpoints, and training and test data. Demos are available at \href{this https URL}{this https URL}.
[34] arXiv:2410.06689 (cross-list from cs.CV) [pdf, html, other]: Title: Perceptual Quality Assessment of Trisoup-Lifting Encoded 3D Point Clouds

Juncheng Long, Honglei Su, Qi Liu, Hui Yuan, Wei Gao, Jiarun Song, Zhou Wang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

No-reference bitstream-layer point cloud quality assessment (PCQA) can be deployed without full decoding at any network node to achieve real-time quality monitoring. In this work, we develop the first PCQA model dedicated to Trisoup-Lifting encoded 3D point clouds by analyzing bitstreams without full decoding. Specifically, we investigate the relationship among texture bitrate per point (TBPP), texture complexity (TC) and texture quantization parameter (TQP) while geometry encoding is lossless. Subsequently, we estimate TC by utilizing TQP and TBPP. Then, we establish a texture distortion evaluation model based on TC, TBPP and TQP. Ultimately, by integrating this texture distortion model with a geometry attenuation factor, a function of trisoupNodeSizeLog2 (tNSL), we acquire a comprehensive NR bitstream-layer PCQA model named streamPCQ-TL. In addition, this work establishes a database named WPC6.0, the first and largest PCQA database dedicated to Trisoup-Lifting encoding mode, encompassing 400 distorted point clouds with both 4 geometric multiplied by 5 texture distortion levels. Experiment results on M-PCCD, ICIP2020 and the proposed WPC6.0 database suggest that the proposed streamPCQ-TL model exhibits robust and notable performance in contrast to existing advanced PCQA metrics, particularly in terms of computational cost. The dataset and source code will be publicly released at \href{this https URL}{\textit{this https URL}}
[35] arXiv:2410.06818 (cross-list from cs.CV) [pdf, other]: Title: An Improved Approach for Cardiac MRI Segmentation based on 3D UNet Combined with Papillary Muscle Exclusion

Narjes Benameur, Ramzi Mahmoudi, Mohamed Deriche, Amira fayouka, Imene Masmoudi, Nessrine Zoghlami

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Left ventricular ejection fraction (LVEF) is the most important clinical parameter of cardiovascular function. The accuracy in estimating this parameter is highly dependent upon the precise segmentation of the left ventricle (LV) structure at the end diastole and systole phases. Therefore, it is crucial to develop robust algorithms for the precise segmentation of the heart structure during different phases. Methodology: In this work, an improved 3D UNet model is introduced to segment the myocardium and LV, while excluding papillary muscles, as per the recommendation of the Society for Cardiovascular Magnetic Resonance. For the practical testing of the proposed framework, a total of 8,400 cardiac MRI images were collected and analysed from the military hospital in Tunis (HMPIT), as well as the popular ACDC public dataset. As performance metrics, we used the Dice coefficient and the F1 score for validation/testing of the LV and the myocardium segmentation. Results: The data was split into 70%, 10%, and 20% for training, validation, and testing, respectively. It is worth noting that the proposed segmentation model was tested across three axis views: basal, medio basal and apical at two different cardiac phases: end diastole and end systole instances. The experimental results showed a Dice index of 0.965 and 0.945, and an F1 score of 0.801 and 0.799, at the end diastolic and systolic phases, respectively. Additionally, clinical evaluation outcomes revealed a significant difference in the LVEF and other clinical parameters when the papillary muscles were included or excluded.
[36] arXiv:2410.06866 (cross-list from cs.CV) [pdf, html, other]: Title: Secure Video Quality Assessment Resisting Adversarial Attacks

Ao-Xiang Zhang, Yu Ran, Weixuan Tang, Yuan-Gen Wang, Qingxiao Guan, Chunsheng Yang

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

The exponential surge in video traffic has intensified the imperative for Video Quality Assessment (VQA). Leveraging cutting-edge architectures, current VQA models have achieved human-comparable accuracy. However, recent studies have revealed the vulnerability of existing VQA models against adversarial attacks. To establish a reliable and practical assessment system, a secure VQA model capable of resisting such malicious attacks is urgently demanded. Unfortunately, no attempt has been made to explore this issue. This paper first attempts to investigate general adversarial defense principles, aiming at endowing existing VQA models with security. Specifically, we first introduce random spatial grid sampling on the video frame for intra-frame defense. Then, we design pixel-wise randomization through a guardian map, globally neutralizing adversarial perturbations. Meanwhile, we extract temporal information from the video sequence as compensation for inter-frame defense. Building upon these principles, we present a novel VQA framework from the security-oriented perspective, termed SecureVQA. Extensive experiments indicate that SecureVQA sets a new benchmark in security while achieving competitive VQA performance compared with state-of-the-art models. Ablation studies delve deeper into analyzing the principles of SecureVQA, demonstrating their generalization and contributions to the security of leading VQA models.

[37] arXiv:2301.06081 (replaced) [pdf, html, other]: Title: Learning to adapt unknown noise for hyperspectral image denoising

Xiangyu Rui, Xiangyong Cao, Jun Shu, Qian Zhao, Deyu Meng

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

For hyperspectral image (HSI) denoising task, the causes of noise embeded in an HSI are typically complex and uncontrollable. Thus, it remains a challenge for model-based HSI denoising methods to handle complex noise. To enhance the noise-handling capabilities of existing model-based methods, we resort to design a general weighted data fidelity term. The weight in this term is used to assess the noise intensity and thus elementwisely adjust the contribution of the observed noisy HSI in a denoising model. The similar concept of "weighting" has been hinted in several methods. Due to the unknown nature of the noise distribution, the implementation of "weighting" in these works are usually achieved via empirical formula for specific denoising method. In this work, we propose to predict the weight by a hyper-weight network (i.e., HWnet). The HWnet is learned exactly from several model-based HSI denoising methods in a bi-level optimization framework based on the data-driven methodology. For a noisy HSI, the learned HWnet outputs its corresponding weight. Then the weighted data fidelity term implemented with the predicted weight can be explicitly combined with a target model-based HSI denoising method. In this way, our HWnet achieves the goal of enhancing the noise adaptation ability of model-based HSI denoising methods for different noisy HSIs. Extensive experiments verify that the proposed HWnet can effecitvely help to improve the ability of an HSI denoising model to handle different complex noises. This further implies that our HWnet could transfer the noise knowledge at the model level and we also study the corresponding generalization theory for simple illustration.
[38] arXiv:2403.03860 (replaced) [pdf, html, other]: Title: ProxNF: Neural Field Proximal Training for High-Resolution 4D Dynamic Image Reconstruction

Luke Lozenski, Refik Mert Cam, Mark D. Pagel, Mark A. Anastasio, Umberto Villa

Comments: 16 pages, 12 figures

Subjects: Image and Video Processing (eess.IV)

Accurate spatiotemporal image reconstruction methods are needed for a wide range of biomedical research areas but face challenges due to data incompleteness and computational burden. Data incompleteness arises from the undersampling often required to increase frame rates, while computational burden emerges due to the memory footprint of high-resolution images with three spatial dimensions and extended time horizons. Neural fields (NFs), an emerging class of neural networks that act as continuous representations of spatiotemporal objects, have previously been introduced to solve these dynamic imaging problems by reframing image reconstruction as a problem of estimating network parameters. Neural fields can address the twin challenges of data incompleteness and computational burden by exploiting underlying redundancies in these spatiotemporal objects. This work proposes ProxNF, a novel neural field training approach for spatiotemporal image reconstruction leveraging proximal splitting methods to separate computations involving the imaging operator from updates of the network parameters. Specifically, ProxNF evaluates the (subsampled) gradient of the data-fidelity term in the image domain and uses a fully supervised learning approach to update the neural field parameters. This method is demonstrated in two numerical phantom studies and an in-vivo application to tumor perfusion imaging in small animal models using dynamic contrast-enhanced photoacoustic computed tomography (DCE PACT).
[39] arXiv:2403.09136 (replaced) [pdf, html, other]: Title: Biophysics Informed Pathological Regularisation for Brain Tumour Segmentation

Lipei Zhang, Yanqi Cheng, Lihao Liu, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero

Comments: 11 pages, 4 figures and 1 table. Accepted by MICCAI2024

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Recent advances in deep learning have significantly improved brain tumour segmentation techniques; however, the results still lack confidence and robustness as they solely consider image data without biophysical priors or pathological information. Integrating biophysics-informed regularisation is one effective way to change this situation, as it provides an prior regularisation for automated end-to-end learning. In this paper, we propose a novel approach that designs brain tumour growth Partial Differential Equation (PDE) models as a regularisation with deep learning, operational with any network model. Our method introduces tumour growth PDE models directly into the segmentation process, improving accuracy and robustness, especially in data-scarce scenarios. This system estimates tumour cell density using a periodic activation function. By effectively integrating this estimation with biophysical models, we achieve better capture of tumour characteristics. This approach not only aligns the segmentation closer to actual biological behaviour but also strengthens the model's performance under limited data conditions. We demonstrate the effectiveness of our framework through extensive experiments on the BraTS 2023 dataset, showcasing significant improvements in both precision and reliability of tumour segmentation.
[40] arXiv:2403.11001 (replaced) [pdf, html, other]: Title: Topologically Faithful Multi-class Segmentation in Medical Images

Alexander H. Berger, Nico Stucki, Laurin Lux, Vincent Buergin, Suprosanna Shit, Anna Banaszak, Daniel Rueckert, Ulrich Bauer, Johannes C. Paetzold

Journal-ref: MICCAI 2024, Lecture Notes in Computer Science, vol. 15008, pp. 721-731, 2024

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Topological accuracy in medical image segmentation is a highly important property for downstream applications such as network analysis and flow modeling in vessels or cell counting. Recently, significant methodological advancements have brought well-founded concepts from algebraic topology to binary segmentation. However, these approaches have been underexplored in multi-class segmentation scenarios, where topological errors are common. We propose a general loss function for topologically faithful multi-class segmentation extending the recent Betti matching concept, which is based on induced matchings of persistence barcodes. We project the N-class segmentation problem to N single-class segmentation tasks, which allows us to use 1-parameter persistent homology, making training of neural networks computationally feasible. We validate our method on a comprehensive set of four medical datasets with highly variant topological characteristics. Our loss formulation significantly enhances topological correctness in cardiac, cell, artery-vein, and Circle of Willis segmentation.
[41] arXiv:2404.08580 (replaced) [pdf, html, other]: Title: Lossy Image Compression with Foundation Diffusion Models

Lucas Relic, Roberto Azevedo, Markus Gross, Christopher Schroers

Comments: Presented at ECCV 2024. This version of the contribution has been accepted for publication, after peer review (when applicable) but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: this https URL

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Incorporating diffusion models in the image compression domain has the potential to produce realistic and detailed reconstructions, especially at extremely low bitrates. Previous methods focus on using diffusion models as expressive decoders robust to quantization errors in the conditioning signals, yet achieving competitive results in this manner requires costly training of the diffusion model and long inference times due to the iterative generative process. In this work we formulate the removal of quantization error as a denoising task, using diffusion to recover lost information in the transmitted image latent. Our approach allows us to perform less than 10% of the full diffusion generative process and requires no architectural changes to the diffusion model, enabling the use of foundation models as a strong prior without additional fine tuning of the backbone. Our proposed codec outperforms previous methods in quantitative realism metrics, and we verify that our reconstructions are qualitatively preferred by end users, even when other methods use twice the bitrate.
[42] arXiv:2409.13442 (replaced) [pdf, other]: Title: Automatic Classification of White Blood Cell Images using Convolutional Neural Network

Rabia Asghar, Arslan Shaukat, Usman Akram, Rimsha Tariq

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Human immune system contains white blood cells (WBC) that are good indicator of many diseases like bacterial infections, AIDS, cancer, spleen, etc. White blood cells have been sub classified into four types: monocytes, lymphocytes, eosinophils and neutrophils on the basis of their nucleus, shape and cytoplasm. Traditionally in laboratories, pathologists and hematologists analyze these blood cells through microscope and then classify them manually. This manual process takes more time and increases the chance of human error. Hence, there is a need to automate this process. In this paper, first we have used different CNN pre-train models such as ResNet-50, InceptionV3, VGG16 and MobileNetV2 to automatically classify the white blood cells. These pre-train models are applied on Kaggle dataset of microscopic images. Although we achieved reasonable accuracy ranging between 92 to 95%, still there is need to enhance the performance. Hence, inspired by these architectures, a framework has been proposed to automatically categorize the four kinds of white blood cells with increased accuracy. The aim is to develop a convolution neural network (CNN) based classification system with decent generalization ability. The proposed CNN model has been tested on white blood cells images from Kaggle and LISC datasets. Accuracy achieved is 99.57% and 98.67% for both datasets respectively. Our proposed convolutional neural network-based model provides competitive performance as compared to previous results reported in literature.
[43] arXiv:2409.14090 (replaced) [pdf, html, other]: Title: Window-based Channel Attention for Wavelet-enhanced Learned Image Compression

Heng Xu, Bowen Hai, Yushun Tang, Zhihai He

Comments: ACCV2024 accepted; camera-ready version

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Learned Image Compression (LIC) models have achieved superior rate-distortion performance than traditional codecs. Existing LIC models use CNN, Transformer, or Mixed CNN-Transformer as basic blocks. However, limited by the shifted window attention, Swin-Transformer-based LIC exhibits a restricted growth of receptive fields, affecting the ability to model large objects for image compression. To address this issue and improve the performance, we incorporate window partition into channel attention for the first time to obtain large receptive fields and capture more global information. Since channel attention hinders local information learning, it is important to extend existing attention mechanisms in Transformer codecs to the space-channel attention to establish multiple receptive fields, being able to capture global correlations with large receptive fields while maintaining detailed characterization of local correlations with small receptive fields. We also incorporate the discrete wavelet transform into our Spatial-Channel Hybrid (SCH) framework for efficient frequency-dependent down-sampling and further enlarging receptive fields. Experiment results demonstrate that our method achieves state-of-the-art performances, reducing BD-rate by 18.54%, 23.98%, 22.33%, and 24.71% on four standard datasets compared to VTM-23.1.
[44] arXiv:2410.02550 (replaced) [pdf, html, other]: Title: NestedMorph: Enhancing Deformable Medical Image Registration with Nested Attention Mechanisms

Gurucharan Marthi Krishna Kumar, Janine Mendola, Amir Shmuel

Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Deformable image registration is crucial for aligning medical images in a non-linear fashion across different modalities, allowing for precise spatial correspondence between varying anatomical structures. This paper presents NestedMorph, a novel network utilizing a Nested Attention Fusion approach to improve intra-subject deformable registration between T1-weighted (T1w) MRI and diffusion MRI (dMRI) data. NestedMorph integrates high-resolution spatial details from an encoder with semantic information from a decoder using a multi-scale framework, enhancing both local and global feature extraction. Our model notably outperforms existing methods, including CNN-based approaches like VoxelMorph, MIDIR, and CycleMorph, as well as Transformer-based models such as TransMorph and ViT-V-Net, and traditional techniques like NiftyReg and SyN. Evaluations on the HCP dataset demonstrate that NestedMorph achieves superior performance across key metrics, including SSIM, HD95, and SDlogJ, with the highest SSIM of 0.89, and the lowest HD95 of 2.5 and SDlogJ of 0.22. These results highlight NestedMorph's ability to capture both local and global image features effectively, leading to superior registration performance. The promising outcomes of this study underscore NestedMorph's potential to significantly advance deformable medical image registration, providing a robust framework for future research and clinical applications. The source code and our implementation are available at: this https URL
[45] arXiv:2307.09857 (replaced) [pdf, other]: Title: Blind Image Quality Assessment Using Multi-Stream Architecture with Spatial and Channel Attention

Muhammad Azeem Aslam, Xu Wei, Hassan Khalid, Nisar Ahmed, Zhu Shuangtong, Xin Liu, Yimei Xu

Comments: Updated article is submitted in Springer Nature scientific reports and will be published

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

BIQA (Blind Image Quality Assessment) is an important field of study that evaluates images automatically. Although significant progress has been made, blind image quality assessment remains a difficult task since images vary in content and distortions. Most algorithms generate quality without emphasizing the important region of interest. In order to solve this, a multi-stream spatial and channel attention-based algorithm is being proposed. This algorithm generates more accurate predictions with a high correlation to human perceptual assessment by combining hybrid features from two different backbones, followed by spatial and channel attention to provide high weights to the region of interest. Four legacy image quality assessment datasets are used to validate the effectiveness of our proposed approach. Authentic and synthetic distortion image databases are used to demonstrate the effectiveness of the proposed method, and we show that it has excellent generalization properties with a particular focus on the perceptual foreground information.
[46] arXiv:2312.15817 (replaced) [pdf, html, other]: Title: A Unified Generative Framework for Realistic Lidar Simulation in Autonomous Driving Systems

Hamed Haghighi, Mehrdad Dianati, Valentina Donzella, Kurt Debattista

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)

Simulation models for perception sensors are integral components of automotive simulators used for the virtual Verification and Validation (V\&V) of Autonomous Driving Systems (ADS). These models also serve as powerful tools for generating synthetic datasets to train deep learning-based perception models. Lidar is a widely used sensor type among the perception sensors for ADS due to its high precision in 3D environment scanning. However, developing realistic Lidar simulation models is a significant technical challenge. In particular, unrealistic models can result in a large gap between the synthesised and real-world point clouds, limiting their effectiveness in ADS applications. Recently, deep generative models have emerged as promising solutions to synthesise realistic sensory data. However, for Lidar simulation, deep generative models have been primarily hybridised with conventional algorithms, leaving unified generative approaches largely unexplored in the literature. Motivated by this research gap, we propose a unified generative framework to enhance Lidar simulation fidelity. Our proposed framework projects Lidar point clouds into depth-reflectance images via a lossless transformation, and employs our novel Controllable Lidar point cloud Generative model, CoLiGen, to translate the images. We extensively evaluate our CoLiGen model, comparing it with the state-of-the-art image-to-image translation models using various metrics to assess the realness, faithfulness, and performance of a downstream perception model. Our results show that CoLiGen exhibits superior performance across most metrics. The dataset and source code for this research are available at this https URL.
[47] arXiv:2405.07777 (replaced) [pdf, html, other]: Title: GMSR:Gradient-Guided Mamba for Spectral Reconstruction from RGB Images

Xinying Wang, Zhixiong Huang, Sifan Zhang, Jiawen Zhu, Paolo Gamba, Lin Feng

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Mainstream approaches to spectral reconstruction (SR) primarily focus on designing Convolution- and Transformer-based architectures. However, CNN methods often face challenges in handling long-range dependencies, whereas Transformers are constrained by computational efficiency limitations. Recent breakthroughs in state-space model (e.g., Mamba) has attracted significant attention due to its near-linear computational efficiency and superior performance, prompting our investigation into its potential for SR problem. To this end, we propose the Gradient-guided Mamba for Spectral Reconstruction from RGB Images, dubbed GMSR-Net. GMSR-Net is a lightweight model characterized by a global receptive field and linear computational complexity. Its core comprises multiple stacked Gradient Mamba (GM) blocks, each featuring a tri-branch structure. In addition to benefiting from efficient global feature representation by Mamba block, we further innovatively introduce spatial gradient attention and spectral gradient attention to guide the reconstruction of spatial and spectral cues. GMSR-Net demonstrates a significant accuracy-efficiency trade-off, achieving state-of-the-art performance while markedly reducing the number of parameters and computational burdens. Compared to existing approaches, GMSR-Net slashes parameters and FLOPS by substantial margins of 10 times and 20 times, respectively. Code is available at this https URL.

Total of 47 entries

Showing up to 2000 entries per page: fewer | more | all

Image and Video Processing

Showing new listings for Thursday, 10 October 2024

New submissions (showing 21 of 21 entries)

Cross submissions (showing 15 of 15 entries)

Replacement submissions (showing 11 of 11 entries)