
Showing 1–50 of 218 results for author: Khan, F S

  1. arXiv:2407.13772  [pdf, other]

    cs.CV

    GroupMamba: Parameter-Efficient and Accurate Group Visual State Space Model

    Authors: Abdelrahman Shaker, Syed Talal Wasim, Salman Khan, Juergen Gall, Fahad Shahbaz Khan

    Abstract: Recent advancements in state-space models (SSMs) have showcased effective performance in modeling long-range dependencies with subquadratic complexity. However, pure SSM-based models still face challenges related to stability and achieving optimal performance on computer vision tasks. Our paper addresses the challenges of scaling SSM-based models for computer vision, particularly the instability a…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Preprint. Our code and models are available at: https://github.com/Amshaker/GroupMamba

  2. arXiv:2407.13157  [pdf, other]

    cs.CV cs.AI

    Learning Camouflaged Object Detection from Noisy Pseudo Label

    Authors: Jin Zhang, Ruiheng Zhang, Yanjiao Shi, Zhe Cao, Nian Liu, Fahad Shahbaz Khan

    Abstract: Existing Camouflaged Object Detection (COD) methods rely heavily on large-scale pixel-annotated training sets, which are both time-consuming and labor-intensive. Although weakly supervised methods offer higher annotation efficiency, their performance is far behind due to the unclear visual demarcations between foreground and background in camouflaged images. In this paper, we explore the potential…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV2024

  3. arXiv:2406.15556  [pdf, other]

    cs.CV

    Open-Vocabulary Temporal Action Localization using Multimodal Guidance

    Authors: Akshita Gupta, Aditya Arora, Sanath Narayan, Salman Khan, Fahad Shahbaz Khan, Graham W. Taylor

    Abstract: Open-Vocabulary Temporal Action Localization (OVTAL) enables a model to recognize any desired action category in videos without the need to explicitly curate training data for all categories. However, this flexibility poses significant challenges, as the model must recognize not only the action categories seen during training but also novel categories specified at inference. Unlike standard tempor…

    Submitted 21 June, 2024; originally announced June 2024.

  4. arXiv:2406.10326  [pdf, other]

    cs.CV

    VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs

    Authors: Rohit Bharadwaj, Hanan Gani, Muzammal Naseer, Fahad Shahbaz Khan, Salman Khan

    Abstract: The recent developments in Large Multi-modal Video Models (Video-LMMs) have significantly enhanced our ability to interpret and analyze video data. Despite their impressive capabilities, current Video-LMMs have not been evaluated for anomaly detection tasks, which is critical to their deployment in practical scenarios e.g., towards identifying deepfakes, manipulated video content, traffic accident…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Data: https://huggingface.co/datasets/rohit901/VANE-Bench

  5. arXiv:2406.09407  [pdf, other]

    cs.CV

    Towards Evaluating the Robustness of Visual State Space Models

    Authors: Hashmat Shadab Malik, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar, Fahad Shahbaz Khan, Salman Khan

    Abstract: Vision State Space Models (VSSMs), a novel architecture that combines the strengths of recurrent neural networks and latent variable models, have demonstrated remarkable performance in visual perception tasks by efficiently capturing long-range dependencies and modeling complex visual dynamics. However, their robustness under natural and adversarial perturbations remains a critical concern. In thi…

    Submitted 13 June, 2024; originally announced June 2024.

  6. arXiv:2406.08486  [pdf, other]

    eess.IV cs.CV

    On Evaluating Adversarial Robustness of Volumetric Medical Segmentation Models

    Authors: Hashmat Shadab Malik, Numan Saeed, Asif Hanif, Muzammal Naseer, Mohammad Yaqub, Salman Khan, Fahad Shahbaz Khan

    Abstract: Volumetric medical segmentation models have achieved significant success on organ and tumor-based segmentation tasks in recent years. However, their vulnerability to adversarial attacks remains largely unexplored, raising serious concerns regarding the real-world deployment of tools employing such models in the healthcare sector. This underscores the importance of investigating the robustness of e…

    Submitted 12 June, 2024; originally announced June 2024.

  7. arXiv:2406.04844  [pdf, other]

    cs.CV

    Multi-Granularity Language-Guided Multi-Object Tracking

    Authors: Yuhao Li, Muzammal Naseer, Jiale Cao, Yu Zhu, Jinqiu Sun, Yanning Zhang, Fahad Shahbaz Khan

    Abstract: Most existing multi-object tracking methods typically learn visual tracking features via maximizing dis-similarities of different instances and minimizing similarities of the same instance. While such a feature learning scheme achieves promising performance, learning discriminative features solely based on visual information is challenging especially in case of environmental interference such as o…

    Submitted 7 June, 2024; originally announced June 2024.

  8. arXiv:2406.02548  [pdf, other]

    cs.CV

    Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation

    Authors: Mohamed El Amine Boudjoghra, Angela Dai, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan

    Abstract: Recent works on open-vocabulary 3D instance segmentation show strong promise, but at the cost of slow inference speed and high computation requirements. This high computation cost is typically due to their heavy reliance on 3D clip features, which require computationally expensive 2D foundation models like Segment Anything (SAM) and CLIP for multi-view aggregation into 3D. As a consequence, this h…

    Submitted 20 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

  9. arXiv:2406.00449  [pdf, other]

    eess.IV cs.CV

    Dual Hyperspectral Mamba for Efficient Spectral Compressive Imaging

    Authors: Jiahua Dong, Hui Yin, Hongliu Li, Wenbo Li, Yulun Zhang, Salman Khan, Fahad Shahbaz Khan

    Abstract: Deep unfolding methods have made impressive progress in restoring 3D hyperspectral images (HSIs) from 2D measurements through convolution neural networks or Transformers in spectral compressive imaging. However, they cannot efficiently capture long-range dependencies using global receptive fields, which significantly limits their performance in HSI reconstruction. Moreover, these methods may suffe…

    Submitted 1 June, 2024; originally announced June 2024.

    Comments: 13 pages, 6 figures

  10. arXiv:2405.13278  [pdf, other]

    cs.CV physics.med-ph

    Single color virtual H&E staining with In-and-Out Net

    Authors: Mengkun Chen, Yen-Tung Liu, Fadeel Sher Khan, Matthew C. Fox, Jason S. Reichenberg, Fabiana C. P. S. Lopes, Katherine R. Sebastian, Mia K. Markey, James W. Tunnell

    Abstract: Virtual staining streamlines traditional staining procedures by digitally generating stained images from unstained or differently stained images. While conventional staining methods involve time-consuming chemical processes, virtual staining offers an efficient and low infrastructure alternative. Leveraging microscopy-based techniques, such as confocal microscopy, researchers can expedite tissue a…

    Submitted 21 May, 2024; originally announced May 2024.

  11. arXiv:2405.03690  [pdf, other]

    cs.CV

    How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

    Authors: Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Jameel Hassan, Muzammal Naseer, Federico Tombari, Fahad Shahbaz Khan, Salman Khan

    Abstract: Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks. These models have the potential to be deployed in real-world applications such as robotics, AI assistants, medical surgery, and autonomous vehicles. The widespread adoption of Video-LMMs in our daily lives undersco…

    Submitted 8 May, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

    Comments: Technical report

  12. arXiv:2404.14808  [pdf, other]

    cs.CV

    Visual-Augmented Dynamic Semantic Prototype for Generative Zero-Shot Learning

    Authors: Wenjin Hou, Shiming Chen, Shuhuang Chen, Ziming Hong, Yan Wang, Xuetao Feng, Salman Khan, Fahad Shahbaz Khan, Xinge You

    Abstract: Generative Zero-shot learning (ZSL) learns a generator to synthesize visual samples for unseen classes, which is an effective way to advance ZSL. However, existing generative methods rely on the conditions of Gaussian noise and the predefined semantic prototype, which limit the generator only optimized on specific seen classes rather than characterizing each visual instance, resulting in poor gene…

    Submitted 23 April, 2024; originally announced April 2024.

  13. arXiv:2404.10146  [pdf, ps, other]

    cs.CV

    Cross-Modal Self-Training: Aligning Images and Pointclouds to Learn Classification without Labels

    Authors: Amaya Dharmasiri, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

    Abstract: Large-scale 2D vision language models, such as CLIP, can be aligned with a 3D encoder to learn generalizable (open-vocabulary) 3D vision models. However, current methods require supervised pre-training for such alignment, and the performance of such 3D zero-shot models remains sub-optimal for real-world adaptation. In this work, we propose an optimization framework: Cross-MoST: Cross-Modal S…

    Submitted 15 April, 2024; originally announced April 2024.

    Comments: To be published in Workshop for Learning 3D with Multi-View Supervision (3DMV) at CVPR 2024

  14. arXiv:2404.07713  [pdf, other]

    cs.CV cs.LG

    Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning

    Authors: Shiming Chen, Wenjin Hou, Salman Khan, Fahad Shahbaz Khan

    Abstract: Zero-shot learning (ZSL) recognizes the unseen classes by conducting visual-semantic interactions to transfer semantic knowledge from seen classes to unseen ones, supported by semantic information (e.g., attributes). However, existing ZSL methods simply extract visual features using a pre-trained network backbone (i.e., CNN or ViT), which fail to learn matched visual-semantic correspondences for r…

    Submitted 22 July, 2024; v1 submitted 11 April, 2024; originally announced April 2024.

    Comments: Accepted to CVPR'24

  15. arXiv:2404.02154  [pdf, other]

    cs.CV

    Dynamic Pre-training: Towards Efficient and Scalable All-in-One Image Restoration

    Authors: Akshay Dudhane, Omkar Thawakar, Syed Waqas Zamir, Salman Khan, Fahad Shahbaz Khan, Ming-Hsuan Yang

    Abstract: All-in-one image restoration tackles different types of degradations with a unified model instead of having task-specific, non-generic models for each degradation. The requirement to tackle multiple degradations using the same model can lead to high-complexity designs with fixed configuration that lack the adaptability to more efficient alternatives. We propose DyNet, a dynamic family of networks…

    Submitted 2 April, 2024; originally announced April 2024.

  16. arXiv:2404.01272  [pdf, other]

    cs.CV

    Language Guided Domain Generalized Medical Image Segmentation

    Authors: Shahina Kunhimon, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

    Abstract: Single source domain generalization (SDG) holds promise for more reliable and consistent image segmentation across real-world clinical settings particularly in the medical domain, where data privacy and acquisition cost constraints often limit the availability of diverse datasets. Depending solely on visual features hampers the model's capacity to adapt effectively to various domains, primarily be…

    Submitted 3 April, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

    Comments: Accepted at ISBI2024

  17. arXiv:2403.17937  [pdf, other]

    cs.CV

    Efficient Video Object Segmentation via Modulated Cross-Attention Memory

    Authors: Abdelrahman Shaker, Syed Talal Wasim, Martin Danelljan, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan

    Abstract: Recently, transformer-based approaches have shown promising results for semi-supervised video object segmentation. However, these approaches typically struggle on long videos due to increased GPU memory demands, as they frequently expand the memory bank every few frames. We propose a transformer-based approach, named MAVOS, that introduces an optimized and dynamic long-term modulated cross-attenti…

    Submitted 26 March, 2024; originally announced March 2024.

  18. arXiv:2403.17909  [pdf, other]

    cs.CV

    ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection

    Authors: Mubashir Noman, Mustansar Fiaz, Hisham Cholakkal, Salman Khan, Fahad Shahbaz Khan

    Abstract: Deep learning has shown remarkable success in remote sensing change detection (CD), aiming to identify semantic change regions between co-registered satellite image pairs acquired at distinct time stamps. However, existing convolutional neural network and transformer-based frameworks often struggle to accurately segment semantic change regions. Moreover, transformers-based methods with standard se…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: accepted at IEEE TGRS

  19. arXiv:2403.16997  [pdf, other]

    cs.CV

    Composed Video Retrieval via Enriched Context and Discriminative Embeddings

    Authors: Omkar Thawakar, Muzammal Naseer, Rao Muhammad Anwer, Salman Khan, Michael Felsberg, Mubarak Shah, Fahad Shahbaz Khan

    Abstract: Composed video retrieval (CoVR) is a challenging problem in computer vision which has recently highlighted the integration of modification text with visual queries for more sophisticated video search in large databases. Existing works predominantly rely on visual queries combined with modification text to distinguish relevant videos. However, such a strategy struggles to fully preserve the rich qu…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: CVPR-2024

  20. arXiv:2403.14743  [pdf, other]

    cs.CV

    VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding

    Authors: Ahmad Mahmood, Ashmal Vayani, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

    Abstract: Recent studies have demonstrated the effectiveness of Large Language Models (LLMs) as reasoning modules that can deconstruct complex tasks into more manageable sub-tasks, particularly when applied to visual reasoning tasks for images. In contrast, this paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of LLMs. Ours is a novel approach to extend the…

    Submitted 24 March, 2024; v1 submitted 21 March, 2024; originally announced March 2024.

  21. arXiv:2403.14616  [pdf, other]

    cs.CV

    Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning

    Authors: Hasindri Watawana, Kanchana Ranasinghe, Tariq Mahmood, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

    Abstract: Self-supervised representation learning has been highly promising for histopathology image analysis with numerous approaches leveraging their patient-slide-patch hierarchy to learn better representations. In this paper, we explore how the combination of domain specific natural language information with such hierarchical visual representations can benefit rich representation learning for medical im…

    Submitted 21 March, 2024; originally announced March 2024.

    Comments: 13 pages and 5 figures

  22. arXiv:2403.14614  [pdf, other]

    cs.CV

    AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation

    Authors: Yuning Cui, Syed Waqas Zamir, Salman Khan, Alois Knoll, Mubarak Shah, Fahad Shahbaz Khan

    Abstract: In the image acquisition process, various forms of degradation, including noise, haze, and rain, are frequently introduced. These degradations typically arise from the inherent limitations of cameras or unfavorable ambient conditions. To recover clean images from degraded versions, numerous specialized restoration methods have been developed, each targeting a specific type of degradation. Recently…

    Submitted 21 March, 2024; originally announced March 2024.

    Comments: 28 pages, 15 figures

  23. arXiv:2403.05419  [pdf, other]

    cs.CV

    Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery

    Authors: Mubashir Noman, Muzammal Naseer, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan

    Abstract: Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks by pre-training on large amount of unlabelled data. Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amount of unlabelled data. Different from standard natural image datasets, remote…

    Submitted 8 March, 2024; originally announced March 2024.

    Comments: Accepted at CVPR 2024

  24. arXiv:2403.04701  [pdf, other]

    cs.CV cs.AI

    ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes

    Authors: Hashmat Shadab Malik, Muhammad Huzaifa, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

    Abstract: Given the large-scale multi-modal training of recent vision-based models and their generalization capabilities, understanding the extent of their robustness is critical for their real-world deployment. In this work, we evaluate the resilience of current vision-based models against diverse object-to-background context variations. The majority of robustness evaluation methods have introduced synthet…

    Submitted 26 March, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

  25. arXiv:2403.04306  [pdf, other]

    cs.CV cs.AI cs.LG

    Effectiveness Assessment of Recent Large Vision-Language Models

    Authors: Yao Jiang, Xinyu Yan, Ge-Peng Ji, Keren Fu, Meijun Sun, Huan Xiong, Deng-Ping Fan, Fahad Shahbaz Khan

    Abstract: The advent of large vision-language models (LVLMs) represents a remarkable advance in the quest for artificial general intelligence. However, the model's effectiveness in both specialized and general tasks warrants further investigation. This paper endeavors to evaluate the competency of popular LVLMs in specialized and general tasks, respectively, aiming to offer a comprehensive understanding of…

    Submitted 11 June, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

    Comments: Accepted by Visual Intelligence

  26. arXiv:2402.16840  [pdf, other]

    cs.CL

    MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT

    Authors: Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan

    Abstract: "Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development. However, LLMs do not suit well for scenarios that require on-device processing, energy efficiency, low memory footprint, and response efficiency. These requisites are crucial for privacy, security, and sustainable deployment. This paper explores the "less is more" paradigm by addressing the chall…

    Submitted 26 February, 2024; originally announced February 2024.

    Comments: Code available at: https://github.com/mbzuai-oryx/MobiLlama

  27. Semi-supervised Open-World Object Detection

    Authors: Sahal Shaji Mullappilly, Abhishek Singh Gehlot, Rao Muhammad Anwer, Fahad Shahbaz Khan, Hisham Cholakkal

    Abstract: Conventional open-world object detection (OWOD) problem setting first distinguishes known and unknown classes and then later incrementally learns the unknown objects when introduced with labels in the subsequent tasks. However, the current OWOD formulation heavily relies on the external human oracle for knowledge input during the incremental learning stages. Such reliance on run-time makes this fo…

    Submitted 25 February, 2024; originally announced February 2024.

    Comments: Accepted to AAAI 2024 (Main Track)

    Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence 2024

  28. arXiv:2402.14818  [pdf, other]

    cs.CL cs.CV

    PALO: A Polyglot Large Multimodal Model for 5B People

    Authors: Muhammad Maaz, Hanoona Rasheed, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Tim Baldwin, Michael Felsberg, Fahad S. Khan

    Abstract: In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called PALO. PALO offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese, that span a total of ~5B people (65% of the world population). Our approach involves a semi-automated tr…

    Submitted 5 March, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

    Comments: Technical Report of PALO

  29. arXiv:2402.13253  [pdf, other]

    cs.CL

    BiMediX: Bilingual Medical Mixture of Experts LLM

    Authors: Sara Pieri, Sahal Shaji Mullappilly, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan, Timothy Baldwin, Hisham Cholakkal

    Abstract: In this paper, we introduce BiMediX, the first bilingual medical mixture of experts LLM designed for seamless interaction in both English and Arabic. Our model facilitates a wide range of medical interactions in English and Arabic, including multi-turn chats to inquire about additional details such as patient symptoms and medical history, multiple-choice question answering, and open-ended question…

    Submitted 20 February, 2024; originally announced February 2024.

  30. arXiv:2402.05375  [pdf, other]

    cs.CV

    Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models

    Authors: Senmao Li, Joost van de Weijer, Taihang Hu, Fahad Shahbaz Khan, Qibin Hou, Yaxing Wang, Jian Yang

    Abstract: The success of recent text-to-image diffusion models is largely due to their capacity to be guided by a complex text prompt, which enables users to precisely describe the desired content. However, these models struggle to effectively suppress the generation of undesired content, which is explicitly requested to be omitted from the generated image in the prompt. In this paper, we analyze how to man…

    Submitted 7 February, 2024; originally announced February 2024.

    Comments: ICLR 2024. Our code is available at https://github.com/sen-mao/SuppressEOT

  31. arXiv:2401.00901  [pdf, other]

    cs.CV

    Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding

    Authors: Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan

    Abstract: Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query. This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task. Unlike prevalent closed-set approaches that struggle with open-vocabulary scenarios due to limited training data and predefined vocabula…

    Submitted 29 March, 2024; v1 submitted 31 December, 2023; originally announced January 2024.

  32. arXiv:2312.09608  [pdf, other]

    cs.CV

    Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models

    Authors: Senmao Li, Taihang Hu, Fahad Shahbaz Khan, Linxuan Li, Shiqi Yang, Yaxing Wang, Ming-Ming Cheng, Jian Yang

    Abstract: One of the key components within diffusion models is the UNet for noise prediction. While several works have explored basic properties of the UNet decoder, its encoder largely remains unexplored. In this work, we conduct the first comprehensive study of the UNet encoder. We empirically analyze the encoder features and provide insights to important questions regarding their changes at the inference…

    Submitted 15 December, 2023; originally announced December 2023.

  33. Arabic Mini-ClimateGPT: A Climate Change and Sustainability Tailored Arabic LLM

    Authors: Sahal Shaji Mullappilly, Abdelrahman Shaker, Omkar Thawakar, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan

    Abstract: Climate change is one of the most significant challenges we face together as a society. Creating awareness and educating policy makers about the wide-ranging impact of climate change is an essential step towards a sustainable future. Recently, Large Language Models (LLMs) like ChatGPT and Bard have shown impressive conversational abilities and excel in a wide variety of NLP tasks. While these models are…

    Submitted 14 December, 2023; originally announced December 2023.

    Comments: Accepted to EMNLP 2023 (Findings)

    Journal ref: Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14126-14136

  34. arXiv:2311.15826  [pdf, other]

    cs.CV cs.AI

    GeoChat: Grounded Large Vision-Language Model for Remote Sensing

    Authors: Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Das, Salman Khan, Fahad Shahbaz Khan

    Abstract: Recent advancements in Large Vision-Language Models (VLMs) have shown great promise in natural image domains, allowing users to hold a dialogue about given visual content. However, such general-domain VLMs perform poorly for Remote Sensing (RS) scenarios, leading to inaccurate or fabricated information when presented with RS domain-specific queries. Such a behavior emerges due to the unique challe…

    Submitted 24 November, 2023; originally announced November 2023.

    Comments: 10 pages, 4 figures

  35. arXiv:2311.15537  [pdf, other]

    cs.CV

    SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation

    Authors: Bin Xie, Jiale Cao, Jin Xie, Fahad Shahbaz Khan, Yanwei Pang

    Abstract: Open-vocabulary semantic segmentation strives to distinguish pixels into different semantic groups from an open set of categories. Most existing methods explore utilizing pre-trained vision-language models, in which the key is to adopt the image-level model for pixel-level segmentation task. In this paper, we propose a simple encoder-decoder, named SED, for open-vocabulary semantic segmentation, w…

    Submitted 27 February, 2024; v1 submitted 27 November, 2023; originally announced November 2023.

    Comments: Accepted by CVPR2024

  36. arXiv:2311.12068  [pdf, other]

    cs.CV cs.AI cs.LG

    Enhancing Novel Object Detection via Cooperative Foundational Models

    Authors: Rohit Bharadwaj, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

    Abstract: In this work, we address the challenging and emergent problem of novel object detection (NOD), focusing on the accurate detection of both known and novel object categories during inference. Traditional object detection algorithms are inherently closed-set, limiting their capability to handle NOD. We present a novel approach to transform existing closed-set detectors into open-set detectors. This t…

    Submitted 21 November, 2023; v1 submitted 19 November, 2023; originally announced November 2023.

    Comments: Code: https://github.com/rohit901/cooperative-foundational-models

  37. arXiv:2311.03570  [pdf, other]

    cs.CV

    Cal-DETR: Calibrated Detection Transformer

    Authors: Muhammad Akhtar Munir, Salman Khan, Muhammad Haris Khan, Mohsen Ali, Fahad Shahbaz Khan

    Abstract: Albeit revealing impressive predictive performance for several computer vision tasks, deep neural networks (DNNs) are prone to making overconfident predictions. This limits the adoption and wider utilization of DNNs in many safety-critical applications. There have been recent efforts toward calibrating DNNs, however, almost all of them focus on the classification task. Surprisingly, very little at…

    Submitted 6 November, 2023; originally announced November 2023.

    Comments: Accepted at NeurIPS 2023

  38. arXiv:2311.03356  [pdf, other]

    cs.CV cs.AI

    GLaMM: Pixel Grounding Large Multimodal Model

    Authors: Hanoona Rasheed, Muhammad Maaz, Sahal Shaji Mullappilly, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, Fahad S. Khan

    Abstract: Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual responses. Recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to only referring to a single object category at a time, require users to specify the regions, or cannot offer dens…

    Submitted 1 June, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

    Comments: CVPR 2024

  39. arXiv:2311.01459  [pdf, other]

    cs.CV

    Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization

    Authors: Jameel Hassan, Hanan Gani, Noor Hussein, Muhammad Uzair Khattak, Muzammal Naseer, Fahad Shahbaz Khan, Salman Khan

    Abstract: The promising zero-shot generalization of vision-language models such as CLIP has led to their adoption using prompt learning for numerous downstream tasks. Previous works have shown test-time prompt tuning using entropy minimization to adapt text prompts for unseen domains. While effective, this overlooks the key cause for performance degradation to unseen domains -- distribution shift. In this w…

    Submitted 10 January, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

    Comments: Accepted to NeurIPS 2023

  40. arXiv:2310.15324  [pdf, other]

    cs.CV

    Videoprompter: an ensemble of foundational models for zero-shot video understanding

    Authors: Adeel Yousaf, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, Mubarak Shah

    Abstract: Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations. Recently, large language models (LLMs) have been used to enrich the text-based class labels by enhancing the descriptiveness of the class names. However, these improvements are restricted to the text-based classifier only, and the query vi…

    Submitted 23 October, 2023; originally announced October 2023.

  41. arXiv:2309.11160  [pdf, other]

    cs.CV

    Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation

    Authors: Nian Liu, Kepan Nan, Wangbo Zhao, Yuanwei Liu, Xiwen Yao, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer, Junwei Han, Fahad Shahbaz Khan

    Abstract: Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video with the same category defined by a few annotated support images. However, this task was seldom explored. In this work, based on IPMT, a state-of-the-art few-shot image segmentation method that combines external support guidance information with adaptive query guidance cues, we propose to leverage multi-grained tem…

    Submitted 20 September, 2023; originally announced September 2023.

    Comments: ICCV 2023

  42. arXiv:2309.10518  [pdf, other]

    cs.CV

    Unsupervised Landmark Discovery Using Consistency Guided Bottleneck

    Authors: Mamona Awan, Muhammad Haris Khan, Sanoojan Baliah, Muhammad Ahmad Waseem, Salman Khan, Fahad Shahbaz Khan, Arif Mahmood

    Abstract: We study a challenging problem of unsupervised discovery of object landmarks. Many recent methods rely on bottlenecks to generate 2D Gaussian heatmaps; however, these are limited in generating informed heatmaps while training, presumably due to the lack of effective structural cues. Also, it is assumed that all predicted landmarks are semantically relevant despite having no ground truth supervision…

    Submitted 19 September, 2023; originally announced September 2023.

    Comments: Accepted ORAL at BMVC 2023; Code: https://github.com/MamonaAwan/CGB_ULD

    ACM Class: I.4

  43. arXiv:2309.04702  [pdf, other]

    cs.CV

    A Spatial-Temporal Deformable Attention based Framework for Breast Lesion Detection in Videos

    Authors: Chao Qin, Jiale Cao, Huazhu Fu, Rao Muhammad Anwer, Fahad Shahbaz Khan

    Abstract: Detecting breast lesions in videos is crucial for computer-aided diagnosis. Existing video-based breast lesion detection approaches typically perform temporal feature aggregation of deep backbone features based on the self-attention operation. We argue that such a strategy struggles to effectively perform deep feature aggregation and ignores the useful local information. To tackle these issues, we…

    Submitted 9 September, 2023; originally announced September 2023.

    Comments: Accepted by MICCAI 2023

  44. arXiv:2308.15816  [pdf, other]

    cs.CV

    Improving Underwater Visual Tracking With a Large Scale Dataset and Image Enhancement

    Authors: Basit Alawode, Fayaz Ali Dharejo, Mehnaz Ummar, Yuhang Guo, Arif Mahmood, Naoufel Werghi, Fahad Shahbaz Khan, Jiri Matas, Sajid Javed

    Abstract: This paper presents a new dataset and general tracker enhancement method for Underwater Visual Object Tracking (UVOT). Despite its significance, underwater tracking has remained unexplored due to data inaccessibility. It poses distinct challenges; the underwater environment exhibits non-uniform lighting conditions, low visibility, lack of sharpness, low contrast, camouflage, and reflections from s…

    Submitted 31 August, 2023; v1 submitted 30 August, 2023; originally announced August 2023.

  45. How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges

    Authors: Haotong Qin, Ge-Peng Ji, Salman Khan, Deng-Ping Fan, Fahad Shahbaz Khan, Luc Van Gool

    Abstract: Google's Bard has emerged as a formidable competitor to OpenAI's ChatGPT in the field of conversational AI. Notably, Bard has recently been updated to handle visual inputs alongside text prompts during conversations. Given Bard's impressive track record in handling textual inputs, we explore its capabilities in understanding and interpreting visual data (images) conditioned by text questions. This…

    Submitted 30 August, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

    Journal ref: Machine Intelligence Research. 20(5), October 2023, 605-613

  46. arXiv:2307.13721  [pdf, other]

    cs.CV cs.AI

    Foundational Models Defining a New Era in Vision: A Survey and Outlook

    Authors: Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Fahad Shahbaz Khan

    Abstract: Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and other modalities such as audio and depth. The models learned to bridge…

    Submitted 25 July, 2023; originally announced July 2023.

    Comments: Project page: https://github.com/awaisrauf/Awesome-CV-Foundational-Models

  47. arXiv:2307.07269  [pdf, other]

    eess.IV cs.CV cs.LG

    Frequency Domain Adversarial Training for Robust Volumetric Medical Segmentation

    Authors: Asif Hanif, Muzammal Naseer, Salman Khan, Mubarak Shah, Fahad Shahbaz Khan

    Abstract: It is imperative to ensure the robustness of deep learning models in critical applications such as healthcare. While recent advances in deep learning have improved the performance of volumetric medical image segmentation models, these models cannot be deployed for real-world applications immediately due to their vulnerability to adversarial attacks. We present a 3D frequency domain adversarial at…

    Submitted 20 July, 2023; v1 submitted 14 July, 2023; originally announced July 2023.

    Comments: This paper has been accepted in MICCAI 2023 conference

  48. arXiv:2307.06948  [pdf, other]

    cs.CV

    Self-regulating Prompts: Foundational Model Adaptation without Forgetting

    Authors: Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan

    Abstract: Prompt learning has emerged as an efficient alternative for fine-tuning foundational models, such as CLIP, for various downstream tasks. Conventionally trained using the task-specific objective, i.e., cross-entropy loss, prompts tend to overfit downstream data distributions and find it challenging to capture task-agnostic general features from the frozen CLIP. This leads to the loss of the model's…

    Submitted 24 August, 2023; v1 submitted 13 July, 2023; originally announced July 2023.

    Comments: Accepted to ICCV-2023. Camera-Ready version. Project page: https://muzairkhattak.github.io/PromptSRC/

  49. arXiv:2307.06947  [pdf, other]

    cs.CV cs.AI

    Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition

    Authors: Syed Talal Wasim, Muhammad Uzair Khattak, Muzammal Naseer, Salman Khan, Mubarak Shah, Fahad Shahbaz Khan

    Abstract: Recent video recognition models utilize Transformer models for long-range spatio-temporal context modeling. Video transformer designs are based on self-attention that can model global context at a high computational cost. In comparison, convolutional designs for videos offer an efficient alternative but lack long-range dependency modeling. Towards achieving the best of both designs, this work prop…

    Submitted 27 October, 2023; v1 submitted 13 July, 2023; originally announced July 2023.

    Comments: Accepted to ICCV-2023. Camera-Ready version. Project page: https://TalalWasim.github.io/Video-FocalNets/

  50. arXiv:2306.14255  [pdf, other]

    eess.IV cs.CV

    AttResDU-Net: Medical Image Segmentation Using Attention-based Residual Double U-Net

    Authors: Akib Mohammed Khan, Alif Ashrafee, Fahim Shahriar Khan, Md. Bakhtiar Hasan, Md. Hasanul Kabir

    Abstract: Manually inspecting polyps from a colonoscopy for colorectal cancer or performing a biopsy on skin lesions for skin cancer are time-consuming, laborious, and complex procedures. Automatic medical image segmentation aims to expedite this diagnosis process. However, numerous challenges exist due to significant variations in the appearance and sizes of objects with no distinct boundaries. This paper…

    Submitted 25 June, 2023; originally announced June 2023.

    Comments: Accepted in 2023 International Joint Conference on Neural Networks (IJCNN 2023)