Search | arXiv e-print repository

VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time?

Authors: Darshana Saravanan, Darshan Singh, Varun Gupta, Zeeshan Khan, Vineet Gandhi, Makarand Tapaswi

Abstract: Compositionality is a fundamental aspect of vision-language understanding and is especially required for videos since they contain multiple entities (e.g. persons, actions, and scenes) interacting dynamically over time. Existing benchmarks focus primarily on perception capabilities. However, they do not study binding, the ability of a model to associate entities through appropriate relationships. To this end, we propose VELOCITI, a new benchmark building on complex movie clips and dense semantic role label annotations to test perception and binding in video language models (contrastive and Video-LLMs). Our perception-based tests require discriminating video-caption pairs that share similar entities, and the binding tests require models to associate the correct entity to a given situation while ignoring the different yet plausible entities that also appear in the same video. While current state-of-the-art models perform moderately well on perception tests, accuracy is near random when both entities are present in the same video, indicating that they fail at binding tests. Even the powerful Gemini 1.5 Flash has a substantial gap (16-28%) with respect to human accuracy in such binding tests.

Submitted 16 June, 2024; originally announced June 2024.

Comments: 26 pages, 17 figures, 3 tables

arXiv:2405.17788 [pdf, other]

Enhancing Road Safety: Real-Time Detection of Driver Distraction through Convolutional Neural Networks

Authors: Amaan Aijaz Sheikh, Imaad Zaffar Khan

Abstract: As we navigate our daily commutes, the threat posed by distracted drivers looms large, resulting in a troubling rise in traffic accidents. Addressing this safety concern, our project harnesses the analytical power of Convolutional Neural Networks (CNNs), with a particular emphasis on the well-established models VGG16 and VGG19. These models are acclaimed for their precision in image recognition and are meticulously tested for their ability to detect nuances in driver behavior under varying environmental conditions. Through a comparative analysis against an array of CNN architectures, this study seeks to identify the most efficient model for real-time detection of driver distractions. The ultimate aim is to incorporate the findings into vehicle safety systems, significantly boosting their capability to prevent accidents triggered by inattention. This research not only enhances our understanding of automotive safety technologies but also marks a pivotal step towards creating vehicles that are intuitively aligned with driver behaviors, ensuring safer roads for all.

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.13949 [pdf, other]

PitVQA: Image-grounded Text Embedding LLM for Visual Question Answering in Pituitary Surgery

Authors: Runlong He, Mengya Xu, Adrito Das, Danyal Z. Khan, Sophia Bano, Hani J. Marcus, Danail Stoyanov, Matthew J. Clarkson, Mobarakol Islam

Abstract: Visual Question Answering (VQA) within the surgical domain, utilizing Large Language Models (LLMs), offers a distinct opportunity to improve intra-operative decision-making and facilitate intuitive surgeon-AI interaction. However, the development of LLMs for surgical VQA is hindered by the scarcity of diverse and extensive datasets with complex reasoning tasks. Moreover, contextual fusion of the image and text modalities remains an open research challenge due to the inherent differences between these two types of information and the complexity involved in aligning them. This paper introduces PitVQA, a novel dataset specifically designed for VQA in endonasal pituitary surgery, and PitVQA-Net, an adaptation of GPT2 with a novel image-grounded text embedding for surgical VQA. PitVQA comprises 25 procedural videos and a rich collection of question-answer pairs spanning crucial surgical aspects such as phase and step recognition, context understanding, tool detection and localization, and tool-tissue interactions. PitVQA-Net consists of a novel image-grounded text embedding that projects image and text features into a shared embedding space and a GPT2 backbone with an excitation block classification head to generate contextually relevant answers within the complex domain of endonasal pituitary surgery. Our image-grounded text embedding leverages joint embedding, cross-attention and contextual representation to understand the contextual relationship between questions and surgical images.
We demonstrate the effectiveness of PitVQA-Net on both the PitVQA and the publicly available EndoVis18-VQA dataset, achieving improvements in balanced accuracy of 8% and 9% over the most recent baselines, respectively. Our code and dataset are available at https://github.com/mobarakol/PitVQA.

Submitted 22 May, 2024; originally announced May 2024.

Comments: 10 pages, 3 figures

arXiv:2405.11483 [pdf, other]

MICap: A Unified Model for Identity-aware Movie Descriptions

Authors: Haran Raajesh, Naveen Reddy Desanur, Zeeshan Khan, Makarand Tapaswi

Abstract: Characters are an important aspect of any storyline and identifying and including them in descriptions is necessary for story understanding. While previous work has largely ignored identity and generated captions with someone (anonymized names), recent work formulates id-aware captioning as a fill-in-the-blanks (FITB) task, where, given a caption with blanks, the goal is to predict person id labels. However, to predict captions with ids, a two-stage approach is required: first predict captions with someone, then fill in identities. In this work, we present a new single stage approach that can seamlessly switch between id-aware caption generation or FITB when given a caption with blanks. Our model, Movie-Identity Captioner (MICap), uses a shared auto-regressive decoder that benefits from training with FITB and full-caption generation objectives, while the encoder can benefit from or disregard captions with blanks as input. Another challenge with id-aware captioning is the lack of a metric to capture subtle differences between person ids. To this end, we introduce iSPICE, a caption evaluation metric that focuses on identity tuples created through intermediate scene graphs. We evaluate MICap on Large-Scale Movie Description Challenge (LSMDC), where we show a 4.2% improvement in FITB accuracy, and a 1-2% bump in classic captioning metrics.

Submitted 19 May, 2024; originally announced May 2024.

Comments: CVPR 2024, Project Page: https://katha-ai.github.io/projects/micap/

arXiv:2404.10193 [pdf, other]

Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering

Authors: Zaid Khan, Yun Fu

Abstract: The goal of selective prediction is to allow a model to abstain when it may not be able to deliver a reliable prediction, which is important in safety-critical contexts. Existing approaches to selective prediction typically require access to the internals of a model, require retraining a model or study only unimodal models. However, the most powerful models (e.g. GPT-4) are typically only available as black boxes with inaccessible internals, are not retrainable by end-users, and are frequently used for multimodal tasks. We study the possibility of selective prediction for vision-language models in a realistic, black-box setting. We propose using the principle of neighborhood consistency to identify unreliable responses from a black-box vision-language model in question answering tasks. We hypothesize that given only a visual question and model response, the consistency of the model's responses over the neighborhood of a visual question will indicate reliability. It is impossible to directly sample neighbors in feature space in a black-box setting. Instead, we show that it is possible to use a smaller proxy model to approximately sample from the neighborhood. We find that neighborhood consistency can be used to identify model responses to visual questions that are likely unreliable, even in adversarial settings or settings that are out-of-distribution to the proxy model.

Submitted 15 April, 2024; originally announced April 2024.

Comments: CVPR 2024

arXiv:2404.04627 [pdf, other]

Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

Authors: Zaid Khan, Vijay Kumar BG, Samuel Schulter, Yun Fu, Manmohan Chandraker

Abstract: Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks. Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs. Training an LLM to write better visual programs is an attractive prospect, but it is unclear how to accomplish this. No dataset of visual programs for training exists, and acquisition of a visual program dataset cannot be easily crowdsourced due to the need for expert annotators. To get around the lack of direct supervision, we explore improving the program synthesis abilities of an LLM using feedback from interactive experience. We propose a method where we exploit existing annotations for a vision-language task to improvise a coarse reward signal for that task, treat the LLM as a policy, and apply reinforced self-training to improve the visual program synthesis ability of the LLM for that task. We describe a series of experiments on object detection, compositional visual question answering, and image-text retrieval, and show that in each case, the self-trained LLM outperforms or performs on par with few-shot frozen LLMs that are an order of magnitude larger. Website: https://zaidkhan.me/ViReP

Submitted 6 April, 2024; originally announced April 2024.

Comments: CVPR 2024

arXiv:2403.09715 [pdf, other]

doi 10.1109/ICECCE49384.2020.9179319

Textual analysis of End User License Agreement for red-flagging potentially malicious software

Authors: Behraj Khan, Tahir Syed, Zeshan Khan, Muhammad Rafi

Abstract: New software and updates are downloaded by end users every day. Each piece of downloaded software has an associated End User License Agreement (EULA), but this is rarely read. A EULA includes information to avoid legal repercussions. However, this poses a host of potential problems, such as spyware or an unwanted effect on the target system. End users do not read these EULAs because of the length of the document, and they find it extremely difficult to understand. Text summarization is one relevant solution to this kind of problem. This requires a solution which can summarize the EULA and classify it as "Benign" or "Malicious". We propose a solution in which we summarize the EULA and classify it as "Benign" or "Malicious". We extract the EULA text of different software, then classify the text using eight different supervised classifiers. We use ensemble learning to classify the EULA as benign or malicious using five different text summarization methods. An accuracy of 95.8% shows the effectiveness of the presented approach.

Submitted 11 March, 2024; originally announced March 2024.

arXiv:2402.05126 [pdf, other]

Graph Neural Network and NER-Based Text Summarization

Authors: Imaad Zaffar Khan, Amaan Aijaz Sheikh, Utkarsh Sinha

Abstract: With the abundance of data and information available today, it is nearly impossible for a person, or even a machine, to go through all of the data line by line. What one usually does is skim through the lines and retain the most important information, which in more formal terms is called summarization. Text summarization is an important task that aims to compress lengthy documents or articles into shorter, coherent representations while preserving the core information and meaning. This project introduces an innovative approach to text summarization, leveraging the capabilities of Graph Neural Networks (GNNs) and Named Entity Recognition (NER) systems. GNNs, with their exceptional ability to capture and process the relational data inherent in textual information, are adept at understanding the complex structures within large documents. Meanwhile, NER systems contribute by identifying and emphasizing key entities, ensuring that the summarization process maintains a focus on the most critical aspects of the text. By integrating these two technologies, our method aims to enhance the efficiency of summarization and to ensure a high degree of relevance in the condensed content. This project, therefore, offers a promising direction for handling the ever-increasing volume of textual data in an information-saturated world.

Submitted 4 February, 2024; originally announced February 2024.

arXiv:2401.12667 [pdf, ps, other]

Feature Selection via Robust Weighted Score for High Dimensional Binary Class-Imbalanced Gene Expression Data

Authors: Zardad Khan, Amjad Ali, Saeed Aldahmani

Abstract: In this paper, a robust weighted score for unbalanced data (ROWSU) is proposed for selecting the most discriminative features for high dimensional gene expression binary classification with a class-imbalance problem. The method addresses one of the most challenging problems of highly skewed class distributions in gene expression datasets that adversely affect the performance of classification algorithms. First, the training dataset is balanced by synthetically generating data points from minority class observations. Second, a minimum subset of genes is selected using a greedy search approach. Third, a novel weighted robust score, where the weights are computed by support vectors, is introduced to obtain a refined set of genes. The highest-scoring genes based on this approach are combined with the minimum subset of genes selected by the greedy search approach to form the final set of genes. The novel method ensures the selection of the most discriminative genes, even in the presence of skewed class distribution, thus improving the performance of the classifiers. The performance of the proposed ROWSU method is evaluated on 6 gene expression datasets. Classification accuracy and sensitivity are used as performance metrics to compare the proposed ROWSU algorithm with several other state-of-the-art methods. Boxplots and stability plots are also constructed for a better understanding of the results.
The results show that the proposed method outperforms the existing feature selection procedures based on classification performance from k nearest neighbours (kNN) and random forest (RF) classifiers.

Submitted 23 January, 2024; originally announced January 2024.

Comments: 25 pages

MSC Class: 14J60

arXiv:2401.07669 [pdf, other]

FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos

Authors: Darshan Singh S, Zeeshan Khan, Makarand Tapaswi

Abstract: While contrastive language image pretraining (CLIP) has exhibited impressive performance by learning highly semantic and generalized representations, recent works have exposed a fundamental drawback in its syntactic properties, which include interpreting fine-grained attributes, actions, spatial relations, states, and details that require compositional reasoning. One reason for this is that natural captions often do not capture all the visual details of a scene. This leads to unaddressed visual concepts being misattributed to the wrong words. The pooled image and text features then end up acting as a bag of words, losing the syntactic information. In this work, we ask: Is it possible to enhance CLIP's fine-grained and syntactic abilities without compromising its semantic properties? We show that this is possible by adapting CLIP efficiently on a high-quality, comprehensive, and relatively small dataset. We demonstrate our adaptation strategy on VidSitu, a video situation recognition dataset annotated with verbs and rich semantic role labels (SRL). We use the SRL and verb information to create rule-based detailed captions, making sure they capture most of the visual concepts. Combined with hard negatives and hierarchical losses, these annotations allow us to learn a powerful visual representation, dubbed Fine-Grained CLIP (FiGCLIP), that preserves semantic understanding while being detail-oriented.
We evaluate on five diverse vision-language tasks in both fine-tuning and zero-shot settings, achieving consistent improvements over the base CLIP model.

Submitted 15 January, 2024; originally announced January 2024.

arXiv:2311.09762 [pdf, other]

Graph Elicitation for Guiding Multi-Step Reasoning in Large Language Models

Authors: Jinyoung Park, Ameen Patel, Omar Zia Khan, Hyunwoo J. Kim, Joo-Kyung Kim

Abstract: Chain-of-Thought (CoT) prompting along with sub-question generation and answering has enhanced multi-step reasoning capabilities of Large Language Models (LLMs). However, prompting the LLMs to directly generate sub-questions is suboptimal since they sometimes generate redundant or irrelevant questions. To deal with this, we propose a GE-Reasoning method, which directs LLMs to generate proper sub-questions and corresponding answers. Concretely, given an input question, we first prompt the LLM to generate knowledge triplets, forming a graph representation of the question. Unlike conventional knowledge triplets, our approach allows variables as head or tail entities, effectively representing a question as knowledge triplets. Second, for each triplet, the LLM generates a corresponding sub-question and answer, making use of knowledge retrieval. If the prediction confidence exceeds a threshold, the sub-question and prediction are incorporated into the prompt for subsequent processing. This approach encourages sub-questions to be grounded in the extracted knowledge triplets, reducing redundancy and irrelevance. Our experiments demonstrate that our approach outperforms previous CoT prompting methods and their variants on multi-hop question answering benchmark datasets.

Submitted 22 June, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

Comments: Preprint

arXiv:2310.20081 [pdf, other]

Integrating Summarization and Retrieval for Enhanced Personalization via Large Language Models

Authors: Chris Richardson, Yao Zhang, Kellen Gillespie, Sudipta Kar, Arshdeep Singh, Zeynab Raeesy, Omar Zia Khan, Abhinav Sethy

Abstract: Personalization, the ability to tailor a system to individual users, is an essential factor in user experience with natural language processing (NLP) systems. With the emergence of Large Language Models (LLMs), a key question is how to leverage these models to better personalize user experiences. To personalize a language model's output, a straightforward approach is to incorporate past user data into the language model prompt, but this approach can result in lengthy inputs exceeding limitations on input length and incurring latency and cost issues. Existing approaches tackle such challenges by selectively extracting relevant user data (i.e. selective retrieval) to construct a prompt for downstream tasks. However, retrieval-based methods are limited by potential information loss, lack of more profound user understanding, and cold-start challenges. To overcome these limitations, we propose a novel summary-augmented approach by extending retrieval-augmented personalization with task-aware user summaries generated by LLMs. The summaries can be generated and stored offline, enabling real-world systems with runtime constraints like voice assistants to leverage the power of LLMs. Experiments show that our method, with 75% less retrieved user data, is on par with or outperforms retrieval augmentation on most tasks in the LaMP personalization benchmark. We demonstrate that offline summarization via LLMs and runtime retrieval enables better performance for personalization on a range of tasks under practical constraints.

Submitted 30 October, 2023; originally announced October 2023.

Comments: 4 pages, International Workshop on Personalized Generative AI (@CIKM 2023)

ACM Class: I.2.7; H.3.3

arXiv:2310.17954 [pdf, other]

Multivessel Coronary Artery Segmentation and Stenosis Localisation using Ensemble Learning

Authors: Muhammad Bilal, Dinis Martinho, Reiner Sim, Adnan Qayyum, Hunaid Vohra, Massimo Caputo, Taofeek Akinosho, Sofiat Abioye, Zaheer Khan, Waleed Niaz, Junaid Qadir

Abstract: Coronary angiography analysis is a common clinical task performed by cardiologists to diagnose coronary artery disease (CAD) through an assessment of atherosclerotic plaque's accumulation. This study introduces an end-to-end machine learning solution developed as part of our solution for the MICCAI 2023 Automatic Region-based Coronary Artery Disease diagnostics using x-ray angiography imagEs (ARCADE) challenge, which aims to benchmark solutions for multivessel coronary artery segmentation and potential stenotic lesion localisation from X-ray coronary angiograms. We adopted a robust baseline model training strategy to progressively improve performance, comprising five successive stages of binary class pretraining, multivessel segmentation, fine-tuning using class frequency weighted dataloaders, fine-tuning using F1-based curriculum learning strategy (F1-CLS), and finally multi-target angiogram view classifier-based collective adaptation. Unlike many other medical imaging procedures, this task exhibits a notable degree of interobserver variability. Our ensemble model combines the outputs from six baseline models using the weighted ensembling approach, which our analysis shows doubles the predictive accuracy of the proposed solution. The final prediction was further refined, targeting the correction of misclassified blobs.
Our solution achieved a mean F1 score of 37.69% for coronary artery segmentation, and 39.41% for stenosis localisation, positioning our team in the 5th position on both leaderboards. This work demonstrates the potential of automated tools to aid CAD diagnosis, guide interventions, and improve the accuracy of stent injections in clinical settings.

Submitted 27 October, 2023; originally announced October 2023.

Comments: Submission report for ARCADE challenge hosted at MICCAI2023

arXiv:2310.17050 [pdf, other]

Exploring Question Decomposition for Zero-Shot VQA

Authors: Zaid Khan, Vijay Kumar BG, Samuel Schulter, Manmohan Chandraker, Yun Fu

Abstract: Visual question answering (VQA) has traditionally been treated as a single-step task where each question receives the same amount of effort, unlike natural human question-answering strategies. We explore a question decomposition strategy for VQA to overcome this limitation. We probe the ability of recently developed large vision-language models to use human-written decompositions and produce their own decompositions of visual questions, finding they are capable of learning both tasks from demonstrations alone. However, we show that naive application of model-written decompositions can hurt performance. We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors, and validate its effectiveness on eight VQA tasks across three domains, showing consistent improvements in accuracy, including improvements of >20% on medical VQA datasets and boosting the zero-shot performance of BLIP-2 above chance on a VQA reformulation of the challenging Winoground task. Project Site: https://zaidkhan.me/decomposition-0shot-vqa/

Submitted 25 October, 2023; originally announced October 2023.

Comments: NeurIPS 2023 Camera Ready

arXiv:2310.17032 [pdf, other]

Quantum Long Short-Term Memory (QLSTM) vs Classical LSTM in Time Series Forecasting: A Comparative Study in Solar Power Forecasting

Authors: Saad Zafar Khan, Nazeefa Muzammil, Salman Ghafoor, Haibat Khan, Syed Mohammad Hasan Zaidi, Abdulah Jeza Aljohani, Imran Aziz

Abstract: Accurate solar power forecasting is pivotal for the global transition towards sustainable energy systems. This study conducts a meticulous comparison between Quantum Long Short-Term Memory (QLSTM) and classical Long Short-Term Memory (LSTM) models for solar power production forecasting. The primary objective is to evaluate the potential advantages of QLSTMs, leveraging their exponential representational capabilities, in capturing the intricate spatiotemporal patterns inherent in renewable energy data. Through controlled experiments on real-world photovoltaic datasets, our findings reveal promising improvements offered by QLSTMs, including accelerated training convergence and substantially reduced test loss within the initial epoch compared to classical LSTMs. These empirical results demonstrate QLSTM's potential to swiftly assimilate complex time series relationships, enabled by quantum phenomena like superposition. However, realizing QLSTM's full capabilities necessitates further research into model validation across diverse conditions, systematic hyperparameter optimization, hardware noise resilience, and applications to correlated renewable forecasting problems. With continued progress, quantum machine learning can offer a paradigm shift in renewable energy time series prediction, potentially ushering in an era of unprecedented accuracy and reliability in solar power forecasting worldwide. This pioneering work provides initial evidence substantiating quantum advantages over classical LSTM models while acknowledging present limitations.
Through rigorous benchmarking grounded in real-world data, our study illustrates a promising trajectory for quantum learning in renewable forecasting.

Submitted 9 April, 2024; v1 submitted 25 October, 2023; originally announced October 2023.

Comments: 33 pages, 9 figures

arXiv:2308.15827 [pdf, other]

Introducing Language Guidance in Prompt-based Continual Learning

Authors: Muhammad Gul Zain Ali Khan, Muhammad Ferjad Naeem, Luc Van Gool, Didier Stricker, Federico Tombari, Muhammad Zeshan Afzal

Abstract: Continual Learning aims to learn a single model on a sequence of tasks without having access to data from previous tasks. The biggest challenge in the domain still remains catastrophic forgetting: a loss in performance on seen classes of earlier tasks. Some existing methods rely on an expensive replay buffer to store a chunk of data from previous tasks. This, while promising, becomes expensive when the number of tasks becomes large or data can not be stored for privacy reasons. As an alternative, prompt-based methods have been proposed that store the task information in a learnable prompt pool. This prompt pool instructs a frozen image encoder on how to solve each task. While the model faces a disjoint set of classes in each task in this setting, we argue that these classes can be encoded to the same embedding space of a pre-trained language encoder. In this work, we propose Language Guidance for Prompt-based Continual Learning (LGCL) as a plug-in for prompt-based methods. LGCL is model agnostic and introduces language guidance at the task level in the prompt pool and at the class level on the output feature of the vision encoder. We show with extensive experimentation that LGCL consistently improves the performance of prompt-based continual learning methods to set a new state-of-the art. LGCL achieves these performance improvements without needing any additional learnable parameters.

Submitted 30 August, 2023; originally announced August 2023.

Comments: Accepted at ICCV 2023

arXiv:2307.16262 [pdf, other]

Validating polyp and instrument segmentation methods in colonoscopy through Medico 2020 and MedAI 2021 Challenges

Authors: Debesh Jha, Vanshali Sharma, Debapriya Banik, Debayan Bhattacharya, Kaushiki Roy, Steven A. Hicks, Nikhil Kumar Tomar, Vajira Thambawita, Adrian Krenzer, Ge-Peng Ji, Sahadev Poudel, George Batchkala, Saruar Alam, Awadelrahman M. A. Ahmed, Quoc-Huy Trinh, Zeshan Khan, Tien-Phat Nguyen, Shruti Shrestha, Sabari Nathan, Jeonghwan Gwak, Ritika K. Jha, Zheyuan Zhang, Alexander Schlaefer, Debotosh Bhattacharjee, M. K. Bhuyan , et al. (8 additional authors not shown)

Abstract: Automatic analysis of colonoscopy images has been an active field of research motivated by the importance of early detection of precancerous polyps. However, detecting polyps during the live examination can be challenging due to various factors such as variation of skills and experience among the endoscopists, lack of attentiveness, and fatigue leading to a high polyp miss-rate. Deep learning has emerged as a promising solution to this challenge as it can assist endoscopists in detecting and classifying overlooked polyps and abnormalities in real time. In addition to the algorithm's accuracy, transparency and interpretability are crucial to explaining the whys and hows of the algorithm's prediction. Further, most algorithms are developed in private data, closed source, or proprietary software, and methods lack reproducibility. Therefore, to promote the development of efficient and transparent methods, we have organized the "Medico automatic polyp segmentation (Medico 2020)" and "MedAI: Transparency in Medical Image Segmentation (MedAI 2021)" competitions. We present a comprehensive summary and analyze each contribution, highlight the strength of the best-performing methods, and discuss the possibility of clinical translations of such methods into the clinic. For the transparency task, a multi-disciplinary team, including expert gastroenterologists, accessed each submission and evaluated the team based on open-source practices, failure case analysis, ablation studies, usability and understandability of evaluations to gain a deeper understanding of the models' credibility for clinical deployment. Through the comprehensive analysis of the challenge, we not only highlight the advancements in polyp and surgical instrument segmentation but also encourage qualitative evaluation for building more transparent and understandable AI-based colonoscopy systems.

Submitted 6 May, 2024; v1 submitted 30 July, 2023; originally announced July 2023.

arXiv:2306.03932 [pdf, other]

Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!

Authors: Zaid Khan, Vijay Kumar BG, Samuel Schulter, Xiang Yu, Yun Fu, Manmohan Chandraker

Abstract: Finetuning a large vision language model (VLM) on a target dataset after large scale pretraining is a dominant paradigm in visual question answering (VQA). Datasets for specialized tasks such as knowledge-based VQA or VQA in non natural-image domains are orders of magnitude smaller than those for general-purpose VQA. While collecting additional labels for specialized tasks or domains can be challenging, unlabeled images are often available. We introduce SelTDA (Self-Taught Data Augmentation), a strategy for finetuning large VLMs on small-scale VQA datasets. SelTDA uses the VLM and target dataset to build a teacher model that can generate question-answer pseudolabels directly conditioned on an image alone, allowing us to pseudolabel unlabeled images. SelTDA then finetunes the initial VLM on the original dataset augmented with freshly pseudolabeled images. We describe a series of experiments showing that our self-taught data augmentation increases robustness to adversarially searched questions, counterfactual examples and rephrasings, improves domain generalization, and results in greater retention of numerical reasoning skills. The proposed strategy requires no additional annotations or architectural modifications, and is compatible with any modern encoder-decoder multimodal transformer. Code available at https://github.com/codezakh/SelTDA.

Submitted 6 June, 2023; originally announced June 2023.

Comments: CVPR 2023

arXiv:2306.01819 [pdf]

Comparative Analysis of Widely use Object-Oriented Languages

Authors: Muhammad Shoaib Farooq, Taymour zaman Khan

Abstract: Programming is an integral part of computer science discipline. Every day the programming environment is not only rapidly growing but also changing and languages are constantly evolving. Learning of object-oriented paradigm is compulsory in every computer science major so the choice of language to teach object-oriented principles is very important. Due to large pool of object-oriented languages, it is difficult to choose which should be the first programming language in order to teach object-oriented principles. Many studies shown which should be the first language to tech object-oriented concepts but there is no method to compare and evaluate these languages. In this article we proposed a comprehensive framework to evaluate the widely used object-oriented languages. The languages are evaluated basis of their technical and environmental features.

Submitted 2 June, 2023; originally announced June 2023.

Comments: 30 pages, figures 2

arXiv:2305.15897 [pdf, other]

Impact of Log Parsing on Log-based Anomaly Detection

Authors: Zanis Ali Khan, Donghwan Shin, Domenico Bianculli, Lionel Briand

Abstract: Software systems log massive amounts of data, recording important runtime information. Such logs are used, for example, for log-based anomaly detection, which aims to automatically detect abnormal behaviors of the system under analysis by processing the information recorded in its logs. Many log-based anomaly detection techniques based on deep-learning models include a pre-processing step called log parsing. However, understanding the impact of log parsing on the accuracy of anomaly detection techniques has received surprisingly little attention so far. Investigating what are the key properties log parsing techniques should ideally have to help anomaly detection is therefore warranted. In this paper, we report on a comprehensive empirical study on the impact of log parsing on anomaly detection accuracy, using 13 log parsing techniques and five deep-learning-based anomaly detection techniques on two publicly available log datasets. Our empirical results show that, despite what is widely assumed, there is no strong correlation between log parsing accuracy and anomaly detection accuracy (regardless of the metric used for measuring log parsing accuracy). Moreover, we experimentally confirm existing theoretical results showing that it is a property that we refer to as distinguishability in log parsing results as opposed to their accuracy that plays an essential role in achieving accurate anomaly detection.

Submitted 25 May, 2023; originally announced May 2023.

arXiv:2305.06934 [pdf, other]

Humans are Still Better than ChatGPT: Case of the IEEEXtreme Competition

Authors: Anis Koubaa, Basit Qureshi, Adel Ammar, Zahid Khan, Wadii Boulila, Lahouari Ghouti

Abstract: Since the release of ChatGPT, numerous studies have highlighted the remarkable performance of ChatGPT, which often rivals or even surpasses human capabilities in various tasks and domains. However, this paper presents a contrasting perspective by demonstrating an instance where human performance excels in typical tasks suited for ChatGPT, specifically in the domain of computer programming. We utilize the IEEExtreme Challenge competition as a benchmark, a prestigious, annual international programming contest encompassing a wide range of problems with different complexities. To conduct a thorough evaluation, we selected and executed a diverse set of 102 challenges, drawn from five distinct IEEExtreme editions, using three major programming languages: Python, Java, and C++. Our empirical analysis provides evidence that contrary to popular belief, human programmers maintain a competitive edge over ChatGPT in certain aspects of problem-solving within the programming context. In fact, we found that the average score obtained by ChatGPT on the set of IEEExtreme programming problems is 3.9 to 5.8 times lower than the average human score, depending on the programming language. This paper elaborates on these findings, offering critical insights into the limitations and potential areas of improvement for AI-based language models like ChatGPT.

Submitted 10 May, 2023; originally announced May 2023.

Comments: 9 pages, 3 figures

arXiv:2304.09756 [pdf, other]

Contactless Human Activity Recognition using Deep Learning with Flexible and Scalable Software Define Radio

Authors: Muhammad Zakir Khan, Jawad Ahmad, Wadii Boulila, Matthew Broadbent, Syed Aziz Shah, Anis Koubaa, Qammer H. Abbasi

Abstract: Ambient computing is gaining popularity as a major technological advancement for the future. The modern era has witnessed a surge in the advancement in healthcare systems, with viable radio frequency solutions proposed for remote and unobtrusive human activity recognition (HAR). Specifically, this study investigates the use of Wi-Fi channel state information (CSI) as a novel method of ambient sensing that can be employed as a contactless means of recognizing human activity in indoor environments. These methods avoid additional costly hardware required for vision-based systems, which are privacy-intrusive, by (re)using Wi-Fi CSI for various safety and security applications. During an experiment utilizing universal software-defined radio (USRP) to collect CSI samples, it was observed that a subject engaged in six distinct activities, which included no activity, standing, sitting, and leaning forward, across different areas of the room. Additionally, more CSI samples were collected when the subject walked in two different directions. This study presents a Wi-Fi CSI-based HAR system that assesses and contrasts deep learning approaches, namely convolutional neural network (CNN), long short-term memory (LSTM), and hybrid (LSTM+CNN), employed for accurate activity recognition. The experimental results indicate that LSTM surpasses current models and achieves an average accuracy of 95.3% in multi-activity classification when compared to CNN and hybrid techniques. In the future, research needs to study the significance of resilience in diverse and dynamic environments to identify the activity of multiple users.

Submitted 18 April, 2023; originally announced April 2023.

arXiv:2304.04161 [pdf]

Detection of COVID19 in Chest X-Ray Images Using Transfer Learning

Authors: Zanoby N. Khan

Abstract: COVID19 is a highly contagious disease infected millions of people worldwide. With limited testing components, screening tools such as chest radiography can assist the clinicians in the diagnosis and assessing the progress of disease. The performance of deep learning-based systems for diagnosis of COVID-19 disease in radiograph images has been encouraging. This paper investigates the concept of transfer learning using two of the most well-known VGGNet architectures, namely VGG-16 and VGG-19. The classifier block and hyperparameters are fine-tuned to adopt the models for automatic detection of Covid-19 in chest x-ray images. We generated two different datasets to evaluate the performance of the proposed system for the identification of positive Covid-19 instances in a multiclass and binary classification problems. The experimental outcome demonstrates the usefulness of transfer learning for small-sized datasets particularly in the field of medical imaging, not only to prevent over-fitting and convergence problems but also to attain optimal classification performance as well.

Submitted 9 April, 2023; originally announced April 2023.

arXiv:2304.03561 [pdf]

Diversity Preserving, Universal Hard Decision Decoder for Linear Block Codes

Authors: Praveen Sai Bere, Mohammed Zafar Ali Khan

Abstract: Hard-decision decoding does not preserve the diversity order. This results in severe performance degradation in fading channels. In contrast, soft-decision decoding preserves the diversity order at an impractical computational complexity. For a linear block code $\mathscr{C}(n,k)$ of length $n$ and dimension $k$, the complexity of soft-decision decoding is of the order of $2^k$. This paper proposes a novel hard-decision decoder named Flip decoder (FD), which preserves the diversity order. Further, the proposed Flip decoder is `universally' applicable to all linear block codes. For a code $\mathscr{C}(n,k)$, with a minimum distance ${d_{\min}}$, the proposed decoder has a complexity of the order of $2^{({d_{\min}}-1)}$. For low ${d_{\min}}$ codes, this complexity is meager compared to known soft and hard decision decoding algorithms. As it also preserves diversity, it is suitable for IoT, URLLC, WBAN, and other similar applications. Simulation results and comparisons are provided for various known codes. These simulations corroborate and emphasize the practicality of the proposed decoder.

Submitted 7 April, 2023; originally announced April 2023.

Comments: Transacton of 10 pages with 4 figures

arXiv:2303.12210 [pdf, ps, other]

A Random Projection k Nearest Neighbours Ensemble for Classification via Extended Neighbourhood Rule

Authors: Amjad Ali, Muhammad Hamraz, Dost Muhammad Khan, Wajdan Deebani, Zardad Khan

Abstract: Ensembles based on k nearest neighbours (kNN) combine a large number of base learners, each constructed on a sample taken from a given training data. Typical kNN based ensembles determine the k closest observations in the training data bounded to a test sample point by a spherical region to predict its class. In this paper, a novel random projection extended neighbourhood rule (RPExNRule) ensemble is proposed where bootstrap samples from the given training data are randomly projected into lower dimensions for additional randomness in the base models and to preserve features information. It uses the extended neighbourhood rule (ExNRule) to fit kNN as base learners on randomly projected bootstrap samples.

Submitted 21 March, 2023; originally announced March 2023.

Comments: 23 pages, 8 diagrams, 69 references

ACM Class: F.2.2

arXiv:2303.11866 [pdf, other]

Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning

Authors: Zaid Khan, Yun Fu

Abstract: Contrastive vision-language models (e.g. CLIP) are typically created by updating all the parameters of a vision model and language model through contrastive training. Can such models be created by a small number of parameter updates to an already-trained language model and vision model? The literature describes techniques that can create vision-language models by updating a small number of parameters in a language model, but these require already aligned visual representations and are non-contrastive, hence unusable for latency-sensitive applications such as neural search. We explore the feasibility and benefits of parameter-efficient contrastive vision-language alignment through transfer learning: creating a model such as CLIP by minimally updating an already-trained vision and language model. We find that a minimal set of parameter updates ($<$7%) can achieve the same performance as full-model training, and updating specific components ($<$1% of parameters) can match 75% of full-model training. We describe a series of experiments: we show that existing knowledge is conserved more strongly in parameter-efficient training and that parameter-efficient scaling scales with model and dataset size. Where paired-image text data is scarce but strong multilingual language models exist (e.g. low resource languages), parameter-efficient training is even preferable to full-model training. Given a fixed compute budget, parameter-efficient training allows training larger models on the same hardware, achieving equivalent performance in less time. Parameter-efficient training hence constitutes an energy-efficient and effective training strategy for contrastive vision-language models that may be preferable to the full-model training paradigm for common use cases. Code and weights at https://github.com/codezakh/LilT.

Submitted 21 March, 2023; originally announced March 2023.

Comments: Accepted to ICLR 2023

arXiv:2303.00337 [pdf, other]

doi 10.1016/j.engappai.2022.105095

TAU: A Framework for Video-Based Traffic Analytics Leveraging Artificial Intelligence and Unmanned Aerial Systems

Authors: Bilel Benjdira, Anis Koubaa, Ahmad Taher Azar, Zahid Khan, Adel Ammar, Wadii Boulila

Abstract: Smart traffic engineering and intelligent transportation services are in increasing demand from governmental authorities to optimize traffic performance and thus reduce energy costs, increase the drivers' safety and comfort, ensure traffic laws enforcement, and detect traffic violations. In this paper, we address this challenge, and we leverage the use of Artificial Intelligence (AI) and Unmanned Aerial Vehicles (UAVs) to develop an AI-integrated video analytics framework, called TAU (Traffic Analysis from UAVs), for automated traffic analytics and understanding. Unlike previous works on traffic video analytics, we propose an automated object detection and tracking pipeline from video processing to advanced traffic understanding using high-resolution UAV images. TAU combines six main contributions. First, it proposes a pre-processing algorithm to adapt the high-resolution UAV image as input to the object detector without lowering the resolution. This ensures an excellent detection accuracy from high-quality features, particularly the small size of detected objects from UAV images. Second, it introduces an algorithm for recalibrating the vehicle coordinates to ensure that vehicles are uniquely identified and tracked across the multiple crops of the same frame. Third, it presents a speed calculation algorithm based on accumulating information from successive frames. Fourth, TAU counts the number of vehicles per traffic zone based on the Ray Tracing algorithm. Fifth, TAU has a fully independent algorithm for crossroad arbitration based on the data gathered from the different zones surrounding it. Sixth, TAU introduces a set of algorithms for extracting twenty-four types of insights from the raw data collected. The code is shared here: https://github.com/bilel-bj/TAU. Video demonstrations are provided here: https://youtu.be/wXJV0H7LviU and here: https://youtu.be/kGv0gmtVEbI.

Submitted 1 March, 2023; originally announced March 2023.

Comments: This is the final proofread version submitted to Elsevier EAAI: please see the published version at: https://doi.org/10.1016/j.engappai.2022.105095

Journal ref: Engineering Applications of Artificial Intelligence, Volume 114, 2022, 105095, ISSN 0952-1976

arXiv:2302.10978 [pdf, other]

Learning to Retrieve Engaging Follow-Up Queries

Authors: Christopher Richardson, Sudipta Kar, Anjishnu Kumar, Anand Ramachandran, Omar Zia Khan, Zeynab Raeesy, Abhinav Sethy

Abstract: Open domain conversational agents can answer a broad range of targeted queries. However, the sequential nature of interaction with these systems makes knowledge exploration a lengthy task which burdens the user with asking a chain of well phrased questions. In this paper, we present a retrieval based system and associated dataset for predicting the next questions that the user might have. Such a system can proactively assist users in knowledge exploration leading to a more engaging dialog. The retrieval system is trained on a dataset which contains ~14K multi-turn information-seeking conversations with a valid follow-up question and a set of invalid candidates. The invalid candidates are generated to simulate various syntactic and semantic confounders such as paraphrases, partial entity match, irrelevant entity, and ASR errors. We use confounder specific techniques to simulate these negative examples on the OR-QuAC dataset and develop a dataset called the Follow-up Query Bank (FQ-Bank). Then, we train ranking models on FQ-Bank and present results comparing supervised and unsupervised approaches. The results suggest that we can retrieve the valid follow-ups by ranking them in higher positions compared to confounders, but further knowledge grounding can improve ranking performance.

Submitted 21 February, 2023; originally announced February 2023.

Comments: EACL 2023

arXiv:2212.02291 [pdf, other]

I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification

Authors: Muhammad Ferjad Naeem, Muhammad Gul Zain Ali Khan, Yongqin Xian, Muhammad Zeshan Afzal, Didier Stricker, Luc Van Gool, Federico Tombari

Abstract: Recent works have shown that unstructured text (documents) from online sources can serve as useful auxiliary information for zero-shot image classification. However, these methods require access to a high-quality source like Wikipedia and are limited to a single source of information. Large Language Models (LLM) trained on web-scale text show impressive abilities to repurpose their learned knowledge for a multitude of tasks. In this work, we provide a novel perspective on using an LLM to provide text supervision for a zero-shot image classification model. The LLM is provided with a few text descriptions from different annotators as examples. The LLM is conditioned on these examples to generate multiple text descriptions for each class(referred to as views). Our proposed model, I2MVFormer, learns multi-view semantic embeddings for zero-shot image classification with these class views. We show that each text view of a class provides complementary information allowing a model to learn a highly discriminative class embedding. Moreover, we show that I2MVFormer is better at consuming the multi-view text supervision from LLM compared to baseline models. I2MVFormer establishes a new state-of-the-art on three public benchmark datasets for zero-shot image classification with unsupervised semantic embeddings.

Submitted 5 December, 2022; originally announced December 2022.

arXiv:2211.11278 [pdf, ps, other]

Optimal Extended Neighbourhood Rule $k$ Nearest Neighbours Ensemble

Authors: Amjad Ali, Zardad Khan, Dost Muhammad Khan, Saeed Aldahmani

Abstract: The traditional k nearest neighbor (kNN) approach uses a distance formula within a spherical region to determine the k closest training observations to a test sample point. However, this approach may not work well when test point is located outside this region. Moreover, aggregating many base kNN learners can result in poor ensemble performance due to high classification errors. To address these issues, a new optimal extended neighborhood rule based ensemble method is proposed in this paper. This rule determines neighbors in k steps starting from the closest sample point to the unseen observation and selecting subsequent nearest data points until the required number of observations is reached. Each base model is constructed on a bootstrap sample with a random subset of features, and optimal models are selected based on out-of-bag performance after building a sufficient number of models. The proposed ensemble is compared with state-of-the-art methods on 17 benchmark datasets using accuracy, Cohen's kappa, and Brier score (BS). The performance of the proposed method is also assessed by adding contrived features in the original data.

Submitted 15 February, 2024; v1 submitted 21 November, 2022; originally announced November 2022.

Comments: This manuscript has been submitted for publication in the esteemed journal Pattern Recognition Letters

MSC Class: 14J60

arXiv:2210.11557 [pdf, other]

Learning Attention Propagation for Compositional Zero-Shot Learning

Authors: Muhammad Gul Zain Ali Khan, Muhammad Ferjad Naeem, Luc Van Gool, Alain Pagani, Didier Stricker, Muhammad Zeshan Afzal

Abstract: Compositional zero-shot learning aims to recognize unseen compositions of seen visual primitives of object classes and their states. While all primitives (states and objects) are observable during training in some combination, their complex interaction makes this task especially hard. For example, wet changes the visual appearance of a dog very differently from a bicycle. Furthermore, we argue that relationships between compositions go beyond shared states or objects. A cluttered office can contain a busy table; even though these compositions don't share a state or object, the presence of a busy table can guide the presence of a cluttered office. We propose a novel method called Compositional Attention Propagated Embedding (CAPE) as a solution. The key intuition to our method is that a rich dependency structure exists between compositions arising from complex interactions of primitives in addition to other dependencies between compositions. CAPE learns to identify this structure and propagates knowledge between them to learn class embedding for all seen and unseen compositions. In the challenging generalized compositional zero-shot setting, we show that our method outperforms previous baselines to set a new state-of-the-art on three publicly available benchmarks.

Submitted 20 October, 2022; originally announced October 2022.

arXiv:2210.10828 [pdf, other]

Grounded Video Situation Recognition

Authors: Zeeshan Khan, C. V. Jawahar, Makarand Tapaswi

Abstract: Dense video understanding requires answering several questions such as who is doing what to whom, with what, how, why, and where. Recently, Video Situation Recognition (VidSitu) was framed as a task for structured prediction of multiple events, their relationships, and actions and various verb-role pairs attached to descriptive entities. This task poses several challenges in identifying, disambiguating, and co-referencing entities across multiple verb-role pairs, but also faces some challenges of evaluation. In this work, we propose the addition of spatio-temporal grounding as an essential component of the structured prediction task in a weakly supervised setting, and present a novel three stage Transformer model, VideoWhisperer, that is empowered to make joint predictions. In stage one, we learn contextualised embeddings for video features in parallel with key objects that appear in the video clips to enable fine-grained spatio-temporal reasoning. The second stage sees verb-role queries attend and pool information from object embeddings, localising answers to questions posed about the action. The final stage generates these answers as captions to describe each verb-role pair present in the video. Our model operates on a group of events (clips) simultaneously and predicts verbs, verb-role pairs, their nouns, and their grounding on-the-fly. When evaluated on a grounding-augmented version of the VidSitu dataset, we observe a large improvement in entity captioning accuracy, as well as the ability to localize verb-roles without grounding annotations at training time.

Submitted 19 October, 2022; originally announced October 2022.

Comments: Accepted to NeurIPS 2022. Project Page: https://zeeshank95.github.io/grvidsitu

arXiv:2210.04429 [pdf, other]

DeepHS-HDRVideo: Deep High Speed High Dynamic Range Video Reconstruction

Authors: Zeeshan Khan, Parth Shettiwar, Mukul Khanna, Shanmuganathan Raman

Abstract: Due to hardware constraints, standard off-the-shelf digital cameras suffer from low dynamic range (LDR) and low frames per second (FPS) outputs. Previous works in high dynamic range (HDR) video reconstruction use a sequence of alternating exposure LDR frames as input, and align the neighbouring frames using optical flow based networks. However, these methods often result in motion artifacts in challenging situations. This is because the alternate exposure frames have to be exposure matched in order to apply alignment using optical flow. Hence, over-saturation and noise in the LDR frames result in inaccurate alignment. To this end, we propose to align the input LDR frames using a pre-trained video frame interpolation network. This results in better alignment of LDR frames, since we circumvent the error-prone exposure matching step, and directly generate intermediate missing frames from the same exposure inputs. Furthermore, it allows us to generate high FPS HDR videos by recursively interpolating the intermediate frames. Through this work, we propose to use video frame interpolation for HDR video reconstruction, and present the first method to generate high FPS HDR videos. Experimental results demonstrate the efficacy of the proposed framework against optical flow based alignment methods, with an absolute improvement of 2.4 in PSNR on standard HDR video datasets [1], [2]. We further benchmark our method for high FPS HDR video generation.

Submitted 10 October, 2022; originally announced October 2022.

Comments: ICPR 2022

arXiv:2210.03168 [pdf, other]

doi 10.1109/IEMCON56893.2022.9946531

Gastrointestinal Disorder Detection with a Transformer Based Approach

Authors: A. K. M. Salman Hosain, Mynul Islam, Md Humaion Kabir Mehedi, Irteza Enan Kabir, Zarin Tasnim Khan

Abstract: Accurate disease categorization using endoscopic images is a significant problem in Gastroenterology. This paper describes a technique for assisting medical diagnosis procedures and identifying gastrointestinal tract disorders based on the categorization of characteristics taken from endoscopic pictures using a vision transformer and transfer learning model. Vision transformer has shown very promising results on difficult image classification tasks. In this paper, we have suggested a vision transformer based approach to detect gastrointestinal diseases from wireless capsule endoscopy (WCE) curated images of colon with an accuracy of 95.63%. We have compared this transformer based approach with pretrained convolutional neural network (CNN) model DenseNet201 and demonstrated that vision transformer surpassed DenseNet201 in various quantitative performance evaluation metrics.

Submitted 6 October, 2022; originally announced October 2022.

arXiv:2206.12481 [pdf, other]

Analyzing Explainer Robustness via Probabilistic Lipschitzness of Prediction Functions

Authors: Zulqarnain Khan, Davin Hill, Aria Masoomi, Joshua Bone, Jennifer Dy

Abstract: Machine learning methods have significantly improved in their predictive capabilities, but at the same time they are becoming more complex and less transparent. As a result, explainers are often relied on to provide interpretability to these black-box prediction models. As crucial diagnostics tools, it is important that these explainers themselves are robust. In this paper we focus on one particular aspect of robustness, namely that an explainer should give similar explanations for similar data inputs. We formalize this notion by introducing and defining explainer astuteness, analogous to astuteness of prediction functions. Our formalism allows us to connect explainer robustness to the predictor's probabilistic Lipschitzness, which captures the probability of local smoothness of a function. We provide lower bound guarantees on the astuteness of a variety of explainers (e.g., SHAP, RISE, CXPlain) given the Lipschitzness of the prediction function. These theoretical results imply that locally smooth prediction functions lend themselves to locally robust explanations. We evaluate these results empirically on simulated as well as real datasets.

Submitted 16 April, 2024; v1 submitted 24 June, 2022; originally announced June 2022.

arXiv:2205.15111 [pdf, ps, other]

A k nearest neighbours classifiers ensemble based on extended neighbourhood rule and features subsets

Authors: Amjad Ali, Muhammad Hamraz, Naz Gul, Dost Muhammad Khan, Zardad Khan, Saeed Aldahmani

Abstract: kNN based ensemble methods minimise the effect of outliers by identifying a set of data points in the given feature space that are nearest to an unseen observation in order to predict its response by using majority voting. The ordinary ensembles based on kNN find out the k nearest observations in a region (bounded by a sphere) based on a predefined value of k. This scenario, however, might not work in situations when the test observation follows the pattern of the closest data points with the same class that lie on a certain path not contained in the given sphere. This paper proposes a k nearest neighbour ensemble where the neighbours are determined in k steps. Starting from the first nearest observation of the test point, the algorithm identifies a single observation that is closest to the observation at the previous step. At each base learner in the ensemble, this search is extended to k steps on a random bootstrap sample with a random subset of features selected from the feature space. The final predicted class of the test point is determined by using a majority vote in the predicted classes given by all base models. This new ensemble method is applied on 17 benchmark datasets and compared with other classical methods, including kNN based models, in terms of classification accuracy, kappa and Brier score as performance metrics. Boxplots are also utilised to illustrate the difference in the results given by the proposed and other state-of-the-art methods. The proposed method outperformed the rest of the classical methods in the majority of cases. The paper gives a detailed simulation study for further assessment.
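
Illustrative sketch (not the authors' code): the stepwise neighbour search described in this abstract can be read as follows. The first neighbour is the training point nearest to the test point, and each later neighbour is the unused point nearest to the previously chosen one. The function names, the Euclidean distance, the default of roughly sqrt(p) features per base learner, and the tie-breaking rule are all assumptions made for this example.

```python
import math
import random
from collections import Counter


def exn_neighbours(X, query, k):
    """Pick k neighbours stepwise: the first is nearest to the query, each
    later one is nearest to the previously chosen point (assumed reading)."""
    remaining = set(range(len(X)))
    anchor = query
    chosen = []
    for _ in range(min(k, len(X))):
        # Ties are broken by the lower index (an assumption of this sketch).
        nxt = min(remaining, key=lambda i: (math.dist(X[i], anchor), i))
        chosen.append(nxt)
        remaining.remove(nxt)
        anchor = X[nxt]  # the next step searches around this point
    return chosen


def exn_predict(X, y, query, k):
    """Majority vote over the stepwise neighbours of one base learner."""
    votes = Counter(y[i] for i in exn_neighbours(X, query, k))
    return votes.most_common(1)[0][0]


def exn_ensemble_predict(X, y, query, k=3, n_models=25, n_features=None, seed=0):
    """Majority vote over base learners, each built on a bootstrap sample
    restricted to a random feature subset."""
    rng = random.Random(seed)
    p = len(X[0])
    n_features = n_features or max(1, int(math.sqrt(p)))  # hypothetical default
    votes = Counter()
    for _ in range(n_models):
        rows = [rng.randrange(len(X)) for _ in X]   # bootstrap rows
        cols = rng.sample(range(p), n_features)     # random feature subset
        Xb = [[X[r][c] for c in cols] for r in rows]
        yb = [y[r] for r in rows]
        qb = [query[c] for c in cols]
        votes[exn_predict(Xb, yb, qb, k)] += 1
    return votes.most_common(1)[0][0]


# Points 0..2 lie along a path; point 3 sits inside the query's sphere instead.
X = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [-1.5, 0.0]]
y = ["a", "a", "a", "b"]
path = exn_neighbours(X, [-0.4, 0.0], k=3)
```

In this toy data an ordinary 3-nearest-neighbour search around the query would include point 3, whereas the stepwise search follows the path through points 0, 1 and 2.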

Submitted 30 May, 2022; originally announced May 2022.

Comments: This paper is submitted to Pattern Recognition and has 26 pages, 9 figures and 5 tables

arXiv:2205.01225 [pdf]

A Hybrid Defense Method against Adversarial Attacks on Traffic Sign Classifiers in Autonomous Vehicles

Authors: Zadid Khan, Mashrur Chowdhury, Sakib Mahmud Khan

Abstract: Adversarial attacks can make deep neural network (DNN) models predict incorrect output labels, such as misclassified traffic signs, for autonomous vehicle (AV) perception modules. Resilience against adversarial attacks can help AVs navigate safely on the road by avoiding misclassification of signs or objects. This DNN-based study develops a resilient traffic sign classifier for AVs that uses a hybrid defense method. We use transfer learning to retrain the Inception-V3 and Resnet-152 models as traffic sign classifiers. This method also utilizes a combination of three different strategies: random filtering, ensembling, and local feature mapping. We use the random cropping and resizing technique for random filtering, plurality voting as ensembling strategy and an optical character recognition model as a local feature mapper. This DNN-based hybrid defense method has been tested for the no attack scenario and against well-known untargeted adversarial attacks (e.g., Projected Gradient Descent or PGD, Fast Gradient Sign Method or FGSM, Momentum Iterative Method or MIM attack, and Carlini and Wagner or C&W). We find that our hybrid defense method achieves 99% average traffic sign classification accuracy for the no attack scenario and 88% average traffic sign classification accuracy for all attack scenarios. Moreover, the hybrid defense method, presented in this study, improves the accuracy for traffic sign classification compared to the traditional defense methods (i.e., JPEG filtering, feature squeezing, binary filtering, and random filtering) up to 6%, 50%, and 55% for FGSM, MIM, and PGD attacks, respectively.

Submitted 24 April, 2022; originally announced May 2022.

Comments: 13 pages, 8 figures

arXiv:2203.14395 [pdf, other]

Single-Stream Multi-Level Alignment for Vision-Language Pretraining

Authors: Zaid Khan, Vijay Kumar BG, Xiang Yu, Samuel Schulter, Manmohan Chandraker, Yun Fu

Abstract: Self-supervised vision-language pretraining from pure images and text with a contrastive loss is effective, but ignores fine-grained alignment due to a dual-stream architecture that aligns image and text representations only on a global level. Earlier, supervised, non-contrastive methods were capable of finer-grained alignment, but required dense annotations that were not scalable. We propose a single stream architecture that aligns images and language at multiple levels: global, fine-grained patch-token, and conceptual/semantic, using two novel tasks: symmetric cross-modality reconstruction (XMM) and a pseudo-labeled key word prediction (PSL). In XMM, we mask input tokens from one modality and use cross-modal information to reconstruct the masked token, thus improving fine-grained alignment between the two modalities. In PSL, we use attention to select keywords in a caption, use a momentum encoder to recommend other important keywords that are missing from the caption but represented in the image, and then train the visual encoder to predict the presence of those keywords, helping it learn semantic concepts that are essential for grounding a textual token to an image region. We demonstrate competitive performance and improved data efficiency on image-text retrieval, grounding, visual question answering/reasoning against larger models and models trained on more data. Code and models available at zaidkhan.me/SIMLA.

Submitted 27 July, 2022; v1 submitted 27 March, 2022; originally announced March 2022.

Comments: ECCV 2022

arXiv:2203.09348 [pdf, other]

doi 10.1109/SMARTTECH54121.2022.00018

POSTER: Diagnosis of COVID-19 through Transfer Learning Techniques on CT Scans: A Comparison of Deep Learning Models

Authors: Aeyan Ashraf, Asad Malik, Zahid Khan

Abstract: The novel coronavirus disease (COVID-19) constitutes a public health emergency globally. It is a deadly disease which has infected more than 230 million people worldwide. Therefore, early and unswerving detection of COVID-19 is necessary. Evidence of this virus is most commonly being tested by RT-PCR test. This test is not 100% reliable as it is known to give false positives and false negatives. Other methods like X-Ray images or CT scans show the detailed imaging of lungs and have been proven more reliable. This paper compares different deep learning models used to detect COVID-19 through transfer learning technique on CT scan dataset. VGG-16 outperforms all the other models achieving an accuracy of 85.33% on the dataset.

Submitted 17 March, 2022; originally announced March 2022.

Journal ref: 2022 2nd International Conference of Smart Systems and Emerging Technologies (SMARTTECH)

arXiv:2202.04517 [pdf, other]

doi 10.1016/j.compmedimag.2022.102121

A Neural Network based Framework for Effective Laparoscopic Video Quality Assessment

Authors: Zohaib Amjad Khan, Azeddine Beghdadi, Mounir Kaaniche, Faouzi Alaya Cheikh, Osama Gharbi

Abstract: Video quality assessment is a challenging problem having a critical significance in the context of medical imaging. For instance, in laparoscopic surgery, the acquired video data suffers from different kinds of distortion that not only hinder surgery performance but also affect the execution of subsequent tasks in surgical navigation and robotic surgeries. For this reason, we propose in this paper neural network-based approaches for distortion classification as well as quality prediction. More precisely, a Residual Network (ResNet) based approach is firstly developed for simultaneous ranking and classification task. Then, this architecture is extended to make it appropriate for the quality prediction task by using an additional Fully Connected Neural Network (FCNN). To train the overall architecture (ResNet and FCNN models), transfer learning and end-to-end learning approaches are investigated. Experimental results, carried out on a new laparoscopic video quality database, have shown the efficiency of the proposed methods compared to recent conventional and deep learning based approaches.

Submitted 14 April, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

arXiv:2201.06180 [pdf, other]

Nonlinear Control Allocation: A Learning Based Approach

Authors: Hafiz Zeeshan Iqbal Khan, Surrayya Mobeen, Jahanzeb Rajput, Jamshed Riaz

Abstract: Modern aircraft are designed with redundant control effectors to cater for fault tolerance and maneuverability requirements. This leads to aircraft being over-actuated and requires control allocation schemes to distribute the control commands among control effectors. Traditionally, optimization-based control allocation schemes are used; however, for nonlinear allocation problems, these methods require large computational resources. In this work, an artificial neural network (ANN) based nonlinear control allocation scheme is proposed. The proposed scheme is composed of learning the inverse of the control effectiveness map through ANN, and then implementing it as an allocator instead of solving an online optimization problem. Stability conditions are presented for closed-loop systems incorporating the allocator, and computational challenges are explored with piece-wise linear effectiveness functions and ANN-based allocators. To demonstrate the efficacy of the proposed scheme, it is compared with a standard quadratic programming-based method for control allocation.

Submitted 27 March, 2024; v1 submitted 16 January, 2022; originally announced January 2022.

Comments: submitted to IEEE Conference on Decision and Control (CDC), 2024

arXiv:2112.09296 [pdf, other]

A Survey on the Applications of Blockchains in Security of IoT Systems

Authors: Zulfiqar Ali Khan, Akbar Siami Namin

Abstract: The Internet of Things (IoT) has already changed our daily lives by integrating smart devices together towards delivering high quality services to its clients. These devices, when integrated together, form a network through which massive amounts of data can be produced, transferred, and shared. A critical concern is the security and integrity of such a complex platform to ensure the sustainability and reliability of these IoT-based systems. Blockchain is an emerging technology that has demonstrated its unique features and capabilities for different problems and application domains including IoT-based systems. This survey paper reviews the adaptation of Blockchain in the context of IoT to represent how this technology is capable of addressing the integration and security problems of devices connected to IoT systems. The innovation of this survey is that we present a survey based upon the integration approaches and security issues of IoT data and discuss the role of Blockchain in connection with these issues.

Submitted 16 December, 2021; originally announced December 2021.

Comments: 15 pages, IEEE Bigdata 2021

arXiv:2112.03796 [pdf, other]

doi 10.1109/JLT.2022.3148322

New Lower Bounds on the Capacity of Optical Fiber Channels via Optimized Shaping and Detection

Authors: Marco Secondini, Stella Civelli, Enrico Forestieri, Lareb Zar Khan

Abstract: Constellation shaping is a practical and effective technique to improve the performance and the rate adaptivity of optical communication systems. In principle, it could also be used to mitigate the impact of nonlinear effects, possibly increasing the information rate beyond the current limit dictated by fiber nonlinearity. However, this appealing idea is frustrated by the difficulty of designing an effective shaping strategy that takes into account the nonlinearity and long memory of the fiber channel, as well as the possible interplay with other nonlinearity mitigation strategies. As a result, only little progress has been made so far, while the optimal shaping distribution and the ultimate channel capacity remain unknown. In this work, we describe a novel technique to optimize the shaping distribution in a very general setting and high-dimensional space. For a simplified block-memoryless nonlinear optical channel, the capacity lower bound obtained by the proposed technique can be expressed analytically, establishing the conditions for an unbounded growth of capacity with power. In a more realistic scenario, the technique can be implemented by a rejection sampling algorithm driven by a suitable cost function, and the corresponding achievable information rate estimated numerically. The combination of the proposed technique with an improved (non-Gaussian) decoding metric yields a new capacity lower bound for the dual-polarization WDM channel.

Submitted 7 December, 2021; originally announced December 2021.

Comments: Submitted to IEEE Journal of Lightwave Technology on November 30th, 2021

arXiv:2110.07467 [pdf]

Hybrid Quantum-Classical Neural Network for Cloud-supported In-Vehicle Cyberattack Detection

Authors: Mhafuzul Islam, Mashrur Chowdhury, Zadid Khan, Sakib Mahmud Khan

Abstract: A classical computer works with ones and zeros, whereas a quantum computer uses ones, zeros, and superpositions of ones and zeros, which enables quantum computers to perform a vast number of calculations simultaneously compared to classical computers. In a cloud-supported cyber-physical system environment, running a machine learning application in quantum computers is often difficult, due to the existing limitations of the current quantum devices. However, with the combination of quantum-classical neural networks (NN), complex and high-dimensional features can be extracted by the classical NN to a reduced but more informative feature space to be processed by the existing quantum computers. In this study, we develop a hybrid quantum-classical NN to detect an amplitude shift cyber-attack on an in-vehicle control area network (CAN) dataset. We show that using the hybrid quantum-classical NN, it is possible to achieve an attack detection accuracy of 94%, which is higher than a long short-term memory (LSTM) NN (87%) or quantum NN alone (62%).

Submitted 14 October, 2021; originally announced October 2021.

Comments: 4 pages, 3 figures

arXiv:2109.12149 [pdf, other]

doi 10.1109/IBCAST54850.2022.9990112

Airfoil's Aerodynamic Coefficients Prediction using Artificial Neural Network

Authors: Hassan Moin, Hafiz Zeeshan Iqbal Khan, Surrayya Mobeen, Jamshed Riaz

Abstract: Figuring out the right airfoil is a crucial step in the preliminary stage of any aerial vehicle design, as its shape directly affects the overall aerodynamic characteristics of the aircraft or rotorcraft. Besides being a measure of performance, the aerodynamic coefficients are used to design additional subsystems such as a flight control system, or predict complex dynamic phenomena such as aeroelastic instability. The coefficients in question can either be obtained experimentally through wind tunnel testing or, depending upon the accuracy requirements, by numerically simulating the underlying fundamental equations of fluid dynamics. In this paper, the feasibility of applying Artificial Neural Networks (ANNs) to estimate the aerodynamic coefficients of differing airfoil geometries at varying Angle of Attack, Mach and Reynolds number is investigated. The ANNs are computational entities that have the ability to learn highly nonlinear spatial and temporal patterns. Therefore, they are increasingly being used to approximate complex real-world phenomena. However, despite their significant breakthrough in the past few years, ANNs' spreading in the field of Computational Fluid Dynamics (CFD) is fairly recent, and many applications within this field remain unexplored. This study thus compares different network architectures and training datasets in an attempt to gain insight as to how the network perceives the given airfoil geometries, while producing an acceptable neuronal model for faster and easier prediction of lift, drag and moment coefficients in steady state, incompressible flow regimes. This data-driven method produces sufficiently accurate results, with the added benefit of saving high computational and experimental costs.

Submitted 24 September, 2021; originally announced September 2021.

Journal ref: 2022 19th International Bhurban Conference on Applied Sciences and Technology (IBCAST)

arXiv:2108.05781 [pdf, other]

Networked Twins and Twins of Networks: an Overview on the Relationship Between Digital Twins and 6G

Authors: Hamed Ahmadi, Avishek Nag, Zaheer Khan, Kamran Sayrafian, Susanto Rahadrja

Abstract: Digital Twin (DT) is a promising technology for the new immersive digital life with a variety of applications in areas such as Industry 4.0, aviation, and healthcare. Proliferation of this technology requires higher data rates, reliability, resilience, and lower latency beyond what is currently offered by 5G. Thus, DT can become a major driver for 6G research and development. Alternatively, 6G network development can benefit from Digital Twin technology and its powerful features such as modularity and remote intelligence. Using DT, a 6G network (or some of its components) will have the opportunity to use Artificial Intelligence more proactively in order to enhance its resilience. DT's application in telecommunications is still in its infancy. In this article we highlight some of the most promising research and development directions for this technology.

Submitted 12 August, 2021; originally announced August 2021.

Comments: Accepted for publication at IEEE Communications Standards Magazine

arXiv:2108.01682 [pdf, other]

doi 10.1145/3474085.3475692

Exploiting BERT For Multimodal Target Sentiment Classification Through Input Space Translation

Authors: Zaid Khan, Yun Fu

Abstract: Multimodal target/aspect sentiment classification combines multimodal sentiment analysis and aspect/target sentiment classification. The goal of the task is to combine vision and language to understand the sentiment towards a target entity in a sentence. Twitter is an ideal setting for the task because it is inherently multimodal, highly emotional, and affects real-world events. However, multimodal tweets are short and accompanied by complex, possibly irrelevant images. We introduce a two-stream model that translates images in input space using an object-aware transformer followed by a single-pass non-autoregressive text generation approach. We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model. Our approach increases the amount of text available to the language model and distills the object-level information in complex images. We achieve state-of-the-art performance on two multimodal Twitter datasets without modifying the internals of the language model to accept multimodal data, demonstrating the effectiveness of our translation. In addition, we explain a failure mode of a popular approach for aspect sentiment analysis when applied to tweets. Our code is available at https://github.com/codezakh/exploiting-BERT-thru-translation.

Submitted 5 August, 2021; v1 submitted 3 August, 2021; originally announced August 2021.

Comments: ACM Multimedia 2021 Oral

arXiv:2108.01127 [pdf]

Hybrid Quantum-Classical Neural Network for Incident Detection

Authors: Zadid Khan, Sakib Mahmud Khan, Jean Michel Tine, Ayse Turhan Comert, Diamon Rice, Gurcan Comert, Dimitra Michalaka, Judith Mwakalonge, Reek Majumdar, Mashrur Chowdhury

Abstract: The efficiency and reliability of real-time incident detection models directly impact the affected corridors' traffic safety and operational conditions. The recent emergence of cloud-based quantum computing infrastructure and innovations in noisy intermediate-scale quantum devices have revealed a new era of quantum-enhanced algorithms that can be leveraged to improve real-time incident detection accuracy. In this research, a hybrid machine learning model, which includes classical and quantum machine learning (ML) models, is developed to identify incidents using connected vehicle (CV) data. The incident detection performance of the hybrid model is evaluated against baseline classical ML models. The framework is evaluated using data from a microsimulation tool for different incident scenarios. The results indicate that a hybrid neural network containing a 4-qubit quantum layer outperforms all other baseline models when there is a lack of training data. We have created three datasets: DS-1 with sufficient training data, and DS-2 and DS-3 with insufficient training data. The hybrid model achieves a recall of 98.9%, 98.3%, and 96.6% for DS-1, DS-2, and DS-3, respectively. For DS-2 and DS-3, the average improvement in F2-score (a measure of the model's ability to correctly identify incidents) achieved by the hybrid model is 1.9% and 7.8%, respectively, compared to the classical models. This shows that with insufficient data, which may be common for CVs, the hybrid ML model will perform better than the classical models. With the continuing improvements of quantum computing infrastructure, the quantum ML models could be a promising alternative for CV-related applications when the available data is insufficient.

Submitted 2 August, 2021; originally announced August 2021.

Comments: 14 pages, 10 figures

arXiv:2108.01125 [pdf]

Hybrid Classical-Quantum Deep Learning Models for Autonomous Vehicle Traffic Image Classification Under Adversarial Attack

Authors: Reek Majumder, Sakib Mahmud Khan, Fahim Ahmed, Zadid Khan, Frank Ngeni, Gurcan Comert, Judith Mwakalonge, Dimitra Michalaka, Mashrur Chowdhury

Abstract: Image classification must work for autonomous vehicles (AV) operating on public roads, and actions performed based on image misclassification can have serious consequences. Traffic sign images can be misclassified by an adversarial attack on machine learning models used by AVs for traffic sign recognition. To make classification models resilient against adversarial attacks, we used a hybrid deep-learning model with both the quantum and classical layers. Our goal is to study the hybrid deep-learning architecture for classical-quantum transfer learning models to support the current era of intermediate-scale quantum technology. We have evaluated the impacts of various white box adversarial attacks on these hybrid models. The classical part of hybrid models includes a convolution network from the pre-trained Resnet18 model, which extracts informative features from a high dimensional LISA traffic sign image dataset. The output from the classical processor is processed further through the quantum layer, which is composed of various quantum gates and provides support to various quantum mechanical features like entanglement and superposition. We have tested multiple combinations of quantum circuits to provide better classification accuracy with decreasing training data and found better resiliency for our hybrid classical-quantum deep learning model during attacks compared to the classical-only machine learning models.

Submitted 2 August, 2021; originally announced August 2021.

Comments: 16 pages, 7 figures

arXiv:2107.09622 [pdf, other]

More Parameters? No Thanks!

Authors: Zeeshan Khan, Kartheek Akella, Vinay P. Namboodiri, C V Jawahar

Abstract: This work studies the long-standing problems of model capacity and negative interference in multilingual neural machine translation (MNMT). We use network pruning techniques and observe that pruning 50-70% of the parameters from a trained MNMT model results in only a 0.29-1.98 drop in the BLEU score, suggesting that large redundancies exist even in MNMT models. These observations motivate us to use the redundant parameters and counter the interference problem efficiently. We propose a novel adaptation strategy, where we iteratively prune and retrain the redundant parameters of an MNMT to improve bilingual representations while retaining the multilinguality. Negative interference severely affects high resource languages, and our method alleviates it without any additional adapter modules. Hence, we call it a parameter-free adaptation strategy, paving the way for the efficient adaptation of MNMT. We demonstrate the effectiveness of our method on a 9-language MNMT trained on TED talks, and report an average improvement of +1.36 BLEU on high resource pairs. Code will be released here.

Submitted 20 July, 2021; originally announced July 2021.

Showing 1–50 of 191 results for author: Khan, Z