Search | arXiv e-print repository

Rigid-Body Attitude Control on $\mathsf{SO(3)}$ using Nonlinear Dynamic Inversion

Authors: Hafiz Zeeshan Iqbal Khan, Farooq Aslam, Muhammad Farooq Haydar, Jamshed Riaz

Abstract: This paper presents a cascaded control architecture, based on nonlinear dynamic inversion (NDI), for rigid body attitude control. The proposed controller works directly with the rotation matrix parameterization, that is, with elements of the Special Orthogonal Group $\mathsf{SO(3)}$, and avoids problems related to singularities and non-uniqueness which affect other commonly used attitude representations such as Euler angles, unit quaternions, modified Rodrigues parameters, etc. The proposed NDI-based controller is capable of imposing desired linear dynamics of any order for the outer attitude loop and the inner rate loop, and gives control designers the flexibility to choose higher-order dynamic compensators in both loops. In addition, sufficient conditions are presented in the form of linear matrix inequalities (LMIs) which ensure that the outer loop controller renders the attitude loop almost globally asymptotically stable (AGAS) and the rate loop globally asymptotically stable (GAS). Furthermore, the overall cascaded control architecture is shown to be AGAS in the case of attitude error regulation. Lastly, the proposed scheme is compared with an Euler angles-based NDI scheme from the literature for a tracking problem involving agile maneuvering of a multicopter in a high-fidelity nonlinear simulation.

Submitted 4 September, 2024; originally announced September 2024.

Comments: 7 pages, 6 figures, accepted in IEEE Conference on Decision and Control (CDC), 2024

arXiv:2409.01184 [pdf, other]

PitVis-2023 Challenge: Workflow Recognition in videos of Endoscopic Pituitary Surgery

Authors: Adrito Das, Danyal Z. Khan, Dimitrios Psychogyios, Yitong Zhang, John G. Hanrahan, Francisco Vasconcelos, You Pang, Zhen Chen, Jinlin Wu, Xiaoyang Zou, Guoyan Zheng, Abdul Qayyum, Moona Mazher, Imran Razzak, Tianbin Li, Jin Ye, Junjun He, Szymon Płotka, Joanna Kaleta, Amine Yamlahi, Antoine Jund, Patrick Godau, Satoshi Kondo, Satoshi Kasai, Kousuke Hirasawa, et al. (7 additional authors not shown)

Abstract: The field of computer vision applied to videos of minimally invasive surgery is ever-growing. Workflow recognition pertains to the automated recognition of various aspects of a surgery, including which surgical steps are performed and which surgical instruments are used. This information can later be used to assist clinicians when learning the surgery, during live surgery, and when writing operation notes. The Pituitary Vision (PitVis) 2023 Challenge tasks the community with step and instrument recognition in videos of endoscopic pituitary surgery. This is a unique task when compared to other minimally invasive surgeries due to the smaller working space, which limits and distorts vision, and the higher frequency of instrument and step switching, which requires more precise model predictions. Participants were provided with 25 videos, with results presented at the MICCAI-2023 conference as part of the Endoscopic Vision 2023 Challenge in Vancouver, Canada, on 08-Oct-2023. There were 18 submissions from 9 teams across 6 countries, using a variety of deep learning models. A commonality between the top performing models was incorporating spatio-temporal and multi-task methods, with greater than 50% and 10% macro-F1-score improvement over purely spatial single-task models in step and instrument recognition respectively. The PitVis-2023 Challenge therefore demonstrates that state-of-the-art computer vision models in minimally invasive surgery are transferable to a new dataset, with surgery-specific techniques used to enhance performance, progressing the field further. Benchmark results are provided in the paper, and the dataset is publicly available at: https://doi.org/10.5522/04/26531686.

Submitted 2 September, 2024; originally announced September 2024.

arXiv:2408.12975 [pdf, other]

The UK Submillimetre and Millimetre Astronomy Roadmap 2024

Authors: K. Pattle, P. S. Barry, A. W. Blain, M. Booth, R. A. Booth, D. L. Clements, M. J. Currie, S. Doyle, D. Eden, G. A. Fuller, M. Griffin, P. G. Huggard, J. D. Ilee, J. Karoly, Z. A. Khan, N. Klimovich, E. Kontar, P. Klaassen, A. J. Rigby, P. Scicluna, S. Serjeant, B.-K. Tan, D. Ward-Thompson, T. G. Williams, T. A. Davis, et al. (9 additional authors not shown)

Abstract: In this Roadmap, we present a vision for the future of submillimetre and millimetre astronomy in the United Kingdom over the next decade and beyond. This Roadmap has been developed in response to the recommendation of the Astronomy Advisory Panel (AAP) of the STFC in the AAP Astronomy Roadmap 2022. In order to develop our strategic priorities and recommendations, we surveyed the UK submillimetre and millimetre community to determine their key priorities for both the near-term and long-term future of the field. We further performed detailed reviews of UK leadership in submillimetre/millimetre science and instrumentation. Our key strategic priorities are as follows: 1. The UK must be a key partner in the forthcoming AtLAST telescope, for which it is essential that the UK remains a key partner in the JCMT in the intermediate term. 2. The UK must maintain, and if possible enhance, access to ALMA and aim to lead parts of instrument development for ALMA2040. Our strategic priorities complement one another: AtLAST (a 50m single-dish telescope) and an upgraded ALMA (a large configurable interferometric array) would be in synergy, not competition, with one another. Both have identified and are working towards the same overarching science goals, and both are required in order to fully address these goals.

Submitted 3 September, 2024; v1 submitted 23 August, 2024; originally announced August 2024.

Comments: 91 pages plus cover, 38 figures. Submitted to the Science and Technology Facilities Council, August 2024. One figure corrected (v2); new appendix with STFC Q&A; corrected SMA access statement; updated references, acronyms & author list (v3)

arXiv:2408.08454 [pdf, other]

Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention

Authors: Zohaib Khan, Muhammad Khaquan, Omer Tafveez, Burhanuddin Samiwala, Agha Ali Raza

Abstract: The Transformer architecture has revolutionized deep learning through its Self-Attention mechanism, which effectively captures contextual information. However, the memory footprint of Self-Attention presents significant challenges for long-sequence tasks. Grouped Query Attention (GQA) addresses this issue by grouping queries and mean-pooling the corresponding key-value heads - reducing the number of overall parameters and memory requirements in a flexible manner without compromising model accuracy. In this work, we introduce enhancements to GQA, focusing on two novel approaches that deviate from the static nature of grouping: Key-Distributed GQA (KDGQA) and Dynamic Key-Distributed GQA (DGQA), which leverage information from the norms of the key heads to inform query allocation. Specifically, KDGQA looks at the ratios of the norms of the key heads during each forward pass, while DGQA examines the ratios of the norms as they evolve through training. Additionally, we present Perturbed GQA (PGQA) as a case study, which introduces variability in (static) group formation by subtracting noise from the attention maps. Our experiments with up-trained Vision Transformers, for Image Classification on datasets such as CIFAR-10, CIFAR-100, Food101, and Tiny ImageNet, demonstrate the promise of these variants in improving upon the original GQA through more informed and adaptive grouping mechanisms: specifically, ViT-L experiences accuracy gains of up to 8% when utilizing DGQA in comparison to GQA and other variants. We further analyze the impact of the number of Key-Value Heads on performance, underscoring the importance of utilizing query-key affinities. Code is available on GitHub.

Submitted 28 August, 2024; v1 submitted 15 August, 2024; originally announced August 2024.

Comments: 11 pages, 9 figures
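
The baseline mechanism named in the abstract above, mean-pooling key-value heads within fixed groups, can be illustrated with a minimal sketch. This is not the authors' code: the function name `mean_pool_kv_heads`, the nested-list representation of heads, and the choice of contiguous, equally sized groups are assumptions made for illustration only. The key-norm-driven allocation of KDGQA and DGQA is not shown.

```python
def mean_pool_kv_heads(heads, num_groups):
    """Mean-pool contiguous, equally sized groups of heads (static grouping).

    heads: list of H heads, each a list of rows (lists of floats) of identical shape.
    Returns a list of num_groups pooled heads with the same per-head shape.
    Hypothetical helper for illustration; not taken from the paper.
    """
    if num_groups <= 0 or len(heads) % num_groups != 0:
        raise ValueError("number of heads must be divisible by num_groups")
    group_size = len(heads) // num_groups
    pooled = []
    for g in range(num_groups):
        group = heads[g * group_size:(g + 1) * group_size]
        # zip(*group) aligns rows across the heads of this group;
        # zip(*rows) aligns individual entries across those rows.
        pooled.append([
            [sum(vals) / group_size for vals in zip(*rows)]
            for rows in zip(*group)
        ])
    return pooled


# Four key heads of shape 1x2 pooled into two shared heads.
key_heads = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]], [[7.0, 8.0]]]
print(mean_pool_kv_heads(key_heads, num_groups=2))
```

With `num_groups` equal to the number of heads, nothing is pooled; with `num_groups=1`, all heads collapse into a single shared head.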

arXiv:2408.08372 [pdf, other]

On the Algebra of the Infrared with Twisted Masses

Authors: Ahsan Z. Khan, Gregory W. Moore

Abstract: The Algebra of the Infrared \cite{Gaiotto:2015aoa} is a framework to construct local observables, interfaces, and categories of supersymmetric boundary conditions of massive $\mathcal{N}=(2,2)$ theories in two dimensions by using information only about the BPS sector. The resulting framework is known as the "web-based formalism." In this paper we initiate the generalization of the web-based formalism to include a much wider class of $\mathcal{N}=(2,2)$ quantum field theories than was discussed in \cite{Gaiotto:2015aoa}: theories with non-trivial twisted masses. The essential new ingredient is the presence of BPS particles within a fixed vacuum sector. In this paper we work out the web-based formalism for the simplest class of theories that allow for such BPS particles: theories with a single vacuum and a single twisted mass. We show that even in this simple setting there are interesting new phenomena including the emergence of Fock spaces of closed solitons and a natural appearance of Koszul dual algebras. Mathematically, studying theories with twisted masses includes studying the Fukaya-Seidel category of A-type boundary conditions for Landau-Ginzburg models defined by a closed holomorphic one-form. This paper sketches a web-based construction for the category of A-type boundary conditions for one-forms with a single Morse zero and a single non-trivial period. We demonstrate our formalism explicitly in a particularly instructive example.

Submitted 15 August, 2024; originally announced August 2024.

Comments: 112 pages

arXiv:2406.10889 [pdf, other]

VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time?

Authors: Darshana Saravanan, Darshan Singh, Varun Gupta, Zeeshan Khan, Vineet Gandhi, Makarand Tapaswi

Abstract: Compositionality is a fundamental aspect of vision-language understanding and is especially required for videos since they contain multiple entities (e.g. persons, actions, and scenes) interacting dynamically over time. Existing benchmarks focus primarily on perception capabilities. However, they do not study binding, the ability of a model to associate entities through appropriate relationships. To this end, we propose VELOCITI, a new benchmark building on complex movie clips and dense semantic role label annotations to test perception and binding in video language models (contrastive and Video-LLMs). Our perception-based tests require discriminating video-caption pairs that share similar entities, and the binding tests require models to associate the correct entity to a given situation while ignoring the different yet plausible entities that also appear in the same video. While current state-of-the-art models perform moderately well on perception tests, accuracy is near random when both entities are present in the same video, indicating that they fail at binding tests. Even the powerful Gemini 1.5 Flash has a substantial gap (16-28%) with respect to human accuracy in such binding tests.

Submitted 16 June, 2024; originally announced June 2024.

Comments: 26 pages, 17 figures, 3 tables

arXiv:2405.17788 [pdf, other]

Enhancing Road Safety: Real-Time Detection of Driver Distraction through Convolutional Neural Networks

Authors: Amaan Aijaz Sheikh, Imaad Zaffar Khan

Abstract: As we navigate our daily commutes, the threat posed by distracted drivers looms large, resulting in a troubling rise in traffic accidents. Addressing this safety concern, our project harnesses the analytical power of Convolutional Neural Networks (CNNs), with a particular emphasis on the well-established models VGG16 and VGG19. These models are acclaimed for their precision in image recognition and are meticulously tested for their ability to detect nuances in driver behavior under varying environmental conditions. Through a comparative analysis against an array of CNN architectures, this study seeks to identify the most efficient model for real-time detection of driver distractions. The ultimate aim is to incorporate the findings into vehicle safety systems, significantly boosting their capability to prevent accidents triggered by inattention. This research not only enhances our understanding of automotive safety technologies but also marks a pivotal step towards creating vehicles that are intuitively aligned with driver behaviors, ensuring safer roads for all.

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.13949 [pdf, other]

PitVQA: Image-grounded Text Embedding LLM for Visual Question Answering in Pituitary Surgery

Authors: Runlong He, Mengya Xu, Adrito Das, Danyal Z. Khan, Sophia Bano, Hani J. Marcus, Danail Stoyanov, Matthew J. Clarkson, Mobarakol Islam

Abstract: Visual Question Answering (VQA) within the surgical domain, utilizing Large Language Models (LLMs), offers a distinct opportunity to improve intra-operative decision-making and facilitate intuitive surgeon-AI interaction. However, the development of LLMs for surgical VQA is hindered by the scarcity of diverse and extensive datasets with complex reasoning tasks. Moreover, contextual fusion of the image and text modalities remains an open research challenge due to the inherent differences between these two types of information and the complexity involved in aligning them. This paper introduces PitVQA, a novel dataset specifically designed for VQA in endonasal pituitary surgery, and PitVQA-Net, an adaptation of GPT2 with a novel image-grounded text embedding for surgical VQA. PitVQA comprises 25 procedural videos and a rich collection of question-answer pairs spanning crucial surgical aspects such as phase and step recognition, context understanding, tool detection and localization, and tool-tissue interactions. PitVQA-Net consists of a novel image-grounded text embedding that projects image and text features into a shared embedding space and a GPT2 backbone with an excitation block classification head to generate contextually relevant answers within the complex domain of endonasal pituitary surgery. Our image-grounded text embedding leverages joint embedding, cross-attention and contextual representation to understand the contextual relationship between questions and surgical images. We demonstrate the effectiveness of PitVQA-Net on both the PitVQA and the publicly available EndoVis18-VQA dataset, achieving improvements in balanced accuracy of 8% and 9% over the most recent baselines, respectively. Our code and dataset are available at https://github.com/mobarakol/PitVQA.

Submitted 22 May, 2024; originally announced May 2024.

Comments: 10 pages, 3 figures

arXiv:2405.11483 [pdf, other]

MICap: A Unified Model for Identity-aware Movie Descriptions

Authors: Haran Raajesh, Naveen Reddy Desanur, Zeeshan Khan, Makarand Tapaswi

Abstract: Characters are an important aspect of any storyline, and identifying and including them in descriptions is necessary for story understanding. While previous work has largely ignored identity and generated captions with someone (anonymized names), recent work formulates id-aware captioning as a fill-in-the-blanks (FITB) task, where, given a caption with blanks, the goal is to predict person id labels. However, to predict captions with ids, a two-stage approach is required: first predict captions with someone, then fill in identities. In this work, we present a new single-stage approach that can seamlessly switch between id-aware caption generation and FITB when given a caption with blanks. Our model, Movie-Identity Captioner (MICap), uses a shared auto-regressive decoder that benefits from training with FITB and full-caption generation objectives, while the encoder can benefit from or disregard captions with blanks as input. Another challenge with id-aware captioning is the lack of a metric to capture subtle differences between person ids. To this end, we introduce iSPICE, a caption evaluation metric that focuses on identity tuples created through intermediate scene graphs. We evaluate MICap on the Large-Scale Movie Description Challenge (LSMDC), where we show a 4.2% improvement in FITB accuracy, and a 1-2% bump in classic captioning metrics.

Submitted 19 May, 2024; originally announced May 2024.

Comments: CVPR 2024, Project Page: https://katha-ai.github.io/projects/micap/

arXiv:2405.09361 [pdf, other]

Quantum operations for Kramers-Wannier duality

Authors: Maaz Khan, Syed Anausha Bin Zakir Khan, Arif Mohd

Abstract: We study the Kramers-Wannier duality for the transverse-field Ising lattice on a ring. A careful consideration of the ring boundary conditions shows that the duality has to be implemented with a proper treatment of different charge sectors of both the twisted and untwisted Ising and the dual-Ising Hilbert spaces. We construct a superoperator that explicitly maps the Ising operators to the dual-Ising operators. The superoperator naturally acts on the tensor product of the Ising and the dual-Ising Hilbert space. We then show that the relation between our superoperator and the Kramers-Wannier duality operator that maps the Ising Hilbert space to the dual-Ising Hilbert space is naturally provided by quantum operations and the duality can be understood as a quantum operation that we construct. We provide the operator-sum representation for the Kramers-Wannier quantum operations and reproduce the well-known fusion rules. In addition to providing the quantum information perspective on the Kramers-Wannier duality, our explicit protocol will also be useful in implementing the Kramers-Wannier duality on a quantum computer.

Submitted 15 May, 2024; originally announced May 2024.

Comments: 7 pages, 1 table, 4 figures

arXiv:2405.02926 [pdf, ps, other]

Study of charge changing and interaction cross sections for $4 \leq Z \leq 9$ isotopes

Authors: M. Imran, Z. Hasan, A. A. Usmani, Z. A. Khan

Abstract: The root-mean-square proton and neutron radii for $^{7,9-12,14}$Be, $^{10-15,17}$B, $^{12-19}$C, $^{14,15,17-22}$N, $^{16,18-24}$O, and $^{18-21,23-26}$F isotopes are deduced from a systematic analysis of experimental charge changing and interaction cross sections in the framework of the Glauber model. The calculations involve descriptions of nuclei based on Slater determinants using harmonic oscillator single-particle wave functions. The extracted proton and neutron radii have been examined in the light of some important features such as neutron skin thickness/halo-like structure/subshell closure observed in exotic isotopes.

Submitted 5 May, 2024; originally announced May 2024.

arXiv:2405.02907 [pdf]

Nanostructured BiVO4 Photoanodes Fabricated by Vanadium-infused Interaction for Efficient Solar Water Splitting

Authors: Amar K. Salih, Abdul Zeeshan Khan, Qasem A. Drmosh, Tarek A. Kandiel, Mohammad Qamar, Tahir Naveed Jahangir, Cuong Ton-That, Zain H. Yamani

Abstract: Bismuth vanadate (BiVO4) has emerged as a highly prospective material for photoanodes in photoelectrochemical (PEC) water oxidation. However, current limitations with this material lie in the difficulties in producing stable and continuous BiVO4 layers with efficient carrier transfer kinetics, thereby impeding its widespread application in water splitting processes. This study introduces a new fabrication approach that yields continuous, monoclinic nanostructured BiVO4 films, paving the way for their use as photoanodes in efficient PEC water oxidation for hydrogen production under solar light conditions. The fabrication involves the intercalation of vanadium (V) ions into Bi2O3 films at 450 °C. Upon interaction with V ions, the film undergoes a transformation from tetragonal Bi2O3 to monoclinic scheelite BiVO4. This synthesis method enables the fabrication of single monoclinic phase BiVO4 films with thicknesses up to 270 nm. The engineered monoclinic BiVO4 film, devoid of any pinholes that could cause carrier loss, exhibits a robust photocurrent of 1.0 mA/cm² at 1.23 V vs. RHE in a neutral electrolyte, without requiring additional modifications or doping. Moreover, we demonstrate that the incorporation of a cobalt phosphate (Co-Pi) co-catalyst into the BiVO4 photoanode significantly enhances the lifetime of photogenerated holes by a factor of nine, resulting in a further elevation of the photocurrent to 2.9 mA/cm². This remarkable PEC enhancement can be attributed to the surface state passivation by the Co-Pi co-catalyst. Our fabrication approach opens up a new facile route for producing large-scale, highly efficient BiVO4 photoanodes for PEC water splitting technology.

Submitted 5 May, 2024; originally announced May 2024.

arXiv:2404.10193 [pdf, other]

Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering

Authors: Zaid Khan, Yun Fu

Abstract: The goal of selective prediction is to allow a model to abstain when it may not be able to deliver a reliable prediction, which is important in safety-critical contexts. Existing approaches to selective prediction typically require access to the internals of a model, require retraining a model, or study only unimodal models. However, the most powerful models (e.g. GPT-4) are typically only available as black boxes with inaccessible internals, are not retrainable by end-users, and are frequently used for multimodal tasks. We study the possibility of selective prediction for vision-language models in a realistic, black-box setting. We propose using the principle of neighborhood consistency to identify unreliable responses from a black-box vision-language model in question answering tasks. We hypothesize that given only a visual question and model response, the consistency of the model's responses over the neighborhood of a visual question will indicate reliability. It is impossible to directly sample neighbors in feature space in a black-box setting. Instead, we show that it is possible to use a smaller proxy model to approximately sample from the neighborhood. We find that neighborhood consistency can be used to identify model responses to visual questions that are likely unreliable, even in adversarial settings or settings that are out-of-distribution to the proxy model.

Submitted 15 April, 2024; originally announced April 2024.

Comments: CVPR 2024
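
The idea of using agreement over a neighborhood of questions as a reliability signal, as described in the abstract above, can be sketched as follows. This is a hedged illustration, not the paper's method: the function names, the case-insensitive exact-match comparison, and the `threshold` value are all assumptions, and the step that actually produces neighbor questions (the abstract mentions a smaller proxy model) is left out entirely.

```python
def neighborhood_consistency(original_answer, neighbor_answers):
    """Fraction of neighbor answers that match the original answer.

    Matching is case-insensitive exact match after stripping whitespace,
    which is an assumption of this sketch, not something the abstract states.
    """
    if not neighbor_answers:
        raise ValueError("need at least one neighbor answer")
    reference = original_answer.strip().lower()
    matches = sum(1 for a in neighbor_answers if a.strip().lower() == reference)
    return matches / len(neighbor_answers)


def should_abstain(original_answer, neighbor_answers, threshold=0.5):
    """Abstain when consistency falls below a (hypothetical) threshold."""
    return neighborhood_consistency(original_answer, neighbor_answers) < threshold


# Answers the black-box model gave to neighboring versions of one visual question.
answers = ["yes", "no", " YES ", "maybe"]
print(neighborhood_consistency("Yes", answers))
print(should_abstain("Yes", answers))
```

In practice the threshold would be tuned on held-out data to trade coverage against risk; the value 0.5 here is only a placeholder.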

arXiv:2404.04627 [pdf, other]

Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

Authors: Zaid Khan, Vijay Kumar BG, Samuel Schulter, Yun Fu, Manmohan Chandraker

Abstract: Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks. Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs. Training an LLM to write better visual programs is an attractive prospect, but it is unclear how to accomplish this. No dataset of visual programs for training exists, and acquisition of a visual program dataset cannot be easily crowdsourced due to the need for expert annotators. To get around the lack of direct supervision, we explore improving the program synthesis abilities of an LLM using feedback from interactive experience. We propose a method where we exploit existing annotations for a vision-language task to improvise a coarse reward signal for that task, treat the LLM as a policy, and apply reinforced self-training to improve the visual program synthesis ability of the LLM for that task. We describe a series of experiments on object detection, compositional visual question answering, and image-text retrieval, and show that in each case, the self-trained LLM outperforms or performs on par with few-shot frozen LLMs that are an order of magnitude larger. Website: https://zaidkhan.me/ViReP

Submitted 6 April, 2024; originally announced April 2024.

Comments: CVPR 2024

arXiv:2403.09715 [pdf, other]

doi 10.1109/ICECCE49384.2020.9179319

Textual analysis of End User License Agreement for red-flagging potentially malicious software

Authors: Behraj Khan, Tahir Syed, Zeshan Khan, Muhammad Rafi

Abstract: New software and updates are downloaded by end users every day. Each piece of downloaded software has an End User License Agreement (EULA) associated with it, but this is rarely read. An EULA includes information to avoid legal repercussions. However, this poses a host of potential problems, such as spyware or unwanted effects on the target system. End users do not read these EULAs because of the length of the documents and because they find them extremely difficult to understand. Text summarization is one relevant solution to this kind of problem. This requires a solution that can summarize the EULA and classify it as "Benign" or "Malicious", and we propose such a solution. We extract the EULA text of different software and then classify the text using eight different supervised classifiers. We use ensemble learning to classify the EULA as benign or malicious using five different text summarization methods. An accuracy of 95.8% shows the effectiveness of the presented approach.

Submitted 11 March, 2024; originally announced March 2024.

arXiv:2402.05126 [pdf, other]

Graph Neural Network and NER-Based Text Summarization

Authors: Imaad Zaffar Khan, Amaan Aijaz Sheikh, Utkarsh Sinha

Abstract: With the abundance of data and information available today, it is nearly impossible for a human, or even a machine, to go through all of the data line by line. What one usually does is skim through the lines and retain the most important information, which in more formal terms is called summarization. Text summarization is an important task that aims to compress lengthy documents or articles into shorter, coherent representations while preserving the core information and meaning. This project introduces an innovative approach to text summarization, leveraging the capabilities of Graph Neural Networks (GNNs) and Named Entity Recognition (NER) systems. GNNs, with their exceptional ability to capture and process the relational data inherent in textual information, are adept at understanding the complex structures within large documents. Meanwhile, NER systems contribute by identifying and emphasizing key entities, ensuring that the summarization process maintains a focus on the most critical aspects of the text. By integrating these two technologies, our method aims to enhance the efficiency of summarization and to ensure a high degree of relevance in the condensed content. This project, therefore, offers a promising direction for handling the ever-increasing volume of textual data in an information-saturated world.

Submitted 4 February, 2024; originally announced February 2024.

arXiv:2401.12728 [pdf, other]

Filamentary Network and Magnetic Field Structures Revealed with BISTRO in the High-Mass Star-Forming Region NGC2264 : Global Properties and Local Magnetogravitational Configurations

Authors: Jia-Wei Wang, Patrick M. Koch, Seamus D. Clarke, Gary Fuller, Nicolas Peretto, Ya-Wen Tang, Hsi-Wei Yen, Shih-Ping Lai, Nagayoshi Ohashi, Doris Arzoumanian, Doug Johnstone, Ray Furuya, Shu-ichiro Inutsuka, Chang Won Lee, Derek Ward-Thompson, Valentin J. M. Le Gouellec, Hong-Li Liu, Lapo Fanciullo, Jihye Hwang, Kate Pattle, Frédérick Poidevin, Mehrnoosh Tahani, Takashi Onaka, Mark G. Rawlings, Eun Jung Chung, et al. (132 additional authors not shown)

Abstract: We report 850 $μ$m continuum polarization observations toward the filamentary high-mass star-forming region NGC 2264, taken as part of the B-fields In STar forming Regions Observations (BISTRO) large program on the James Clerk Maxwell Telescope (JCMT). These data reveal a well-structured non-uniform magnetic field in the NGC 2264C and 2264D regions with a prevailing orientation around 30 deg from north to east. Field strength estimates and a virial analysis for the major clumps indicate that NGC 2264C is globally dominated by gravity while in 2264D magnetic, gravitational, and kinetic energies are roughly balanced. We present an analysis scheme that utilizes the locally resolved magnetic field structures, together with the locally measured gravitational vector field and the extracted filamentary network. From this, we infer statistical trends showing that this network consists of two main groups of filaments oriented approximately perpendicular to one another. Additionally, gravity shows one dominating converging direction that is roughly perpendicular to one of the filament orientations, which is suggestive of mass accretion along this direction. Beyond these statistical trends, we identify two types of filaments. The type-I filament is perpendicular to the magnetic field with local gravity transitioning from parallel to perpendicular to the magnetic field from the outside to the filament ridge. The type-II filament is parallel to the magnetic field and local gravity. We interpret these two types of filaments as originating from the competition between radial collapsing, driven by filament self-gravity, and the longitudinal collapsing, driven by the region's global gravity.

Submitted 23 January, 2024; originally announced January 2024.

Comments: Accepted for publication in the Astrophysical Journal. 43 pages, 32 figures, and 4 tables (including Appendix)

arXiv:2401.12667 [pdf, ps, other]

Feature Selection via Robust Weighted Score for High Dimensional Binary Class-Imbalanced Gene Expression Data

Authors: Zardad Khan, Amjad Ali, Saeed Aldahmani

Abstract: In this paper, a robust weighted score for unbalanced data (ROWSU) is proposed for selecting the most discriminative feature for high dimensional gene expression binary classification with class-imbalance problem. The method addresses one of the most challenging problems of highly skewed class distributions in gene expression datasets that adversely affect the performance of classification algorithms. First, the training dataset is balanced by synthetically generating data points from minority class observations. Second, a minimum subset of genes is selected using a greedy search approach. Third, a novel weighted robust score, where the weights are computed by support vectors, is introduced to obtain a refined set of genes. The highest-scoring genes based on this approach are combined with the minimum subset of genes selected by the greedy search approach to form the final set of genes. The novel method ensures the selection of the most discriminative genes, even in the presence of skewed class distribution, thus improving the performance of the classifiers. The performance of the proposed ROWSU method is evaluated on $6$ gene expression datasets. Classification accuracy and sensitivity are used as performance metrics to compare the proposed ROWSU algorithm with several other state-of-the-art methods. Boxplots and stability plots are also constructed for a better understanding of the results. The results show that the proposed method outperforms the existing feature selection procedures based on classification performance from k nearest neighbours (kNN) and random forest (RF) classifiers.

Submitted 23 January, 2024; originally announced January 2024.

Comments: 25 pages

MSC Class: 14J60

arXiv:2401.07669 [pdf, other]

FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos

Authors: Darshan Singh S, Zeeshan Khan, Makarand Tapaswi

Abstract: While contrastive language image pretraining (CLIP) has exhibited impressive performance by learning highly semantic and generalized representations, recent works have exposed a fundamental drawback in its syntactic properties, which include interpreting fine-grained attributes, actions, spatial relations, states, and details that require compositional reasoning. One reason for this is that natural captions often do not capture all the visual details of a scene. This leads to unaddressed visual concepts being misattributed to the wrong words. The pooled image and text features end up acting as a bag of words, hence losing the syntactic information. In this work, we ask: Is it possible to enhance CLIP's fine-grained and syntactic abilities without compromising its semantic properties? We show that this is possible by adapting CLIP efficiently on a high-quality, comprehensive, and relatively small dataset. We demonstrate our adaptation strategy on VidSitu, a video situation recognition dataset annotated with verbs and rich semantic role labels (SRL). We use the SRL and verb information to create rule-based detailed captions, making sure they capture most of the visual concepts. Combined with hard negatives and hierarchical losses, these annotations allow us to learn a powerful visual representation, dubbed Fine-Grained CLIP (FiGCLIP), that preserves semantic understanding while being detail-oriented. We evaluate on five diverse vision-language tasks in both fine-tuning and zero-shot settings, achieving consistent improvements over the base CLIP model.

Submitted 15 January, 2024; originally announced January 2024.

arXiv:2311.09762 [pdf, other]

Graph Elicitation for Guiding Multi-Step Reasoning in Large Language Models

Authors: Jinyoung Park, Ameen Patel, Omar Zia Khan, Hyunwoo J. Kim, Joo-Kyung Kim

Abstract: Chain-of-Thought (CoT) prompting along with sub-question generation and answering has enhanced multi-step reasoning capabilities of Large Language Models (LLMs). However, prompting the LLMs to directly generate sub-questions is suboptimal since they sometimes generate redundant or irrelevant questions. To address this, we propose a GE-Reasoning method, which directs LLMs to generate proper sub-questions and corresponding answers. Concretely, given an input question, we first prompt the LLM to generate knowledge triplets, forming a graph representation of the question. Unlike conventional knowledge triplets, our approach allows variables as head or tail entities, effectively representing a question as knowledge triplets. Second, for each triplet, the LLM generates a corresponding sub-question and answer, using knowledge retrieval. If the prediction confidence exceeds a threshold, the sub-question and prediction are incorporated into the prompt for subsequent processing. This approach encourages sub-questions to be grounded in the extracted knowledge triplets, reducing redundancy and irrelevance. Our experiments demonstrate that our approach outperforms previous CoT prompting methods and their variants on multi-hop question answering benchmark datasets.

Submitted 22 June, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

Comments: Preprint

arXiv:2310.20081 [pdf, other]

Integrating Summarization and Retrieval for Enhanced Personalization via Large Language Models

Authors: Chris Richardson, Yao Zhang, Kellen Gillespie, Sudipta Kar, Arshdeep Singh, Zeynab Raeesy, Omar Zia Khan, Abhinav Sethy

Abstract: Personalization, the ability to tailor a system to individual users, is an essential factor in user experience with natural language processing (NLP) systems. With the emergence of Large Language Models (LLMs), a key question is how to leverage these models to better personalize user experiences. To personalize a language model's output, a straightforward approach is to incorporate past user data into the language model prompt, but this approach can result in lengthy inputs exceeding limitations on input length and incurring latency and cost issues. Existing approaches tackle such challenges by selectively extracting relevant user data (i.e. selective retrieval) to construct a prompt for downstream tasks. However, retrieval-based methods are limited by potential information loss, lack of more profound user understanding, and cold-start challenges. To overcome these limitations, we propose a novel summary-augmented approach by extending retrieval-augmented personalization with task-aware user summaries generated by LLMs. The summaries can be generated and stored offline, enabling real-world systems with runtime constraints like voice assistants to leverage the power of LLMs. Experiments show that our method, with 75% less retrieved user data, is on par with or outperforms retrieval augmentation on most tasks in the LaMP personalization benchmark. We demonstrate that offline summarization via LLMs and runtime retrieval enables better performance for personalization on a range of tasks under practical constraints.

Submitted 30 October, 2023; originally announced October 2023.

Comments: 4 pages, International Workshop on Personalized Generative AI (@CIKM 2023)

ACM Class: I.2.7; H.3.3

arXiv:2310.17954 [pdf, other]

Multivessel Coronary Artery Segmentation and Stenosis Localisation using Ensemble Learning

Authors: Muhammad Bilal, Dinis Martinho, Reiner Sim, Adnan Qayyum, Hunaid Vohra, Massimo Caputo, Taofeek Akinosho, Sofiat Abioye, Zaheer Khan, Waleed Niaz, Junaid Qadir

Abstract: Coronary angiography analysis is a common clinical task performed by cardiologists to diagnose coronary artery disease (CAD) through an assessment of atherosclerotic plaque's accumulation. This study introduces an end-to-end machine learning solution developed as part of our solution for the MICCAI 2023 Automatic Region-based Coronary Artery Disease diagnostics using x-ray angiography imagEs (ARCADE) challenge, which aims to benchmark solutions for multivessel coronary artery segmentation and potential stenotic lesion localisation from X-ray coronary angiograms. We adopted a robust baseline model training strategy to progressively improve performance, comprising five successive stages of binary class pretraining, multivessel segmentation, fine-tuning using class frequency weighted dataloaders, fine-tuning using F1-based curriculum learning strategy (F1-CLS), and finally multi-target angiogram view classifier-based collective adaptation. Unlike many other medical imaging procedures, this task exhibits a notable degree of interobserver variability. Our ensemble model combines the outputs from six baseline models using the weighted ensembling approach, which our analysis shows doubles the predictive accuracy of the proposed solution. The final prediction was further refined, targeting the correction of misclassified blobs. Our solution achieved a mean F1 score of $37.69\%$ for coronary artery segmentation, and $39.41\%$ for stenosis localisation, positioning our team in the 5th position on both leaderboards. This work demonstrates the potential of automated tools to aid CAD diagnosis, guide interventions, and improve the accuracy of stent injections in clinical settings.

Submitted 27 October, 2023; originally announced October 2023.

Comments: Submission report for ARCADE challenge hosted at MICCAI2023

arXiv:2310.17050 [pdf, other]

Exploring Question Decomposition for Zero-Shot VQA

Authors: Zaid Khan, Vijay Kumar BG, Samuel Schulter, Manmohan Chandraker, Yun Fu

Abstract: Visual question answering (VQA) has traditionally been treated as a single-step task where each question receives the same amount of effort, unlike natural human question-answering strategies. We explore a question decomposition strategy for VQA to overcome this limitation. We probe the ability of recently developed large vision-language models to use human-written decompositions and produce their own decompositions of visual questions, finding they are capable of learning both tasks from demonstrations alone. However, we show that naive application of model-written decompositions can hurt performance. We introduce a model-driven selective decomposition approach for second-guessing predictions and correcting errors, and validate its effectiveness on eight VQA tasks across three domains, showing consistent improvements in accuracy, including improvements of >20% on medical VQA datasets and boosting the zero-shot performance of BLIP-2 above chance on a VQA reformulation of the challenging Winoground task. Project Site: https://zaidkhan.me/decomposition-0shot-vqa/

Submitted 25 October, 2023; originally announced October 2023.

Comments: NeurIPS 2023 Camera Ready

arXiv:2310.17032 [pdf, other]

Quantum Long Short-Term Memory (QLSTM) vs Classical LSTM in Time Series Forecasting: A Comparative Study in Solar Power Forecasting

Authors: Saad Zafar Khan, Nazeefa Muzammil, Salman Ghafoor, Haibat Khan, Syed Mohammad Hasan Zaidi, Abdulah Jeza Aljohani, Imran Aziz

Abstract: Accurate solar power forecasting is pivotal for the global transition towards sustainable energy systems. This study conducts a meticulous comparison between Quantum Long Short-Term Memory (QLSTM) and classical Long Short-Term Memory (LSTM) models for solar power production forecasting. The primary objective is to evaluate the potential advantages of QLSTMs, leveraging their exponential representational capabilities, in capturing the intricate spatiotemporal patterns inherent in renewable energy data. Through controlled experiments on real-world photovoltaic datasets, our findings reveal promising improvements offered by QLSTMs, including accelerated training convergence and substantially reduced test loss within the initial epoch compared to classical LSTMs. These empirical results demonstrate QLSTM's potential to swiftly assimilate complex time series relationships, enabled by quantum phenomena like superposition. However, realizing QLSTM's full capabilities necessitates further research into model validation across diverse conditions, systematic hyperparameter optimization, hardware noise resilience, and applications to correlated renewable forecasting problems. With continued progress, quantum machine learning can offer a paradigm shift in renewable energy time series prediction, potentially ushering in an era of unprecedented accuracy and reliability in solar power forecasting worldwide. This pioneering work provides initial evidence substantiating quantum advantages over classical LSTM models while acknowledging present limitations. Through rigorous benchmarking grounded in real-world data, our study illustrates a promising trajectory for quantum learning in renewable forecasting.

Submitted 9 April, 2024; v1 submitted 25 October, 2023; originally announced October 2023.

Comments: 33 pages, 9 figures

arXiv:2309.13033 [pdf, ps, other]

Robust Stability Analysis of a Class of LTV Systems

Authors: Shahzad Ahmed, Hafiz Zeeshan Iqbal Khan, Jamshed Riaz

Abstract: Many physical systems are inherently time-varying in nature. When these systems are linearized around a trajectory, generally, the resulting system is Linear Time-Varying (LTV). LTV systems describe an important class of linear systems and can be thought of as a natural extension of LTI systems. However, it is well known that, unlike LTI systems, the eigenvalues of an LTV system do not determine its stability. In this paper, the stability conditions for a class of LTV systems are derived. This class is composed of piecewise LTV systems, i.e. LTV systems that are piecewise linear in time. Sufficient conditions of stability are derived in the form of linear matrix inequalities (LMIs) by using the Lyapunov stability criterion. The feasibility of LMIs guarantees the stability of a given piecewise LTV system. Furthermore, uncertain piecewise LTV systems with scalar parametric uncertainty are also considered. Sufficient conditions for robust stability of this case are also presented, which come out to be quasi-LMIs, which can be optimized using a bisection algorithm to find the bounds of uncertainty for which the system is stable. The proposed method is applied to the problem of pitch angle control of a space launch vehicle, and the results are presented.

Submitted 22 September, 2023; originally announced September 2023.

Comments: Presented at 20th International Bhurban Conference on Applied Sciences and Technology (IBCAST), 2023

arXiv:2309.13032 [pdf, ps, other]

Modelling, Simulation, and Control of a Flexible Space Launch Vehicle

Authors: Muhammad Abdullah Aamer, Qurat Ul Ain, Ushbah Kaleem, Hafiz Zeeshan Iqbal Khan, Jamshed Riaz

Abstract: Modern Space Launch Vehicles (SLVs), being slender in shape and due to the use of lightweight materials, are generally flexible in nature. This structural flexibility, when coupled with sensor and actuator dynamics, can adversely affect the control of the SLV, which may lead to vehicle instability and, in the worst-case scenario, to structural failure. This work focuses on modelling and simulation of rigid and flexible dynamics of an SLV and its interactions with the control system. SpaceX's Falcon 9 has been selected for this study. The flexible modes are calculated using modal analysis in Ansys. A high-fidelity nonlinear simulation is developed which incorporates the flexible modes and their interactions with rigid degrees of freedom. Moreover, linearized models are developed for flexible body dynamics, over the complete trajectory until the first stage's separation. Using classical control methods, attitude controllers that keep the SLV on its desired trajectory are developed, and multiple filters are designed to suppress the interactions of flexible dynamics. The designed controllers along with filters are implemented in the nonlinear simulation. Furthermore, to demonstrate the robustness of the designed controllers, Monte-Carlo simulations are carried out and results are presented.

Submitted 22 September, 2023; originally announced September 2023.

Comments: Presented at 20th International Bhurban Conference on Applied Sciences and Technology (IBCAST), 2023

arXiv:2309.10774 [pdf, ps, other]

Inverse Optimal Control Design of a VTOL Aircraft

Authors: Kinza Rehman, Hafiz Zeeshan Iqbal Khan, Muhammad Farooq Haydar

Abstract: A Vertical Takeoff and Landing (VTOL) aircraft is capable of short/vertical takeoffs and landings, thus eliminating the requirement of long runways. This feature makes such aircraft suitable for a broader variety of missions, generally not achievable by a traditional aircraft. The dynamics of the VTOL aircraft in the vertical plane have been used as a benchmark problem to demonstrate the effectiveness of different nonlinear control techniques, mainly due to being highly coupled and non-minimum phase. This paper presents the design of an inverse optimal control for a simplified VTOL aircraft model. The proposed control law is based on a Control Lyapunov Function (CLF) which exploits the normal form typically used for control design through feedback linearization. The suggested control law is compared with a dynamic feedback linearization-based control law, and a comparison is drawn based on their efficacy and robustness.

Submitted 19 September, 2023; originally announced September 2023.

Comments: Presented at 20th International Bhurban Conference on Applied Sciences and Technology (IBCAST), 2023

arXiv:2308.15827 [pdf, other]

Introducing Language Guidance in Prompt-based Continual Learning

Authors: Muhammad Gul Zain Ali Khan, Muhammad Ferjad Naeem, Luc Van Gool, Didier Stricker, Federico Tombari, Muhammad Zeshan Afzal

Abstract: Continual Learning aims to learn a single model on a sequence of tasks without having access to data from previous tasks. The biggest challenge in the domain still remains catastrophic forgetting: a loss in performance on seen classes of earlier tasks. Some existing methods rely on an expensive replay buffer to store a chunk of data from previous tasks. This, while promising, becomes expensive when the number of tasks becomes large or data cannot be stored for privacy reasons. As an alternative, prompt-based methods have been proposed that store the task information in a learnable prompt pool. This prompt pool instructs a frozen image encoder on how to solve each task. While the model faces a disjoint set of classes in each task in this setting, we argue that these classes can be encoded to the same embedding space of a pre-trained language encoder. In this work, we propose Language Guidance for Prompt-based Continual Learning (LGCL) as a plug-in for prompt-based methods. LGCL is model agnostic and introduces language guidance at the task level in the prompt pool and at the class level on the output feature of the vision encoder. We show with extensive experimentation that LGCL consistently improves the performance of prompt-based continual learning methods to set a new state of the art. LGCL achieves these performance improvements without needing any additional learnable parameters.

Submitted 30 August, 2023; originally announced August 2023.

Comments: Accepted at ICCV 2023

arXiv:2307.16262 [pdf, other]

Validating polyp and instrument segmentation methods in colonoscopy through Medico 2020 and MedAI 2021 Challenges

Authors: Debesh Jha, Vanshali Sharma, Debapriya Banik, Debayan Bhattacharya, Kaushiki Roy, Steven A. Hicks, Nikhil Kumar Tomar, Vajira Thambawita, Adrian Krenzer, Ge-Peng Ji, Sahadev Poudel, George Batchkala, Saruar Alam, Awadelrahman M. A. Ahmed, Quoc-Huy Trinh, Zeshan Khan, Tien-Phat Nguyen, Shruti Shrestha, Sabari Nathan, Jeonghwan Gwak, Ritika K. Jha, Zheyuan Zhang, Alexander Schlaefer, Debotosh Bhattacharjee, M. K. Bhuyan, et al. (8 additional authors not shown)

Abstract: Automatic analysis of colonoscopy images has been an active field of research motivated by the importance of early detection of precancerous polyps. However, detecting polyps during the live examination can be challenging due to various factors such as variation of skills and experience among the endoscopists, lack of attentiveness, and fatigue leading to a high polyp miss-rate. Deep learning has emerged as a promising solution to this challenge as it can assist endoscopists in detecting and classifying overlooked polyps and abnormalities in real time. In addition to the algorithm's accuracy, transparency and interpretability are crucial to explaining the whys and hows of the algorithm's prediction. Further, most algorithms are developed on private data, closed source, or proprietary software, and methods lack reproducibility. Therefore, to promote the development of efficient and transparent methods, we have organized the "Medico automatic polyp segmentation (Medico 2020)" and "MedAI: Transparency in Medical Image Segmentation (MedAI 2021)" competitions. We present a comprehensive summary and analyze each contribution, highlight the strength of the best-performing methods, and discuss the possibility of clinical translation of such methods. For the transparency task, a multi-disciplinary team, including expert gastroenterologists, assessed each submission and evaluated the team based on open-source practices, failure case analysis, ablation studies, usability and understandability of evaluations to gain a deeper understanding of the models' credibility for clinical deployment. Through the comprehensive analysis of the challenge, we not only highlight the advancements in polyp and surgical instrument segmentation but also encourage qualitative evaluation for building more transparent and understandable AI-based colonoscopy systems.

Submitted 6 May, 2024; v1 submitted 30 July, 2023; originally announced July 2023.

arXiv:2306.03932 [pdf, other]

Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!

Authors: Zaid Khan, Vijay Kumar BG, Samuel Schulter, Xiang Yu, Yun Fu, Manmohan Chandraker

Abstract: Finetuning a large vision language model (VLM) on a target dataset after large scale pretraining is a dominant paradigm in visual question answering (VQA). Datasets for specialized tasks such as knowledge-based VQA or VQA in non natural-image domains are orders of magnitude smaller than those for general-purpose VQA. While collecting additional labels for specialized tasks or domains can be challenging, unlabeled images are often available. We introduce SelTDA (Self-Taught Data Augmentation), a strategy for finetuning large VLMs on small-scale VQA datasets. SelTDA uses the VLM and target dataset to build a teacher model that can generate question-answer pseudolabels directly conditioned on an image alone, allowing us to pseudolabel unlabeled images. SelTDA then finetunes the initial VLM on the original dataset augmented with freshly pseudolabeled images. We describe a series of experiments showing that our self-taught data augmentation increases robustness to adversarially searched questions, counterfactual examples and rephrasings, improves domain generalization, and results in greater retention of numerical reasoning skills. The proposed strategy requires no additional annotations or architectural modifications, and is compatible with any modern encoder-decoder multimodal transformer. Code available at https://github.com/codezakh/SelTDA.

Submitted 6 June, 2023; originally announced June 2023.

Comments: CVPR 2023

arXiv:2306.01819 [pdf]

Comparative Analysis of Widely use Object-Oriented Languages

Authors: Muhammad Shoaib Farooq, Taymour zaman Khan

Abstract: Programming is an integral part of the computer science discipline. The programming environment is growing and changing rapidly, and languages are constantly evolving. Learning the object-oriented paradigm is compulsory in every computer science major, so the choice of language for teaching object-oriented principles is very important. Because of the large pool of object-oriented languages, it is difficult to choose the first programming language for teaching these principles. Many studies have suggested which language should be taught first, but there is no method to compare and evaluate these languages. In this article, we propose a comprehensive framework to evaluate the widely used object-oriented languages. The languages are evaluated on the basis of their technical and environmental features.

Submitted 2 June, 2023; originally announced June 2023.

Comments: 30 pages, figures 2

arXiv:2305.15897 [pdf, other]

doi 10.1007/s10664-024-10533-w

Impact of Log Parsing on Deep Learning-Based Anomaly Detection

Authors: Zanis Ali Khan, Donghwan Shin, Domenico Bianculli, Lionel Briand

Abstract: Software systems log massive amounts of data, recording important runtime information. Such logs are used, for example, for log-based anomaly detection, which aims to automatically detect abnormal behaviors of the system under analysis by processing the information recorded in its logs. Many log-based anomaly detection techniques based on deep learning models include a pre-processing step called log parsing. However, understanding the impact of log parsing on the accuracy of anomaly detection techniques has received surprisingly little attention so far. Investigating what are the key properties log parsing techniques should ideally have to help anomaly detection is therefore warranted. In this paper, we report on a comprehensive empirical study on the impact of log parsing on anomaly detection accuracy, using 13 log parsing techniques, seven anomaly detection techniques (five based on deep learning and two based on traditional machine learning) on three publicly available log datasets. Our empirical results show that, despite what is widely assumed, there is no strong correlation between log parsing accuracy and anomaly detection accuracy, regardless of the metric used for measuring log parsing accuracy. Moreover, we experimentally confirm existing theoretical results showing that it is a property that we refer to as distinguishability in log parsing results as opposed to their accuracy that plays an essential role in achieving accurate anomaly detection.

Submitted 19 August, 2024; v1 submitted 25 May, 2023; originally announced May 2023.

Journal ref: Empir Software Eng 29, 139 (2024)

arXiv:2305.06934 [pdf, other]

Humans are Still Better than ChatGPT: Case of the IEEEXtreme Competition

Authors: Anis Koubaa, Basit Qureshi, Adel Ammar, Zahid Khan, Wadii Boulila, Lahouari Ghouti

Abstract: Since the release of ChatGPT, numerous studies have highlighted the remarkable performance of ChatGPT, which often rivals or even surpasses human capabilities in various tasks and domains. However, this paper presents a contrasting perspective by demonstrating an instance where human performance excels in typical tasks suited for ChatGPT, specifically in the domain of computer programming. We utilize the IEEExtreme Challenge competition as a benchmark, a prestigious, annual international programming contest encompassing a wide range of problems with different complexities. To conduct a thorough evaluation, we selected and executed a diverse set of 102 challenges, drawn from five distinct IEEExtreme editions, using three major programming languages: Python, Java, and C++. Our empirical analysis provides evidence that contrary to popular belief, human programmers maintain a competitive edge over ChatGPT in certain aspects of problem-solving within the programming context. In fact, we found that the average score obtained by ChatGPT on the set of IEEExtreme programming problems is 3.9 to 5.8 times lower than the average human score, depending on the programming language. This paper elaborates on these findings, offering critical insights into the limitations and potential areas of improvement for AI-based language models like ChatGPT.

Submitted 10 May, 2023; originally announced May 2023.

Comments: 9 pages, 3 figures

arXiv:2304.09756 [pdf, other]

Contactless Human Activity Recognition using Deep Learning with Flexible and Scalable Software Define Radio

Authors: Muhammad Zakir Khan, Jawad Ahmad, Wadii Boulila, Matthew Broadbent, Syed Aziz Shah, Anis Koubaa, Qammer H. Abbasi

Abstract: Ambient computing is gaining popularity as a major technological advancement for the future. The modern era has witnessed a surge in the advancement in healthcare systems, with viable radio frequency solutions proposed for remote and unobtrusive human activity recognition (HAR). Specifically, this study investigates the use of Wi-Fi channel state information (CSI) as a novel method of ambient sensing that can be employed as a contactless means of recognizing human activity in indoor environments. These methods avoid additional costly hardware required for vision-based systems, which are privacy-intrusive, by (re)using Wi-Fi CSI for various safety and security applications. During an experiment utilizing universal software-defined radio (USRP) to collect CSI samples, it was observed that a subject engaged in six distinct activities, which included no activity, standing, sitting, and leaning forward, across different areas of the room. Additionally, more CSI samples were collected when the subject walked in two different directions. This study presents a Wi-Fi CSI-based HAR system that assesses and contrasts deep learning approaches, namely convolutional neural network (CNN), long short-term memory (LSTM), and hybrid (LSTM+CNN), employed for accurate activity recognition. The experimental results indicate that LSTM surpasses current models and achieves an average accuracy of 95.3% in multi-activity classification when compared to CNN and hybrid techniques. In the future, research needs to study the significance of resilience in diverse and dynamic environments to identify the activity of multiple users.

Submitted 18 April, 2023; originally announced April 2023.

arXiv:2304.06533 [pdf]

Probing magnetic ordering in air stable iron-rich van der Waals minerals

Authors: Muhammad Zubair Khan, Oleg E. Peil, Apoorva Sharma, Oleksandr Selyshchev, Sergio Valencia, Florian Kronast, Maik Zimmermann, Muhammad Awais Aslam, Johann G. Raith, Christian Teichert, Dietrich R. T. Zahn, Georgeta Salvan, Aleksandar Matković, Chair of Physics, Department Physics, Mechanics, Electrical engineering, Montanuniversität Leoben, 8700, Leoben, Austria., Materials Center Leoben Forschung GmbH, 8700, Leoben, Austria. , et al. (24 additional authors not shown)

Abstract: In the rapidly expanding field of two-dimensional materials, magnetic monolayers show great promise for future applications in nanoelectronics, data storage, and sensing. The research in intrinsically magnetic two-dimensional materials mainly focuses on synthetic iodide and telluride based compounds, which inherently suffer from the lack of ambient stability. So far, naturally occurring layered magnetic materials have been vastly overlooked. These minerals offer a unique opportunity to explore air-stable complex layered systems with high concentration of local moment bearing ions. We demonstrate magnetic ordering in iron-rich two-dimensional phyllosilicates, focusing on mineral species of minnesotaite, annite, and biotite. These are naturally occurring van der Waals magnetic materials which integrate local moment bearing ions of iron via magnesium/aluminium substitution in their octahedral sites. Due to self-inherent capping by silicate/aluminate tetrahedral groups, ultra-thin layers are air-stable. Chemical characterization, quantitative elemental analysis, and iron oxidation states were determined via Raman spectroscopy, wavelength dispersive X-ray spectroscopy, X-ray absorption spectroscopy, and X-ray photoelectron spectroscopy. Superconducting quantum interference device magnetometry measurements were performed to examine the magnetic ordering. These layered materials exhibit paramagnetic or superparamagnetic characteristics at room temperature. At low temperature ferrimagnetic or antiferromagnetic ordering occurs, with the critical ordering temperature of 38.7 K for minnesotaite, 36.1 K for annite, and 4.9 K for biotite. In-field magnetic force microscopy on iron bearing phyllosilicates confirmed the paramagnetic response at room temperature, present down to monolayers.

Submitted 13 April, 2023; originally announced April 2023.

Comments: 19 pages, 6 figures

arXiv:2304.04161 [pdf]

Detection of COVID19 in Chest X-Ray Images Using Transfer Learning

Authors: Zanoby N. Khan

Abstract: COVID19 is a highly contagious disease that has infected millions of people worldwide. With limited testing components, screening tools such as chest radiography can assist clinicians in diagnosing the disease and assessing its progress. The performance of deep learning-based systems for diagnosis of COVID-19 disease in radiograph images has been encouraging. This paper investigates the concept of transfer learning using two of the most well-known VGGNet architectures, namely VGG-16 and VGG-19. The classifier block and hyperparameters are fine-tuned to adapt the models for automatic detection of Covid-19 in chest x-ray images. We generated two different datasets to evaluate the performance of the proposed system for the identification of positive Covid-19 instances in multiclass and binary classification problems. The experimental outcome demonstrates the usefulness of transfer learning for small-sized datasets, particularly in the field of medical imaging, not only to prevent over-fitting and convergence problems but also to attain optimal classification performance.

Submitted 9 April, 2023; originally announced April 2023.

arXiv:2304.03561 [pdf]

doi 10.1109/OJVT.2024.3437470

A Low-Complexity Diversity-Preserving Universal Bit-Flipping Enhanced Hard Decision Decoder for Arbitrary Linear Codes

Authors: Praveen Sai Bere, Mohammed Zafar Ali Khan, Lajos Hanzo

Abstract: V2X (Vehicle-to-everything) communication relies on short messages for short-range transmissions over a fading wireless channel, yet requires high reliability and low latency. Hard-decision decoding sacrifices the preservation of diversity order, leading to pronounced performance degradation in fading channels. By contrast, soft-decision decoding retains diversity order, albeit at the cost of increased computational complexity. We introduce a novel enhanced hard-decision decoder termed as the Diversity Flip decoder (DFD) designed for preserving the diversity order. Moreover, it exhibits 'universal' applicability to all linear block codes. For a $\mathscr{C}(n,k)$ code having a minimum distance ${d_{\min}}$, the proposed decoder incurs a worst-case complexity order of $2^{({d_{\min}}-1)}-1$. Notably, for codes having low ${d_{\min}}$, this complexity represents a significant reduction compared to the popular soft and hard decision decoding algorithms. Due to its capability of maintaining diversity at a low complexity, it is eminently suitable for applications such as V2X (Vehicle-to-everything), IoT (Internet of Things), mMTC (Massive Machine type Communications), URLLC (Ultra-Reliable Low Latency Communications) and WBAN (Wireless Body Area Networks) for efficient decoding with favorable performance characteristics. The simulation results provided for various known codes and decoding algorithms validate the performance versus complexity benefits of the proposed decoder.

Submitted 7 August, 2024; v1 submitted 7 April, 2023; originally announced April 2023.

Comments: Journal of 23 pages
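
Worked example (not part of the listing): the abstract of arXiv:2304.03561 gives the worst-case complexity order as $2^{({d_{\min}}-1)}-1$. Evaluating that expression for a few minimum distances shows how small it stays for low ${d_{\min}}$. The symbol $N_{\max}$ is a label introduced here for illustration only, and the abstract does not state the unit in which this complexity is counted. The two codes named are standard examples with well-known minimum distances: the (7,4) Hamming code has ${d_{\min}}=3$ and the extended (24,12) Golay code has ${d_{\min}}=8$.

```latex
\[
  N_{\max}(d_{\min}) = 2^{(d_{\min}-1)} - 1, \qquad
  N_{\max}(3) = 2^{2} - 1 = 3, \quad
  N_{\max}(4) = 2^{3} - 1 = 7, \quad
  N_{\max}(8) = 2^{7} - 1 = 127 .
\]
```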

arXiv:2303.12210 [pdf, ps, other]

A Random Projection k Nearest Neighbours Ensemble for Classification via Extended Neighbourhood Rule

Authors: Amjad Ali, Muhammad Hamraz, Dost Muhammad Khan, Wajdan Deebani, Zardad Khan

Abstract: Ensembles based on k nearest neighbours (kNN) combine a large number of base learners, each constructed on a sample taken from a given training data. Typical kNN based ensembles determine the k closest observations in the training data bounded to a test sample point by a spherical region to predict its class. In this paper, a novel random projection extended neighbourhood rule (RPExNRule) ensemble is proposed where bootstrap samples from the given training data are randomly projected into lower dimensions for additional randomness in the base models and to preserve features information. It uses the extended neighbourhood rule (ExNRule) to fit kNN as base learners on randomly projected bootstrap samples.

Submitted 21 March, 2023; originally announced March 2023.

Comments: 23 pages, 8 diagrams, 69 references

ACM Class: F.2.2

arXiv:2303.11866 [pdf, other]

Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning

Authors: Zaid Khan, Yun Fu

Abstract: Contrastive vision-language models (e.g. CLIP) are typically created by updating all the parameters of a vision model and language model through contrastive training. Can such models be created by a small number of parameter updates to an already-trained language model and vision model? The literature describes techniques that can create vision-language models by updating a small number of parameters in a language model, but these require already aligned visual representations and are non-contrastive, hence unusable for latency-sensitive applications such as neural search. We explore the feasibility and benefits of parameter-efficient contrastive vision-language alignment through transfer learning: creating a model such as CLIP by minimally updating an already-trained vision and language model. We find that a minimal set of parameter updates ($<$7%) can achieve the same performance as full-model training, and updating specific components ($<$1% of parameters) can match 75% of full-model training. We describe a series of experiments: we show that existing knowledge is conserved more strongly in parameter-efficient training and that parameter-efficient scaling scales with model and dataset size. Where paired-image text data is scarce but strong multilingual language models exist (e.g. low resource languages), parameter-efficient training is even preferable to full-model training. Given a fixed compute budget, parameter-efficient training allows training larger models on the same hardware, achieving equivalent performance in less time. Parameter-efficient training hence constitutes an energy-efficient and effective training strategy for contrastive vision-language models that may be preferable to the full-model training paradigm for common use cases. Code and weights at https://github.com/codezakh/LilT.

Submitted 21 March, 2023; originally announced March 2023.

Comments: Accepted to ICLR 2023

arXiv:2303.00337 [pdf, other]

doi 10.1016/j.engappai.2022.105095

TAU: A Framework for Video-Based Traffic Analytics Leveraging Artificial Intelligence and Unmanned Aerial Systems

Authors: Bilel Benjdira, Anis Koubaa, Ahmad Taher Azar, Zahid Khan, Adel Ammar, Wadii Boulila

Abstract: Smart traffic engineering and intelligent transportation services are in increasing demand from governmental authorities to optimize traffic performance and thus reduce energy costs, increase the drivers' safety and comfort, ensure traffic laws enforcement, and detect traffic violations. In this paper, we address this challenge, and we leverage the use of Artificial Intelligence (AI) and Unmanned Aerial Vehicles (UAVs) to develop an AI-integrated video analytics framework, called TAU (Traffic Analysis from UAVs), for automated traffic analytics and understanding. Unlike previous works on traffic video analytics, we propose an automated object detection and tracking pipeline from video processing to advanced traffic understanding using high-resolution UAV images. TAU combines six main contributions. First, it proposes a pre-processing algorithm to adapt the high-resolution UAV image as input to the object detector without lowering the resolution. This ensures an excellent detection accuracy from high-quality features, particularly the small size of detected objects from UAV images. Second, it introduces an algorithm for recalibrating the vehicle coordinates to ensure that vehicles are uniquely identified and tracked across the multiple crops of the same frame. Third, it presents a speed calculation algorithm based on accumulating information from successive frames. Fourth, TAU counts the number of vehicles per traffic zone based on the Ray Tracing algorithm. Fifth, TAU has a fully independent algorithm for crossroad arbitration based on the data gathered from the different zones surrounding it. Sixth, TAU introduces a set of algorithms for extracting twenty-four types of insights from the raw data collected. The code is shared here: https://github.com/bilel-bj/TAU. Video demonstrations are provided here: https://youtu.be/wXJV0H7LviU and here: https://youtu.be/kGv0gmtVEbI.

Submitted 1 March, 2023; originally announced March 2023.

Comments: This is the final proofread version submitted to Elsevier EAAI: please see the published version at: https://doi.org/10.1016/j.engappai.2022.105095

Journal ref: Engineering Applications of Artificial Intelligence, Volume 114, 2022, 105095, ISSN 0952-1976
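
Illustrative sketch (not part of the listing, and not taken from the TAU repository): the abstract of arXiv:2303.00337 says vehicles are counted per traffic zone "based on the Ray Tracing algorithm". One plausible reading, which is an assumption here, is the standard ray-casting point-in-polygon test applied to each vehicle position. The function names `point_in_polygon` and `count_per_zone`, the zone layout, and the coordinates below are all hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: cast a horizontal ray to the right of (x, y) and
    count how many polygon edges it crosses. An odd count means inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # The edge can only cross the ray if its endpoints straddle height y.
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets the horizontal line at y.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_per_zone(points, zones):
    """Count how many (x, y) points fall inside each named polygonal zone."""
    return {
        name: sum(point_in_polygon(px, py, poly) for px, py in points)
        for name, poly in zones.items()
    }


# Hypothetical data: one square zone and three vehicle centre points.
zones = {"north": [(0, 0), (4, 0), (4, 4), (0, 4)]}
vehicles = [(1, 1), (5, 5), (2, 3)]
print(count_per_zone(vehicles, zones))
```

Points that lie exactly on a zone boundary are not handled specially in this sketch; a production system would need a rule for them.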

arXiv:2302.10978 [pdf, other]

Learning to Retrieve Engaging Follow-Up Queries

Authors: Christopher Richardson, Sudipta Kar, Anjishnu Kumar, Anand Ramachandran, Omar Zia Khan, Zeynab Raeesy, Abhinav Sethy

Abstract: Open domain conversational agents can answer a broad range of targeted queries. However, the sequential nature of interaction with these systems makes knowledge exploration a lengthy task which burdens the user with asking a chain of well phrased questions. In this paper, we present a retrieval based system and associated dataset for predicting the next questions that the user might have. Such a system can proactively assist users in knowledge exploration leading to a more engaging dialog. The retrieval system is trained on a dataset which contains ~14K multi-turn information-seeking conversations with a valid follow-up question and a set of invalid candidates. The invalid candidates are generated to simulate various syntactic and semantic confounders such as paraphrases, partial entity match, irrelevant entity, and ASR errors. We use confounder specific techniques to simulate these negative examples on the OR-QuAC dataset and develop a dataset called the Follow-up Query Bank (FQ-Bank). Then, we train ranking models on FQ-Bank and present results comparing supervised and unsupervised approaches. The results suggest that we can retrieve the valid follow-ups by ranking them in higher positions compared to confounders, but further knowledge grounding can improve ranking performance.

Submitted 21 February, 2023; originally announced February 2023.

Comments: EACL 2023

arXiv:2301.04115 [pdf, other]

Sensing the Environment with 5G Scattered Signals (5G-CommSense): A Feasibility Analysis

Authors: Sandip Jana, Amit Kumar Mishra, Mohammed Zafar Ali Khan

Abstract: By making use of the sensors and AI (SensAI) algorithms for a specialized task, Application Specific INstrumentation (ASIN) framework uses less computational overhead and gives a good performance. This work evaluates the feasibility of the ASIN framework dependent Communication based Sensing (CommSense) system using 5th Generation New Radio (5G NR) infrastructure. Since our proposed system is backed up by 5G NR infra, this system is termed as 5G-CommSense. In this paper, we have used NR channel models specified by the 3rd Generation Partnership Project (3GPP) and added white Gaussian noise (AWGN) to vary the signal to noise ratio at the receiver. Finally, from our simulation result, we conclude that the proposed system is practically feasible.

Submitted 10 January, 2023; originally announced January 2023.

Comments: 3 pages, Accepted in conference

arXiv:2301.00343 [pdf, ps, other]

On the $A_{\infty}$-Category of a Holomorphic Moment Map

Authors: Ahsan Z. Khan

Abstract: Let $M$ be a hyperKähler manifold equipped with a $U(1)$ hyperKähler isometry, and let $I$ be a complex structure on $M$. In this note, we study the $A_{\infty}$-category of A-branes for the Landau-Ginzburg model with target space $(M,I)$, and superpotential being the $I$-holomorphic moment map. We show that if $I$ is a generic complex structure, the $A_{\infty}$-category is semi-simple. For exceptional complex structures, though typically not semi-simple, the category still has no instanton corrections. We illustrate the $A_{\infty}$-category at both generic and exceptional loci when $M$ is the cotangent bundle of the projective line.

Submitted 31 December, 2022; originally announced January 2023.

Comments: 13 pages

arXiv:2212.02291 [pdf, other]

I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification

Authors: Muhammad Ferjad Naeem, Muhammad Gul Zain Ali Khan, Yongqin Xian, Muhammad Zeshan Afzal, Didier Stricker, Luc Van Gool, Federico Tombari

Abstract: Recent works have shown that unstructured text (documents) from online sources can serve as useful auxiliary information for zero-shot image classification. However, these methods require access to a high-quality source like Wikipedia and are limited to a single source of information. Large Language Models (LLM) trained on web-scale text show impressive abilities to repurpose their learned knowledge for a multitude of tasks. In this work, we provide a novel perspective on using an LLM to provide text supervision for a zero-shot image classification model. The LLM is provided with a few text descriptions from different annotators as examples. The LLM is conditioned on these examples to generate multiple text descriptions for each class (referred to as views). Our proposed model, I2MVFormer, learns multi-view semantic embeddings for zero-shot image classification with these class views. We show that each text view of a class provides complementary information allowing a model to learn a highly discriminative class embedding. Moreover, we show that I2MVFormer is better at consuming the multi-view text supervision from LLM compared to baseline models. I2MVFormer establishes a new state-of-the-art on three public benchmark datasets for zero-shot image classification with unsupervised semantic embeddings.

Submitted 5 December, 2022; originally announced December 2022.

arXiv:2211.11278 [pdf, ps, other]

Optimal Extended Neighbourhood Rule $k$ Nearest Neighbours Ensemble

Authors: Amjad Ali, Zardad Khan, Dost Muhammad Khan, Saeed Aldahmani

Abstract: The traditional k nearest neighbor (kNN) approach uses a distance formula within a spherical region to determine the k closest training observations to a test sample point. However, this approach may not work well when test point is located outside this region. Moreover, aggregating many base kNN learners can result in poor ensemble performance due to high classification errors. To address these issues, a new optimal extended neighborhood rule based ensemble method is proposed in this paper. This rule determines neighbors in k steps starting from the closest sample point to the unseen observation and selecting subsequent nearest data points until the required number of observations is reached. Each base model is constructed on a bootstrap sample with a random subset of features, and optimal models are selected based on out-of-bag performance after building a sufficient number of models. The proposed ensemble is compared with state-of-the-art methods on 17 benchmark datasets using accuracy, Cohen's kappa, and Brier score (BS). The performance of the proposed method is also assessed by adding contrived features in the original data.

Submitted 15 February, 2024; v1 submitted 21 November, 2022; originally announced November 2022.

Comments: This manuscript has been submitted for publication in the esteemed journal Pattern Recognition Letters

MSC Class: 14J60
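
Illustrative sketch (not part of the listing, and not the authors' code): the abstract of arXiv:2211.11278 describes a rule that picks neighbours in k steps, starting from the training point closest to the unseen observation and then repeatedly moving to the nearest remaining point. The version below is one reading of that sentence, under two assumptions that the abstract does not confirm: each step measures Euclidean distance from the previously selected point, and the class is decided by a plain majority vote. The name `exn_rule_predict` and the toy data are hypothetical, and the bootstrap, feature-subset, and out-of-bag model selection steps of the ensemble are left out.

```python
from collections import Counter

import numpy as np


def exn_rule_predict(X_train, y_train, x_query, k=3):
    """Chain-style neighbour search: step 1 picks the training point nearest
    to the query; each later step picks the unused training point nearest to
    the point chosen in the previous step. Returns (label, chosen_indices)."""
    X_train = np.asarray(X_train, dtype=float)
    remaining = list(range(len(X_train)))
    current = np.asarray(x_query, dtype=float)
    chosen = []
    for _ in range(min(k, len(remaining))):
        # Distances from the current anchor to every unused training point.
        dists = [np.linalg.norm(X_train[i] - current) for i in remaining]
        nearest = remaining[int(np.argmin(dists))]
        chosen.append(nearest)
        remaining.remove(nearest)
        current = X_train[nearest]  # the next step searches from this point
    votes = Counter(y_train[i] for i in chosen)
    return votes.most_common(1)[0][0], chosen


# Hypothetical toy data: four points on a line with two classes.
X = [[0, 0], [1, 0], [2, 0], [10, 0]]
y = ["a", "a", "b", "b"]
label, path = exn_rule_predict(X, y, [-0.5, 0], k=3)
print(label, path)
```

Unlike plain kNN, the selected points need not all lie within one sphere around the query, since each step re-centres the search on the last chosen point.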

arXiv:2210.11557 [pdf, other]

Learning Attention Propagation for Compositional Zero-Shot Learning

Authors: Muhammad Gul Zain Ali Khan, Muhammad Ferjad Naeem, Luc Van Gool, Alain Pagani, Didier Stricker, Muhammad Zeshan Afzal

Abstract: Compositional zero-shot learning aims to recognize unseen compositions of seen visual primitives of object classes and their states. While all primitives (states and objects) are observable during training in some combination, their complex interaction makes this task especially hard. For example, wet changes the visual appearance of a dog very differently from a bicycle. Furthermore, we argue that relationships between compositions go beyond shared states or objects. A cluttered office can contain a busy table; even though these compositions don't share a state or object, the presence of a busy table can guide the presence of a cluttered office. We propose a novel method called Compositional Attention Propagated Embedding (CAPE) as a solution. The key intuition to our method is that a rich dependency structure exists between compositions arising from complex interactions of primitives in addition to other dependencies between compositions. CAPE learns to identify this structure and propagates knowledge between them to learn class embedding for all seen and unseen compositions. In the challenging generalized compositional zero-shot setting, we show that our method outperforms previous baselines to set a new state-of-the-art on three publicly available benchmarks.

Submitted 20 October, 2022; originally announced October 2022.

arXiv:2210.10828 [pdf, other]

Grounded Video Situation Recognition

Authors: Zeeshan Khan, C. V. Jawahar, Makarand Tapaswi

Abstract: Dense video understanding requires answering several questions such as who is doing what to whom, with what, how, why, and where. Recently, Video Situation Recognition (VidSitu) is framed as a task for structured prediction of multiple events, their relationships, and actions and various verb-role pairs attached to descriptive entities. This task poses several challenges in identifying, disambiguating, and co-referencing entities across multiple verb-role pairs, but also faces some challenges of evaluation. In this work, we propose the addition of spatio-temporal grounding as an essential component of the structured prediction task in a weakly supervised setting, and present a novel three stage Transformer model, VideoWhisperer, that is empowered to make joint predictions. In stage one, we learn contextualised embeddings for video features in parallel with key objects that appear in the video clips to enable fine-grained spatio-temporal reasoning. The second stage sees verb-role queries attend and pool information from object embeddings, localising answers to questions posed about the action. The final stage generates these answers as captions to describe each verb-role pair present in the video. Our model operates on a group of events (clips) simultaneously and predicts verbs, verb-role pairs, their nouns, and their grounding on-the-fly. When evaluated on a grounding-augmented version of the VidSitu dataset, we observe a large improvement in entity captioning accuracy, as well as the ability to localize verb-roles without grounding annotations at training time.

Submitted 19 October, 2022; originally announced October 2022.

Comments: Accepted to NeurIPS 2022. Project Page: https://zeeshank95.github.io/grvidsitu

arXiv:2210.04429 [pdf, other]

DeepHS-HDRVideo: Deep High Speed High Dynamic Range Video Reconstruction

Authors: Zeeshan Khan, Parth Shettiwar, Mukul Khanna, Shanmuganathan Raman

Abstract: Due to hardware constraints, standard off-the-shelf digital cameras suffer from low dynamic range (LDR) and low frames per second (FPS) outputs. Previous works in high dynamic range (HDR) video reconstruction use a sequence of alternating exposure LDR frames as input, and align the neighbouring frames using optical flow based networks. However, these methods often result in motion artifacts in challenging situations. This is because the alternate exposure frames have to be exposure matched in order to apply alignment using optical flow. Hence, over-saturation and noise in the LDR frames result in inaccurate alignment. To this end, we propose to align the input LDR frames using a pre-trained video frame interpolation network. This results in better alignment of LDR frames, since we circumvent the error-prone exposure matching step, and directly generate intermediate missing frames from the same exposure inputs. Furthermore, it allows us to generate high FPS HDR videos by recursively interpolating the intermediate frames. Through this work, we propose to use video frame interpolation for HDR video reconstruction, and present the first method to generate high FPS HDR videos. Experimental results demonstrate the efficacy of the proposed framework against optical flow based alignment methods, with an absolute PSNR improvement of 2.4 on standard HDR video datasets [1], [2]. We further benchmark our method for high FPS HDR video generation.

Submitted 10 October, 2022; originally announced October 2022.

Comments: ICPR 2022

arXiv:2210.03168 [pdf, other]

doi 10.1109/IEMCON56893.2022.9946531

Gastrointestinal Disorder Detection with a Transformer Based Approach

Authors: A. K. M. Salman Hosain, Mynul Islam, Md Humaion Kabir Mehedi, Irteza Enan Kabir, Zarin Tasnim Khan

Abstract: Accurate disease categorization using endoscopic images is a significant problem in gastroenterology. This paper describes a technique for assisting medical diagnosis procedures and identifying gastrointestinal tract disorders based on the categorization of characteristics taken from endoscopic pictures using a vision transformer and transfer learning model. Vision transformers have shown very promising results on difficult image classification tasks. In this paper, we propose a vision transformer based approach to detect gastrointestinal diseases from curated wireless capsule endoscopy (WCE) images of the colon with an accuracy of 95.63%. We compare this transformer based approach with the pretrained convolutional neural network (CNN) model DenseNet201 and demonstrate that the vision transformer surpasses DenseNet201 in various quantitative performance evaluation metrics.

Submitted 6 October, 2022; originally announced October 2022.

arXiv:2209.07387 [pdf, other]

Holomorphic Surface Defects in Four-Dimensional Chern-Simons Theory

Authors: Ahsan Z. Khan

Abstract: We derive the framing anomaly of four-dimensional holomorphic-topological Chern-Simons theory formulated on the product of a topological surface and the complex plane. We show that the presence of this anomaly allows one to couple four-dimensional Chern-Simons theory to holomorphic field theories with Kac-Moody symmetry, where the Kac-Moody level $k$ is critical $k=-h^{\vee}$. Applying this result to a holomorphic sigma model into a complex coadjoint orbit, we derive that four-dimensional Chern-Simons theory admits holomorphic monodromy defects.

Submitted 15 September, 2022; originally announced September 2022.

Comments: 41 pages

Showing 1–50 of 366 results for author: Khan, Z