Search | arXiv e-print repository

Video Generation with Learned Action Prior

Authors: Meenakshi Sarkar, Devansh Bhardwaj, Debasish Ghose

Abstract: Stochastic video generation is particularly challenging when the camera is mounted on a moving platform, as camera motion interacts with observed image pixels, creating complex spatio-temporal dynamics and making the problem partially observable. Existing methods typically address this by focusing on raw pixel-level image reconstruction without explicitly modelling camera motion dynamics. We propose a solution by considering camera motion or action as part of the observed image state, modelling both image and action within a multi-modal learning framework. We introduce three models: Video Generation with Learning Action Prior (VG-LeAP), which treats the image-action pair as an augmented state generated from a single latent stochastic process and uses variational inference to learn the image-action latent prior; Causal-LeAP, which establishes a causal relationship between action and the observed image frame at time $t$, learning an action prior conditioned on the observed image states; and RAFI, which integrates the augmented image-action state concept into flow matching with diffusion generative processes, demonstrating that this action-conditioned image generation concept can be extended to other diffusion-based models. We emphasize the importance of multi-modal training in partially observable video generation problems through detailed empirical studies on our new video action dataset, RoAM.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2404.10204 [pdf]

doi 10.13140/RG.2.2.18751.48804

The Impact of Machine Learning on Society: An Analysis of Current Trends and Future Implications

Authors: Md Kamrul Hossain Siam, Manidipa Bhattacharjee, Shakik Mahmud, Md. Saem Sarkar, Md. Masud Rana

Abstract: Machine learning (ML) is a rapidly evolving field of technology that has the potential to greatly impact society in a variety of ways. However, there are also concerns about the potential negative effects of ML on society, such as job displacement and privacy issues. This research aimed to conduct a comprehensive analysis of the current and future impact of ML on society. The research included a thorough literature review, case studies, and surveys to gather data on the economic impact of ML, ethical and privacy implications, and public perceptions of the technology. The survey was conducted on 150 respondents from different areas. The case studies examined the impact of ML on healthcare, finance, transportation, and manufacturing. The findings of this research revealed that the majority of respondents have a moderate level of familiarity with the concept of ML, believe that it has the potential to benefit society, and think that society should prioritize the development and use of ML. Based on these findings, it was recommended that more research be conducted on the impact of ML on society, that stronger regulations and laws be developed to protect the privacy and rights of individuals when it comes to ML, that transparency and accountability in ML decision-making processes be increased, and that public education and awareness about ML be enhanced.

Submitted 15 April, 2024; originally announced April 2024.

Comments: 12 pages

arXiv:2404.05439 [pdf, other]

Action-conditioned video data improves predictability

Authors: Meenakshi Sarkar, Debasish Ghose

Abstract: Long-term video generation and prediction remain challenging tasks in computer vision, particularly in partially observable scenarios where cameras are mounted on moving platforms. The interaction between observed image frames and the motion of the recording agent introduces additional complexities. To address these issues, we introduce the Action-Conditioned Video Generation (ACVG) framework, a novel approach that investigates the relationship between actions and generated image frames through a deep dual Generator-Actor architecture. ACVG generates video sequences conditioned on the actions of robots, enabling exploration and analysis of how vision and action mutually influence one another in dynamic environments. We evaluate the framework's effectiveness on an indoor robot motion dataset which consists of sequences of image frames along with the sequences of actions taken by the robotic agent, conducting a comprehensive empirical study comparing ACVG to other state-of-the-art frameworks along with a detailed ablation study.

Submitted 8 April, 2024; originally announced April 2024.

arXiv:2402.14702 [pdf, other]

InfFeed: Influence Functions as a Feedback to Improve the Performance of Subjective Tasks

Authors: Somnath Banerjee, Maulindu Sarkar, Punyajoy Saha, Binny Mathew, Animesh Mukherjee

Abstract: Recently, influence functions have presented an apparatus for achieving explainability for deep neural models by quantifying the perturbation of individual training instances that might impact a test prediction. Our objectives in this paper are twofold. First, we incorporate influence functions as feedback into the model to improve its performance. Second, in a dataset extension exercise, we use influence functions to automatically identify data points that have been initially 'silver' annotated by some existing method and need to be cross-checked (and corrected) by annotators to improve the model performance. To meet these objectives, in this paper, we introduce InfFeed, which uses influence functions to compute the influential instances for a target instance. Toward the first objective, we adjust the label of the target instance based on its influencer(s) label. In doing this, InfFeed outperforms the state-of-the-art baselines (including LLMs) by a maximum macro F1-score margin of almost 4% for hate speech classification, 3.5% for stance classification, and 3% for irony and 2% for sarcasm detection. Toward the second objective, we show that manually re-annotating only those silver annotated data points in the extension set that have a negative influence can immensely improve the model performance, bringing it very close to the scenario where all the data points in the extension set have gold labels. This allows for a huge reduction in the number of data points that need to be manually annotated since, out of the silver annotated extension dataset, the influence function scheme picks up ~1/1000 points that need manual correction.

Submitted 9 March, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

Comments: Accepted at LREC-COLING 2024 (Long Paper)

arXiv:2401.13968 [pdf, other]

Dynamic Long-Term Time-Series Forecasting via Meta Transformer Networks

Authors: Muhammad Anwar Ma'sum, MD Rasel Sarkar, Mahardhika Pratama, Savitha Ramasamy, Sreenatha Anavatti, Lin Liu, Habibullah, Ryszard Kowalczyk

Abstract: A reliable long-term time-series forecaster is highly demanded in practice but comes across many challenges such as low computational and memory footprints as well as robustness against dynamic learning environments. This paper proposes Meta-Transformer Networks (MANTRA) to deal with the dynamic long-term time-series forecasting tasks. MANTRA relies on the concept of fast and slow learners where a collection of fast learners learns different aspects of data distributions while adapting quickly to changes. A slow learner tailors suitable representations to fast learners. Fast adaptations to dynamic environments are achieved using the universal representation transformer layers producing task-adapted representations with a small number of parameters. Our experiments using four datasets with different prediction lengths demonstrate the advantage of our approach with at least 3% improvements over the baseline algorithms for both multivariate and univariate settings. Source codes of MANTRA are publicly available at https://github.com/anwarmaxsum/MANTRA.

Submitted 25 January, 2024; originally announced January 2024.

Comments: Under Consideration in IEEE Transactions on Artificial Intelligence

arXiv:2310.17808 [pdf, other]

A Novel Fast Path Planning Approach for Mobile Devices using Hybrid Quantum Ant Colony Optimization Algorithm

Authors: Mayukh Sarkar, Jitesh Pradhan, Anil Kumar Singh, Hathiram Nenavath

Abstract: With IoT systems' increasing scale and complexity, maintenance of a large number of nodes using stationary devices is becoming increasingly difficult. Hence, mobile devices are being employed that can traverse through a set of target locations and provide the necessary services. In order to reduce energy consumption and time requirements, the devices are required to traverse following a Hamiltonian path. This problem can be formulated as a Travelling Salesman Problem (TSP), an NP-hard problem. Moreover, in emergency services, the devices must traverse in real-time, demanding speedy path planning from the TSP instance. Among the well-known optimization techniques for solving the TSP, Ant Colony Optimization (ACO) has a strong record of providing good approximate solutions. Moreover, ACO not only provides near-optimal solutions for TSP instances but can also output optimal or near-optimal solutions for many other demanding hard optimization problems. However, to have a fast solution, the next node selection, which needs to consider all the neighbors for each selection, becomes a bottleneck in the path formation step. Moreover, classical computers are constrained to generate only pseudorandom numbers. Both these problems can be solved using quantum computing techniques, i.e., the next node can be selected with proper randomization, respecting the provided set of probabilities in just a single execution and single measurement of a quantum circuit. Simulation results of the proposed Hybrid Quantum Ant Colony Optimization algorithm on several TSP instances are promising, suggesting that the proposed work could be important in implementing real-time path planning in quantum-enabled mobile devices.

Submitted 25 October, 2023; originally announced October 2023.

arXiv:2308.11239 [pdf, other]

LOCATE: Self-supervised Object Discovery via Flow-guided Graph-cut and Bootstrapped Self-training

Authors: Silky Singh, Shripad Deshmukh, Mausoom Sarkar, Balaji Krishnamurthy

Abstract: Learning object segmentation in image and video datasets without human supervision is a challenging problem. Humans easily identify moving salient objects in videos using the gestalt principle of common fate, which suggests that what moves together belongs together. Building upon this idea, we propose a self-supervised object discovery approach that leverages motion and appearance information to produce high-quality object segmentation masks. Specifically, we redesign the traditional graph cut on images to include motion information in a linear combination with appearance information to produce edge weights. Remarkably, this step produces object segmentation masks comparable to the current state-of-the-art on multiple benchmarks. To further improve performance, we bootstrap a segmentation network trained on these preliminary masks as pseudo-ground truths to learn from its own outputs via self-training. We demonstrate the effectiveness of our approach, named LOCATE, on multiple standard video object segmentation, image saliency detection, and object segmentation benchmarks, achieving results on par with and, in many cases, surpassing state-of-the-art methods. We also demonstrate the transferability of our approach to novel domains through a qualitative study on in-the-wild images. Additionally, we present extensive ablation analysis to support our design choices and highlight the contribution of each component of our proposed method.

Submitted 2 December, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

Comments: Accepted to British Machine Vision Conference (BMVC) 2023

arXiv:2307.04392 [pdf, other]

FODVid: Flow-guided Object Discovery in Videos

Authors: Silky Singh, Shripad Deshmukh, Mausoom Sarkar, Rishabh Jain, Mayur Hemani, Balaji Krishnamurthy

Abstract: Segmentation of objects in a video is challenging due to nuances such as motion blurring, parallax, occlusions, changes in illumination, etc. Instead of addressing these nuances separately, we focus on building a generalizable solution that avoids overfitting to the individual intricacies. Such a solution would also help us save enormous resources involved in human annotation of video corpora. To solve Video Object Segmentation (VOS) in an unsupervised setting, we propose a new pipeline (FODVid) based on the idea of guiding segmentation outputs using flow-guided graph-cut and temporal consistency. Specifically, we design a segmentation model incorporating intra-frame appearance and flow similarities, and inter-frame temporal continuation of the objects under consideration. We perform an extensive experimental analysis of our straightforward methodology on the standard DAVIS16 video benchmark. Though simple, our approach produces results comparable (within a range of ~2 mIoU) to the existing top approaches in unsupervised VOS. The simplicity and effectiveness of our technique open up new avenues for research in the video domain.

Submitted 10 July, 2023; originally announced July 2023.

Comments: CVPR 2023 (L3D-IVU workshop)

arXiv:2306.16503 [pdf, other]

SARC: Soft Actor Retrospective Critic

Authors: Sukriti Verma, Ayush Chopra, Jayakumar Subramanian, Mausoom Sarkar, Nikaash Puri, Piyush Gupta, Balaji Krishnamurthy

Abstract: The two-time scale nature of SAC, which is an actor-critic algorithm, is characterised by the fact that the critic estimate has not converged for the actor at any given time, but since the critic learns faster than the actor, it ensures eventual consistency between the two. Various strategies have been introduced in the literature to learn better gradient estimates to help achieve better convergence. Since gradient estimates depend upon the critic, we posit that improving the critic can provide a better gradient estimate for the actor at each time. Utilizing this, we propose Soft Actor Retrospective Critic (SARC), where we augment the SAC critic loss with another loss term - retrospective loss - leading to faster critic convergence and, consequently, better policy gradient estimates for the actor. An existing implementation of SAC can be easily adapted to SARC with minimal modifications. Through extensive experimentation and analysis, we show that SARC provides consistent improvement over SAC on benchmark environments. We plan to open-source the code and all experiment data at: https://github.com/sukritiverma1996/SARC.

Submitted 28 June, 2023; originally announced June 2023.

Comments: Accepted at RLDM 2022

arXiv:2306.15852 [pdf, other]

Action-conditioned Deep Visual Prediction with RoAM, a new Indoor Human Motion Dataset for Autonomous Robots

Authors: Meenakshi Sarkar, Vinayak Honkote, Dibyendu Das, Debasish Ghose

Abstract: With the increasing adoption of robots across industries, it is crucial to focus on developing advanced algorithms that enable robots to anticipate, comprehend, and plan their actions effectively in collaboration with humans. We introduce the Robot Autonomous Motion (RoAM) video dataset, which is collected with a custom-made TurtleBot3 Burger robot in a variety of indoor environments recording various human motions from the robot's ego-vision. The dataset also includes synchronized records of the LiDAR scan and all control actions taken by the robot as it navigates around static and moving human agents. The unique dataset provides an opportunity to develop and benchmark new visual prediction frameworks that can predict future image frames based on the action taken by the recording agent in partially observable scenarios or cases where the imaging sensor is mounted on a moving platform. We have benchmarked the dataset on our novel deep visual prediction framework called ACPNet, where the approximated future image frames are also conditioned on the action taken by the robot, and demonstrated its potential for incorporating robot dynamics into the video prediction paradigm for mobile robotics and autonomous navigation research.

Submitted 27 June, 2023; originally announced June 2023.

arXiv:2305.17523 [pdf]

A Comparative Analysis of Portfolio Optimization Using Mean-Variance, Hierarchical Risk Parity, and Reinforcement Learning Approaches on the Indian Stock Market

Authors: Jaydip Sen, Aditya Jaiswal, Anshuman Pathak, Atish Kumar Majee, Kushagra Kumar, Manas Kumar Sarkar, Soubhik Maji

Abstract: This paper presents a comparative analysis of the performances of three portfolio optimization approaches. The three approaches considered in this work are the mean-variance portfolio (MVP), hierarchical risk parity (HRP) portfolio, and reinforcement learning-based portfolio. The portfolios are trained and tested on data from several stocks, and their performances are compared on their annual returns, annual risks, and Sharpe ratios. In the reinforcement learning-based portfolio design approach, the deep Q-learning technique has been utilized. Due to the large number of possible states, the construction of the Q-table is done using a deep neural network. The historical prices of the 50 premier stocks from the Indian stock market, known as the NIFTY50 stocks, and several stocks from 10 important sectors of the Indian stock market are used to create the environment for training the agent.

Submitted 27 May, 2023; originally announced May 2023.

Comments: The report is 52 pages long. It is based on the capstone project done in the post graduate course of data science in Praxis Business School, Kolkata, India, of the Autumn Batch, 2022

arXiv:2303.15122 [pdf, other]

Parameter Efficient Local Implicit Image Function Network for Face Segmentation

Authors: Mausoom Sarkar, Nikitha SR, Mayur Hemani, Rishabh Jain, Balaji Krishnamurthy

Abstract: Face parsing is defined as the per-pixel labeling of images containing human faces. The labels are defined to identify key facial regions like eyes, lips, nose, hair, etc. In this work, we make use of the structural consistency of the human face to propose a lightweight face-parsing method using a Local Implicit Function network, FP-LIIF. We propose a simple architecture having a convolutional encoder and a pixel MLP decoder that uses 1/26th the number of parameters compared to the state-of-the-art models and yet matches or outperforms state-of-the-art models on multiple datasets, like CelebAMask-HQ and LaPa. We do not use any pretraining, and compared to other works, our network can also generate segmentation at different resolutions without any changes in the input resolution. This work enables the use of facial segmentation on low-compute or low-bandwidth devices because of its higher FPS and smaller model size.

Submitted 27 March, 2023; originally announced March 2023.

Comments: Accepted at CVPR 2023

arXiv:2302.07089 [pdf]

Novel Design of Quantum Circuits for Representation of Grayscale Images

Authors: Mayukh Sarkar

Abstract: The advent of quantum computing has influenced researchers around the world to solve multitudes of computational problems with this promising technology. The feasibility of solutions for computational problems, and of the representation of various kinds of information, may allow quantum computing to replace classical computers in the near future. One such challenge is the representation of digital images in a quantum computer. Several works have attempted to make this possible. One such promising technique, named Quantum Probability Image Encoding, requires a minimal number of qubits, where the intensity of n pixels is represented as the statevector of log_2(n) qubits. Though there exist quantum circuit design techniques to obtain an arbitrary statevector, they consider statevectors in general Hilbert space. For image data, however, considering only real vector space is sufficient, which may constrain the circuit to a smaller gate set and possibly reduce the number of gates required. In this paper, the construction of such quantum circuits is proposed.

Submitted 9 February, 2023; originally announced February 2023.

arXiv:2301.06928 [pdf, other]

Towards Estimating Transferability using Hard Subsets

Authors: Tarun Ram Menta, Surgan Jandial, Akash Patil, Vimal KB, Saketh Bachu, Balaji Krishnamurthy, Vineeth N. Balasubramanian, Chirag Agarwal, Mausoom Sarkar

Abstract: As transfer learning techniques are increasingly used to transfer knowledge from the source model to the target task, it becomes important to quantify which source models are suitable for a given target task without performing computationally expensive fine-tuning. In this work, we propose HASTE (HArd Subset TransfErability), a new strategy to estimate the transferability of a source model to a particular target task using only a harder subset of target data. By leveraging the internal and output representations of the model, we introduce two techniques, one class-agnostic and another class-specific, to identify harder subsets and show that HASTE can be used with any existing transferability metric to improve its reliability. We further analyze the relation between HASTE and the optimal average log likelihood as well as negative conditional entropy and empirically validate our theoretical bounds. Our experimental results across multiple source model architectures, target datasets, and transfer learning tasks show that HASTE-modified metrics are consistently better than or on par with the state-of-the-art transferability metrics.

Submitted 17 January, 2023; originally announced January 2023.

Comments: First three authors contributed equally

arXiv:2211.10157 [pdf, other]

UMFuse: Unified Multi View Fusion for Human Editing applications

Authors: Rishabh Jain, Mayur Hemani, Duygu Ceylan, Krishna Kumar Singh, Jingwan Lu, Mausoom Sarkar, Balaji Krishnamurthy

Abstract: Numerous pose-guided human editing methods have been explored by the vision community due to their extensive practical applications. However, most of these methods still use an image-to-image formulation in which a single image is given as input to produce an edited image as output. This objective becomes ill-defined in cases when the target pose differs significantly from the input pose. Existing methods then resort to in-painting or style transfer to handle occlusions and preserve content. In this paper, we explore the utilization of multiple views to minimize the issue of missing information and generate an accurate representation of the underlying human model. To fuse knowledge from multiple viewpoints, we design a multi-view fusion network that takes the pose key points and texture from multiple source images and generates an explainable per-pixel appearance retrieval map. Thereafter, the encodings from a separate network (trained on a single-view human reposing task) are merged in the latent space. This enables us to generate accurate, precise, and visually coherent images for different editing tasks. We show the application of our network on two newly proposed tasks - Multi-view human reposing and Mix&Match Human Image generation. Additionally, we study the limitations of single-view editing and scenarios in which multi-view provides a better alternative.

Submitted 28 March, 2023; v1 submitted 17 November, 2022; originally announced November 2022.

Comments: 8 pages, 6 figures

ACM Class: I.4; I.5

arXiv:2211.08540 [pdf, other]

VGFlow: Visibility guided Flow Network for Human Reposing

Authors: Rishabh Jain, Krishna Kumar Singh, Mayur Hemani, Jingwan Lu, Mausoom Sarkar, Duygu Ceylan, Balaji Krishnamurthy

Abstract: The task of human reposing involves generating a realistic image of a person standing in an arbitrary conceivable pose. There are multiple difficulties in generating perceptually accurate images, and existing methods suffer from limitations in preserving texture, maintaining pattern coherence, respecting cloth boundaries, handling occlusions, manipulating skin generation, etc. These difficulties are further exacerbated by the fact that the possible space of pose orientation for humans is large and variable, the nature of clothing items is highly non-rigid, and the diversity in body shape differs largely among the population. To alleviate these difficulties and synthesize perceptually accurate images, we propose VGFlow. Our model uses a visibility-guided flow module to disentangle the flow into visible and invisible parts of the target for simultaneous texture preservation and style manipulation. Furthermore, to tackle distinct body shapes and avoid network artifacts, we also incorporate a self-supervised patch-wise "realness" loss to improve the output. VGFlow achieves state-of-the-art results as observed qualitatively and quantitatively on different image quality metrics (SSIM, LPIPS, FID).

Submitted 28 March, 2023; v1 submitted 13 November, 2022; originally announced November 2022.

Comments: Selected for publication in CVPR2023

ACM Class: I.4; I.5

arXiv:2209.06584 [pdf, other]

One-Shot Doc Snippet Detection: Powering Search in Document Beyond Text

Authors: Abhinav Java, Shripad Deshmukh, Milan Aggarwal, Surgan Jandial, Mausoom Sarkar, Balaji Krishnamurthy

Abstract: Active consumption of digital documents has yielded scope for research in various applications, including search. Traditionally, searching within a document has been cast as a text matching problem ignoring the rich layout and visual cues commonly present in structured documents, forms, etc. To that end, we ask a mostly unexplored question: "Can we search for other similar snippets present in a target document page given a single query instance of a document snippet?". We propose MONOMER to solve this as a one-shot snippet detection task. MONOMER fuses context from visual, textual, and spatial modalities of snippets and documents to find query snippet in target documents. We conduct extensive ablations and experiments showing MONOMER outperforms several baselines from one-shot object detection (BHRL), template matching, and document understanding (LayoutLMv3). Due to the scarcity of relevant data for the task at hand, we train MONOMER on programmatically generated data having many visually similar query snippets and target document pairs from two datasets - Flamingo Forms and PubLayNet. We also do a human study to validate the generated data.

Submitted 12 September, 2022; originally announced September 2022.

arXiv:2207.02964 [pdf, other]

Mitigating shortage of labeled data using clustering-based active learning with diversity exploration

Authors: Xuyang Yan, Shabnam Nazmi, Biniam Gebru, Mohd Anwar, Abdollah Homaifar, Mrinmoy Sarkar, Kishor Datta Gupta

Abstract: In this paper, we proposed a new clustering-based active learning framework, namely Active Learning using a Clustering-based Sampling (ALCS), to address the shortage of labeled data. ALCS employs a density-based clustering approach to explore the cluster structure from the data without requiring exhaustive parameter tuning. A bi-cluster boundary-based sample query procedure is introduced to improve the learning performance for classifying highly overlapped classes. Additionally, we developed an effective diversity exploration strategy to address the redundancy among queried samples. Our experimental results justified the efficacy of the ALCS approach.

Submitted 6 July, 2022; originally announced July 2022.

Comments: Accepted by the ICML 2022 Workshop on Adaptive Experimental Design and Active Learning in the Real World

arXiv:2206.05584 [pdf, other]

doi 10.1109/ICPS48983.2019.9067584

Modeling and Optimization of a Longitudinally-Distributed Global Solar Grid

Authors: Harsh Vardhan, Neal M Sarkar, Himanshu Neema

Abstract: Our simulation-based experiments are aimed to demonstrate a use case on the feasibility of fulfillment of global energy demand by primarily relying on solar energy through the integration of a longitudinally-distributed grid. These experiments demonstrate the availability of simulation technologies, good approximation models of grid components, and data for simulation. We also experimented with integrating different tools to create realistic simulations as we are currently developing a detailed tool-chain for experimentation. These experiments consist of a network of model houses at different locations in the world, each producing and consuming only solar energy. The model includes houses, various appliances, appliance usage schedules, regional weather information, floor area, HVAC systems, population, number of houses in the region, and other parameters to imitate a real-world scenario. Data gathered from the power system simulation is used to develop optimization models to find the optimal solar panel area required at the different locations to satisfy energy demands in different scenarios.

Submitted 11 June, 2022; originally announced June 2022.

arXiv:2203.13721 [pdf, other]

Salt Detection Using Segmentation of Seismic Image

Authors: Mrinmoy Sarkar

Abstract: In this project, a state-of-the-art deep convolution neural network (DCNN) is presented to segment seismic images for salt detection below the earth's surface. Detection of salt location is very important for starting mining. Hence, a seismic image is used to detect the exact salt location under the earth's surface. However, precisely detecting the exact location of salt deposits is difficult. Therefore, professional seismic imaging still requires expert human interpretation of salt bodies. This leads to very subjective, highly variable renderings. Hence, to create the most accurate seismic images and 3D renderings, we need a robust algorithm that automatically and accurately identifies whether a surface target is salt or not. Since the performance of DCNN is well-known and well-established for object recognition in images, DCNN is a very good choice for this particular problem and is successfully applied to a dataset of seismic images in which each pixel is labeled as salt or not. The result of this algorithm is promising.

Submitted 25 March, 2022; originally announced March 2022.

arXiv:2111.13974 [pdf, other]

Exploring Transformer Based Models to Identify Hate Speech and Offensive Content in English and Indo-Aryan Languages

Authors: Somnath Banerjee, Maulindu Sarkar, Nancy Agrawal, Punyajoy Saha, Mithun Das

Abstract: Hate speech is considered to be one of the major issues currently plaguing online social media. Repeated exposure to hate speech has been shown to create physiological effects on the target users. Thus, hate speech, in all its forms, should be addressed on these platforms in order to maintain good health. In this paper, we explored several Transformer based machine learning models for the detection of hate speech and offensive content in English and Indo-Aryan languages at FIRE 2021. We explore several models, such as mBERT, XLMR-large, and XLMR-base, under the team name "Super Mario". Our models placed 2nd on the Code-Mixed data set (Macro F1: 0.7107), 2nd in Hindi two-class classification (Macro F1: 0.7797), 4th in the English four-class category (Macro F1: 0.8006), and 12th in the English two-class category (Macro F1: 0.6447).

Submitted 27 November, 2021; originally announced November 2021.

Comments: Accepted in FIRE'21 (Track HASOC - English and Indo-Aryan Languages)

arXiv:2111.13157 [pdf, other]

DA$^{\textbf{2}}$-Net : Diverse & Adaptive Attention Convolutional Neural Network

Authors: Abenezer Girma, Abdollah Homaifar, M Nabil Mahmoud, Xuyang Yan, Mrinmoy Sarkar

Abstract: Standard Convolutional Neural Network (CNN) designs rarely focus on the importance of explicitly capturing diverse features to enhance the network's performance. Instead, most existing methods follow an indirect approach of increasing or tuning the networks' depth and width, which in many cases significantly increases the computational cost. Inspired by a biological visual system, we propose a Diverse and Adaptive Attention Convolutional Network (DA$^{2}$-Net), which enables any feed-forward CNNs to explicitly capture diverse features and adaptively select and emphasize the most informative features to efficiently boost the network's performance. DA$^{2}$-Net incurs negligible computational overhead and it is designed to be easily integrated with any CNN architecture. We extensively evaluated DA$^{2}$-Net on benchmark datasets, including CIFAR100, SVHN, and ImageNet, with various CNN architectures. The experimental results show DA$^{2}$-Net provides a significant performance improvement with very minimal computational overhead.

Submitted 25 November, 2021; originally announced November 2021.

arXiv:2111.11692 [pdf, other]

Status-quo policy gradient in Multi-Agent Reinforcement Learning

Authors: Pinkesh Badjatiya, Mausoom Sarkar, Nikaash Puri, Jayakumar Subramanian, Abhishek Sinha, Siddharth Singh, Balaji Krishnamurthy

Abstract: Individual rationality, which involves maximizing expected individual returns, does not always lead to high-utility individual or group outcomes in multi-agent problems. For instance, in multi-agent social dilemmas, Reinforcement Learning (RL) agents trained to maximize individual rewards converge to a low-utility mutually harmful equilibrium. In contrast, humans evolve useful strategies in such social dilemmas. Inspired by ideas from human psychology that attribute this behavior to the status-quo bias, we present a status-quo loss (SQLoss) and the corresponding policy gradient algorithm that incorporates this bias in an RL agent. We demonstrate that agents trained with SQLoss learn high-utility policies in several social dilemma matrix games (Prisoner's Dilemma, Stag Hunt matrix variant, Chicken Game). We show how SQLoss outperforms existing state-of-the-art methods to obtain high-utility policies in visual input non-matrix games (Coin Game and Stag Hunt visual input variant) using pre-trained cooperation and defection oracles. Finally, we show that SQLoss extends to a 4-agent setting by demonstrating the emergence of cooperative behavior in the popular Braess' paradox.

Submitted 23 November, 2021; originally announced November 2021.

arXiv:2111.08169 [pdf, other]

A Supervised Feature Selection Method For Mixed-Type Data using Density-based Feature Clustering

Authors: Xuyang Yan, Mrinmoy Sarkar, Biniam Gebru, Shabnam Nazmi, Abdollah Homaifar

Abstract: Feature selection methods are widely used to address the high computational overheads and curse of dimensionality in classifying high-dimensional data. Most conventional feature selection methods focus on handling homogeneous features, while real-world datasets usually have a mixture of continuous and discrete features. Some recent mixed-type feature selection studies only select features with high relevance to class labels and ignore the redundancy among features. The determination of an appropriate feature subset is also a challenge. In this paper, a supervised feature selection method using density-based feature clustering (SFSDFC) is proposed to obtain an appropriate final feature subset for mixed-type data. SFSDFC decomposes the feature space into a set of disjoint feature clusters using a novel density-based clustering method. Then, an effective feature selection strategy is employed to obtain a subset of important features with minimal redundancy from those feature clusters. Extensive experiments as well as comparison studies with five state-of-the-art methods are conducted on SFSDFC using thirteen real-world benchmark datasets and results justify the efficacy of the SFSDFC method.

Submitted 10 November, 2021; originally announced November 2021.

Comments: 6 pages, 3 figures, 4 tables, accepted by the IEEE SMC 2021

arXiv:2111.05413 [pdf, other]

A Framework for eVTOL Performance Evaluation in Urban Air Mobility Realm

Authors: Mrinmoy Sarkar, Xuyang Yan, Abenezer Girma, Abdollah Homaifar

Abstract: In this paper, we developed a generalized simulation framework for the evaluation of electric vertical takeoff and landing vehicles (eVTOLs) in the context of Unmanned Aircraft Systems (UAS) Traffic Management (UTM) and under the concept of Urban Air Mobility (UAM). Unlike most existing studies, the proposed framework combines the utilization of UTM and eVTOLs to develop a realistic UAM testing platform. For this purpose, we first enhanced an existing UTM simulator to simulate the real-world UAM environment. Then, instead of using a simplified eVTOL model, a realistic eVTOL design tool, namely SUAVE, is employed and a dilation sub-module is introduced to bridge the gap between the UTM simulator and the SUAVE eVTOL performance evaluation tool to elaborate the complete mission profile. Based on the developed simulation framework, experiments are conducted and the results are presented to analyze the performance of eVTOLs in the UAM environment.

Submitted 9 November, 2021; originally announced November 2021.

Comments: 7 pages, 9 figures, Submitted to ICRA 2022 conference

arXiv:2109.03813 [pdf, other]

Video2Skill: Adapting Events in Demonstration Videos to Skills in an Environment using Cyclic MDP Homomorphisms

Authors: Sumedh A Sontakke, Sumegh Roychowdhury, Mausoom Sarkar, Nikaash Puri, Balaji Krishnamurthy, Laurent Itti

Abstract: Humans excel at learning long-horizon tasks from demonstrations augmented with textual commentary, as evidenced by the burgeoning popularity of tutorial videos online. Intuitively, this capability can be separated into 2 distinct subtasks - first, dividing a long-horizon demonstration sequence into semantically meaningful events; second, adapting such events into meaningful behaviors in one's own environment. Here, we present Video2Skill (V2S), which attempts to extend this capability to artificial agents by allowing a robot arm to learn from human cooking videos. We first use sequence-to-sequence Auto-Encoder style architectures to learn a temporal latent space for events in long-horizon demonstrations. We then transfer these representations to the robotic target domain, using a small amount of offline and unrelated interaction data (sequences of state-action pairs of the robot arm controlled by an expert) to adapt these events into actionable representations, i.e., skills. Through experiments, we demonstrate that our approach results in self-supervised analogy learning, where the agent learns to draw analogies between motions in human demonstration data and behaviors in the robotic environment. We also demonstrate the efficacy of our approach on model learning - demonstrating how Video2Skill utilizes prior knowledge from human demonstration to outperform traditional model learning of long-horizon dynamics. Finally, we demonstrate the utility of our approach for non-tabula rasa decision-making, i.e., utilizing video demonstration for zero-shot skill generation.

Submitted 9 September, 2021; v1 submitted 8 September, 2021; originally announced September 2021.

arXiv:2107.04419 [pdf, other]

Form2Seq : A Framework for Higher-Order Form Structure Extraction

Authors: Milan Aggarwal, Hiresh Gupta, Mausoom Sarkar, Balaji Krishnamurthy

Abstract: Document structure extraction has been a widely researched area for decades with recent works performing it as a semantic segmentation task over document images using fully-convolution networks. Such methods are limited by image resolution due to which they fail to disambiguate structures in dense regions which appear commonly in forms. To mitigate this, we propose Form2Seq, a novel sequence-to-sequence (Seq2Seq) inspired framework for structure extraction using text, with a specific focus on forms, which leverages relative spatial arrangement of structures. We discuss two tasks; 1) Classification of low-level constituent elements (TextBlock and empty fillable Widget) into ten types such as field captions, list items, and others; 2) Grouping lower-level elements into higher-order constructs, such as Text Fields, ChoiceFields and ChoiceGroups, used as information collection mechanism in forms. To achieve this, we arrange the constituent elements linearly in natural reading order, feed their spatial and textual representations to Seq2Seq framework, which sequentially outputs prediction of each element depending on the final task. We modify Seq2Seq for grouping task and discuss improvements obtained through cascaded end-to-end training of two tasks versus training in isolation. Experimental results show the effectiveness of our text-based approach achieving an accuracy of 90% on classification task and an F1 of 75.82, 86.01, 61.63 on groups discussed above respectively, outperforming segmentation baselines. Further, we show our framework achieves state-of-the-art results for table structure recognition on ICDAR 2013 dataset.

Submitted 9 July, 2021; originally announced July 2021.

Comments: This paper has been presented at EMNLP 2020

arXiv:2107.04396 [pdf, other]

Multi-Modal Association based Grouping for Form Structure Extraction

Authors: Milan Aggarwal, Mausoom Sarkar, Hiresh Gupta, Balaji Krishnamurthy

Abstract: Document structure extraction has been a widely researched area for decades. Recent work in this direction has been deep learning-based, mostly focusing on extracting structure using fully convolution NN through semantic segmentation. In this work, we present a novel multi-modal approach for form structure extraction. Given simple elements such as textruns and widgets, we extract higher-order structures such as TextBlocks, Text Fields, Choice Fields, and Choice Groups, which are essential for information collection in forms. To achieve this, we obtain a local image patch around each low-level element (reference) by identifying candidate elements closest to it. We process textual and spatial representation of candidates sequentially through a BiLSTM to obtain context-aware representations and fuse them with image patch features obtained by processing it through a CNN. Subsequently, the sequential decoder takes this fused feature vector to predict the association type between reference and candidates. These predicted associations are utilized to determine larger structures through connected components analysis. Experimental results show the effectiveness of our approach achieving a recall of 90.29%, 73.80%, 83.12%, and 52.72% for the above structures, respectively, outperforming semantic segmentation baselines significantly. We show the efficacy of our method through ablations, comparing it against using individual modalities. We also introduce our new rich human-annotated Forms Dataset.

Submitted 9 July, 2021; originally announced July 2021.

Comments: This work has been accepted and presented at WACV 2020

arXiv:2106.11823 [pdf, other]

A Clustering-based Framework for Classifying Data Streams

Authors: Xuyang Yan, Abdollah Homaifar, Mrinmoy Sarkar, Abenezer Girma, Edward Tunstel

Abstract: The non-stationary nature of data streams strongly challenges traditional machine learning techniques. Although some solutions have been proposed to extend traditional machine learning techniques for handling data streams, these approaches either require an initial label set or rely on specialized design parameters. The overlap among classes and the labeling of data streams constitute other major challenges for classifying data streams. In this paper, we proposed a clustering-based data stream classification framework to handle non-stationary data streams without utilizing an initial label set. A density-based stream clustering procedure is used to capture novel concepts with a dynamic threshold and an effective active label querying strategy is introduced to continuously learn the new concepts from the data streams. The sub-cluster structure of each cluster is explored to handle the overlap among classes. Experimental results and quantitative comparison studies reveal that the proposed method provides statistically better or comparable performance than the existing methods.

Submitted 22 June, 2021; originally announced June 2021.

Comments: This paper has been accepted by IJCAI 2021

arXiv:2104.09571 [pdf, ps, other]

Medium Access Strategies for Integrated Access and Backhaul at mmWaves Unlicensed Spectrum

Authors: Biswa P. S. Sahoo, Styabrata Swain, Hung-Yu Wei, Mahasweta Sarkar

Abstract: The unlicensed spectrum is recently considered one of the defining solutions to meet the steadily growing traffic demand. This, in turn, has led to the enhancement for LTE in Release-13 to enable Licensed-Assisted Access (LAA) operations. The design of the medium access control (MAC) protocol for the LAA system to harmonically coexist with the incumbent WLAN system operating in an unlicensed band is critical and challenging. In this paper, we consider an Integrated Access and Backhaul (IAB) system coexisting with a Wi-Fi network operating at millimeter-wave (mmWave) unlicensed spectrum, for which a listen-before-talk (LBT) based medium access mechanism is carefully designed. Additionally, we have considered an in-band system that supports both access and backhaul in a single node where the small-cell or the IAB nodes compete with the WiGig for medium access. We present comprehensive experimental results and give design insights based on the simulation results.

Submitted 22 March, 2021; originally announced April 2021.

Comments: 6 pages, 6 figures, conference paper, Accepted for publication in Wireless Telecommunications Symposium (WTS), San Francisco, USA, April 2021

arXiv:2010.02556 [pdf, other]

SHERLock: Self-Supervised Hierarchical Event Representation Learning

Authors: Sumegh Roychowdhury, Sumedh A. Sontakke, Nikaash Puri, Mausoom Sarkar, Milan Aggarwal, Pinkesh Badjatiya, Balaji Krishnamurthy, Laurent Itti

Abstract: Temporal event representations are an essential aspect of learning among humans. They allow for succinct encoding of the experiences we have through a variety of sensory inputs. Also, they are believed to be arranged hierarchically, allowing for an efficient representation of complex long-horizon experiences. Additionally, these representations are acquired in a self-supervised manner. Analogously, here we propose a model that learns temporal representations from long-horizon visual demonstration data and associated textual descriptions, without explicit temporal supervision. Our method produces a hierarchy of representations that align more closely with ground-truth human-annotated events (+15.3) than state-of-the-art unsupervised baselines. Our results are comparable to heavily-supervised baselines in complex visual domains such as Chess Openings, YouCook2 and TutorialVQA datasets. Finally, we perform ablation studies illustrating the robustness of our approach. We release our code and demo visualizations in the Supplementary Material.

Submitted 22 August, 2022; v1 submitted 6 October, 2020; originally announced October 2020.

Comments: Accepted at ICPR '22

arXiv:2009.01485 [pdf, other]

SAC: Semantic Attention Composition for Text-Conditioned Image Retrieval

Authors: Surgan Jandial, Pinkesh Badjatiya, Pranit Chawla, Ayush Chopra, Mausoom Sarkar, Balaji Krishnamurthy

Abstract: The ability to efficiently search for images is essential for improving the user experiences across various products. Incorporating user feedback, via multi-modal inputs, to navigate visual search can help tailor retrieved results to specific user queries. We focus on the task of text-conditioned image retrieval that utilizes support text feedback alongside a reference image to retrieve images that concurrently satisfy constraints imposed by both inputs. The task is challenging since it requires learning composite image-text features by incorporating multiple cross-granular semantic edits from text feedback and then applying the same to visual features. To address this, we propose a novel framework SAC which resolves the above in two major steps: "where to see" (Semantic Feature Attention) and "how to change" (Semantic Feature Modification). We systematically show how our architecture streamlines the generation of text-aware image features by removing the need for various modules required by other state-of-art techniques. We present extensive quantitative, qualitative analysis, and ablation studies, to show that our architecture SAC outperforms existing techniques by achieving state-of-the-art performance on 3 benchmark datasets: FashionIQ, Shoes, and Birds-to-Words, while supporting natural language feedback of varying lengths.

Submitted 19 October, 2021; v1 submitted 3 September, 2020; originally announced September 2020.

Comments: Surgan Jandial, Pinkesh Badjatiya, Pranit Chawla, and Ayush Chopra contributed equally to this work. Work accepted at WACV 2022

arXiv:2006.13593 [pdf, other]

Retrospective Loss: Looking Back to Improve Training of Deep Neural Networks

Authors: Surgan Jandial, Ayush Chopra, Mausoom Sarkar, Piyush Gupta, Balaji Krishnamurthy, Vineeth Balasubramanian

Abstract: Deep neural networks (DNNs) are powerful learning machines that have enabled breakthroughs in several domains. In this work, we introduce a new retrospective loss to improve the training of deep neural network models by utilizing the prior experience available in past model states during training. Minimizing the retrospective loss, along with the task-specific loss, pushes the parameter state at the current training step towards the optimal parameter state while pulling it away from the parameter state at a previous training step. Although it is a simple idea, we analyze the method and conduct comprehensive sets of experiments across domains - images, speech, text, and graphs - to show that the proposed loss results in improved performance across input domains, tasks, and architectures.

Submitted 24 June, 2020; originally announced June 2020.

Comments: Accepted at KDD 2020; The first two authors contributed equally

arXiv:2001.05458 [pdf, other]

Inducing Cooperative behaviour in Sequential-Social dilemmas through Multi-Agent Reinforcement Learning using Status-Quo Loss

Authors: Pinkesh Badjatiya, Mausoom Sarkar, Abhishek Sinha, Siddharth Singh, Nikaash Puri, Jayakumar Subramanian, Balaji Krishnamurthy

Abstract: In social dilemma situations, individual rationality leads to sub-optimal group outcomes. Several human engagements can be modeled as sequential (multi-step) social dilemmas. However, in contrast to humans, Deep Reinforcement Learning agents trained to optimize individual rewards in sequential social dilemmas converge to selfish, mutually harmful behavior. We introduce a status-quo loss (SQLoss) that encourages an agent to stick to the status quo, rather than repeatedly changing its policy. We show how agents trained with SQLoss evolve cooperative behavior in several social dilemma matrix games. To work with social dilemma games that have visual input, we propose GameDistill. GameDistill uses self-supervision and clustering to automatically extract cooperative and selfish policies from a social dilemma game. We combine GameDistill and SQLoss to show how agents evolve socially desirable cooperative behavior in the Coin Game.

Submitted 13 February, 2020; v1 submitted 15 January, 2020; originally announced January 2020.

arXiv:1912.09428 [pdf, other]

Location Forensics Analysis Using ENF Sequences Extracted from Power and Audio Recordings

Authors: Dhiman Chowdhury, Mrinmoy Sarkar

Abstract: Electrical network frequency (ENF) is the signature of a power distribution grid which represents the nominal frequency (50 or 60 Hz) of a power system network. Due to load variations in a power grid, ENF sequences experience fluctuations. These ENF variations are inherently located in a multimedia signal which is recorded close to the grid or directly from the mains power line. Therefore, a multimedia recording can be localized by analyzing the ENF sequences of that signal in the absence of the concurrent power signal. In this paper, a novel approach to analyze location forensics using ENF sequences extracted from a number of power and audio recordings is proposed. The digital recordings are collected from different grid locations around the world. Potential feature components are determined from the ENF sequences. Then, a multi-class support vector machine (SVM) classification model is developed to validate the location authenticity of the recordings. The performance assessments affirm the efficacy of the presented work.

Submitted 18 December, 2019; originally announced December 2019.

Comments: 5 pages, 5 figures, conference paper

arXiv:1911.12170 [pdf, other]

Document Structure Extraction using Prior based High Resolution Hierarchical Semantic Segmentation

Authors: Mausoom Sarkar, Milan Aggarwal, Arneh Jain, Hiresh Gupta, Balaji Krishnamurthy

Abstract: Structure extraction from document images has been a long-standing research topic due to its high impact on a wide range of practical applications. In this paper, we share our findings on employing a hierarchical semantic segmentation network for this task of structure extraction. We propose a prior based deep hierarchical CNN network architecture that enables document structure extraction using very high resolution (1800 x 1000) images. We divide the document image into overlapping horizontal strips such that the network segments a strip and uses its prediction mask as prior for predicting the segmentation of the subsequent strip. We perform experiments establishing the effectiveness of our strip based network architecture through ablation methods and comparison with low-resolution variations. Further, to demonstrate our network's capabilities, we train it on only one type of document (Forms) and achieve state-of-the-art results over other general document datasets. We introduce our new human-annotated forms dataset and show that our method significantly outperforms different segmentation baselines on this dataset in extracting hierarchical structures. Our method is currently being used in Adobe's AEM Forms for automated conversion of paper and PDF forms to modern HTML based forms.

Submitted 17 September, 2020; v1 submitted 27 November, 2019; originally announced November 2019.

Comments: This work has been accepted at ECCV 2020

arXiv:1909.00237 [pdf, ps, other]

Triclustering of Gene Expression Microarray Data Using Coarse-Grained Parallel Genetic Algorithm

Authors: Shubhankar Mohapatra, Moumita Sarkar, Anjali Mohapatra, Bhawani Sankar Biswal

Abstract: Microarray data analysis is one of the major areas of research in the field of computational biology. Numerous techniques like clustering and biclustering are often applied to microarray data to extract meaningful outcomes which play key roles in practical healthcare affairs like disease identification, drug discovery, etc. But these techniques become obsolete when time is considered as another factor for evaluation in such data. This problem motivates the use of a triclustering method on gene expression 3D microarray data. In this article, a new methodology based on a coarse-grained parallel genetic approach is proposed to locate meaningful triclusters in gene expression data. The outcomes are quite impressive as they are more effective compared to traditional state-of-the-art genetic approaches previously applied for triclustering of 3D GCT microarray data.

Submitted 31 August, 2019; originally announced September 2019.

Journal ref: Springer Lecture Notes in Networks and Systems 2016 - 2020

arXiv:1906.10182 [pdf, other]

Planning Robot Motion using Deep Visual Prediction

Authors: Meenakshi Sarkar, Prabhu Pradhan, Debasish Ghose

Abstract: In this paper, we introduce a novel framework that can learn to make visual predictions about the motion of a robotic agent from raw video frames. Our proposed motion prediction network (PROM-Net) can learn in a completely unsupervised manner and efficiently predict up to 10 frames in the future. Moreover, unlike other motion prediction models, it is lightweight, and once trained it can be easily implemented on mobile platforms that have very limited computing capabilities. We have created a new robotic data set comprising LEGO Mindstorms moving along various trajectories in three different environments under different lighting conditions for testing and training the network. Finally, we introduce a framework that would use the predicted frames from the network as an input to a model predictive controller for motion planning in unknown dynamic environments with moving obstacles.

Submitted 24 June, 2019; originally announced June 2019.

Comments: 7th ICAPS Workshop on Planning and Robotics (PlanRob), 2019

arXiv:1810.05394 [pdf, other]

Sequential Learning of Movement Prediction in Dynamic Environments using LSTM Autoencoder

Authors: Meenakshi Sarkar, Debasish Ghose

Abstract: Predicting the movement of objects while the action of the learning agent interacts with the dynamics of the scene still remains a key challenge in robotics. We propose a multi-layer Long Short Term Memory (LSTM) autoencoder network that predicts future frames for a robot navigating in a dynamic environment with moving obstacles. The autoencoder network is composed of a state and action conditioned decoder network that reconstructs the future frames of video, conditioned on the action taken by the agent. The input image frames are first transformed into low dimensional feature vectors with a pre-trained encoder network and then reconstructed with the LSTM autoencoder network to generate the future frames. A virtual environment, based on the OpenAI-Gym framework for robotics, is used to gather training data and test the proposed network. The initial experiments show promising results indicating that these predicted frames can be used by an appropriate reinforcement learning framework in the future to navigate around dynamic obstacles.

Submitted 12 October, 2018; originally announced October 2018.

Comments: 4 pages

MSC Class: 68T05

arXiv:1808.09390 [pdf]

doi 10.5772/intechopen.70907

Internet of Things: Technology, Applications and Standardardization

Authors: Jaydip Sen, Moonkun Lee, Sunghyeon Lee, Yeongbok Choe, Menachem Domb, Arpan Pal, Hemant Kumar Rath, Samar Shailendra, Abhijan Bhattacharyya, Albena Mihovska, Mahasweta Sarkar, Hyun Jung Lee, Myungho Kim, Alexandru Averian

Abstract: The term "Internet of Things" (IoT) refers to an ecosystem of interconnected physical objects and devices that are accessible through the Internet and can communicate with each other. The main strength of the IoT vision is the high impact it has created and will continue to do so on several aspects of the everyday life and behavior of its potential users. This book presents some of the state-of-the-art research work in the field of the IoT, especially on the issues of communication protocols, interoperability of protocols and semantics, trust, security and privacy issues, reference architecture design, and standardization. It will be a valuable source of knowledge for researchers, engineers, practitioners, and graduate and doctoral students who are working in various fields of the IoT. It will also be useful for faculty members of graduate schools and universities.

Submitted 25 August, 2018; originally announced August 2018.

Comments: The book contains 137 pages. It is published by IntechOpen, London, United Kingdom in August 2018. Print ISBN 978-1-78923-548-7, Online ISBN 978-1-78923-549-4

arXiv:1805.00223 [pdf, other]

Localization: A Missing Link in the Pipeline of Object Matching and Registration

Authors: Deepak Mishra, Rajeev Ranjan, Santanu Chaudhury, Mukul Sarkar, Arvinder Singh Soin

Abstract: Image registration is a process of aligning two or more images of the same objects using geometric transformation. Most of the existing approaches work on the assumption of location invariance. These approaches require object-centric images to perform matching. Further, in the absence of intensity level symmetry between the corresponding points in two images, the learning based registration approaches rely on synthetic deformations, which often fail in real scenarios. To address these issues, a combination of convolutional neural networks (CNNs) to perform the desired registration is developed in this work. The complete objective is divided into three sub-objectives: object localization, segmentation and matching transformation. The object localization step establishes an initial correspondence between the images. A modified version of the single shot multi-box detector is used for this purpose. The detected region is cropped to make the images object-centric. Subsequently, the objects are segmented and matched using a spatial transformer network employing thin plate spline deformation. Initial experiments on MNIST and Caltech-101 datasets show that the proposed model is able to produce accurate matching. Quantitative evaluation performed using dice coefficient (DC) and mean intersection over union (mIoU) shows that the proposed method results in values of 79% and 66%, respectively, for the MNIST dataset and values of 94% and 90%, respectively, for the Caltech-101 dataset. The proposed framework is extended to the registration of CT and US images, which is free from any data specific assumptions and has better generalization capability as compared to the existing rule-based/classical approaches.

Submitted 11 January, 2019; v1 submitted 1 May, 2018; originally announced May 2018.

Comments: 11 pages, 6 figures

arXiv:1804.08454 [pdf, other]

Attention Based Natural Language Grounding by Navigating Virtual Environment

Authors: Akilesh B, Abhishek Sinha, Mausoom Sarkar, Balaji Krishnamurthy

Abstract: In this work, we focus on the problem of grounding language by training an agent to follow a set of natural language instructions and navigate to a target object in an environment. The agent receives visual information through raw pixels and a natural language instruction telling what task needs to be achieved and is trained in an end-to-end way. We develop an attention mechanism for multi-modal fusion of visual and textual modalities that allows the agent to learn to complete the task and achieve language grounding. Our experimental results show that our attention mechanism outperforms the existing multi-modal fusion mechanisms proposed for both 2D and 3D environments in order to solve the above-mentioned task in terms of both speed and success rate. We show that the learnt textual representations are semantically meaningful as they follow vector arithmetic in the embedding space. The effectiveness of our attention approach over the contemporary fusion mechanisms is also highlighted from the textual embeddings learnt by the different approaches. We also show that our model generalizes effectively to unseen scenarios and exhibits zero-shot generalization capabilities both in 2D and 3D environments. The code for our 2D environment as well as the models that we developed for both 2D and 3D are available at https://github.com/rl-lang-grounding/rl-lang-ground.

Submitted 21 December, 2018; v1 submitted 23 April, 2018; originally announced April 2018.

Comments: Accepted at WACV 2019. Also at NeurIPS 2017 workshop on Visually-Grounded Interaction and Language (ViGIL)

arXiv:1801.03318 [pdf, other]

Unsupervised Despeckling

Authors: Deepak Mishra, Santanu Chaudhury, Mukul Sarkar, Arvinder Singh Soin

Abstract: Contrast and quality of ultrasound images are adversely affected by the excessive presence of speckle. However, being an inherent imaging property, speckle helps in tissue characterization and tracking. Thus, despeckling of the ultrasound images requires the reduction of speckle extent without any oversmoothing. In this letter, we aim to address the despeckling problem using an unsupervised deep adversarial approach. A despeckling residual neural network (DRNN) is trained with an adversarial loss imposed by a discriminator. The discriminator tries to differentiate between the despeckled images generated by the DRNN and the set of high-quality images. Further, to prevent the developed DRNN from oversmoothing, a structural loss term is used along with the adversarial loss. Experimental evaluations show that the proposed DRNN is able to outperform the state-of-the-art despeckling approaches.

Submitted 10 January, 2018; originally announced January 2018.

arXiv:1705.02338 [pdf, other]

Exploiting OxRAM Resistive Switching for Dynamic Range Improvement of CMOS Image Sensors

Authors: Ashwani Kumar, Mukul Sarkar, Manan Suri

Abstract: We present a unique application of OxRAM devices in CMOS Image Sensors (CIS) for dynamic range (DR) improvement. We propose a modified 3T-APS (Active Pixel Sensor) circuit that incorporates OxRAM in 1T-1R configuration. DR improvement is achieved by resistive compression of the pixel output signal through autonomous programming of OxRAM device resistance during exposure. We show that by carefully preconditioning the OxRAM resistance, pixel DR can be enhanced. Detailed impact of OxRAM SET-to-RESET and RESET-to-SET transitions on pixel DR is discussed. For experimental validation with specific OxRAM preprogrammed states, a 4 Kb 10 nm thick HfOx (1T-1R) matrix was fabricated and characterized. Best case, relative pixel DR improvement of ~ 50 dB was obtained for our design.

Submitted 5 May, 2017; originally announced May 2017.

arXiv:1704.04959 [pdf, other]

Introspection: Accelerating Neural Network Training By Learning Weight Evolution

Authors: Abhishek Sinha, Mausoom Sarkar, Aahitagni Mukherjee, Balaji Krishnamurthy

Abstract: Neural Networks are function approximators that have achieved state-of-the-art accuracy in numerous machine learning tasks. In spite of their great success in terms of accuracy, their large training time makes it difficult to use them for various tasks. In this paper, we explore the idea of learning the weight evolution pattern from a simple network for accelerating training of novel neural networks. We use a neural network to learn the training pattern from MNIST classification and utilize it to accelerate training of neural networks used for CIFAR-10 and ImageNet classification. Our method has a low memory footprint and is computationally efficient. This method can also be used with other optimizers to give faster convergence. The results indicate a general trend in the weight evolution during training of neural networks.

Submitted 17 April, 2017; originally announced April 2017.

arXiv:1306.0089 [pdf]

doi 10.1145/2490302.2490304

A Novel Reconfigurable Architecture of a DSP Processor for Efficient Mapping of DSP Functions using Field Programmable DSP Arrays

Authors: Amitabha Sinha, Mitrava Sarkar, Soumojit Acharyya, Suranjan Chakraborty

Abstract: Development of modern integrated circuit technologies makes it feasible to develop cheaper, faster and smaller special purpose signal processing function circuits. Digital Signal Processing functions are generally implemented either on ASICs, which are inflexible, or on FPGAs, which suffer from a relatively smaller utilization factor or lower speed compared to ASICs. The Field Programmable DSP Array (FPDA) is the proposed DSP dedicated device, similar to an FPGA, but with basic fixed common modules (CMs) (like adders, subtractors, multipliers, scaling units, shifters) instead of CLBs. This paper introduces the development of a reconfigurable system architecture with a focus on FPDA that integrates different DSP functions like DFT, FFT, DCT, FIR, IIR, and DWT. Switching between DSP functions is achieved by reconfiguring the interconnection between CMs. Validation of the proposed architecture has been achieved on a Virtex5 FPGA. The architecture provides a sufficient amount of flexibility, parallelism and scalability.

Submitted 1 June, 2013; originally announced June 2013.

Comments: 8 Pages, 12 Figures, ACM SIGARCH Computer Architecture News. arXiv admin note: substantial text overlap with arXiv:1305.3251

MSC Class: 68R01

Journal ref: ACM SIGARCH Computer Architecture News, Volume 41 Issue 2, May 2013, Pages 1-8

arXiv:1305.3251 [pdf]

doi 10.5121/sipij.2013.4204

Field Programmable DSP Arrays - A Novel Reconfigurable Architecture for Efficient Realization of Digital Signal Processing Functions

Authors: Amitabha Sinha, Soumojit Acharyya, Suranjan Chakraborty, Mitrava Sarkar

Abstract: Digital Signal Processing functions are widely used in real time high speed applications. Those functions are generally implemented either on ASICs, which are inflexible, or on FPGAs, which suffer from a relatively smaller utilization factor or lower speed compared to ASICs. The proposed reconfigurable DSP processor is similar to an FPGA, but with basic fixed Common Modules (CMs) (like adders, subtractors, multipliers, scaling units, shifters) instead of CLBs. This paper introduces the development of a reconfigurable DSP processor that integrates different filter and transform functions. Switching between DSP functions is achieved by reconfiguring the interconnection between CMs. Validation of the proposed reconfigurable architecture has been achieved on a Virtex5 FPGA. The architecture provides a sufficient amount of flexibility, parallelism and scalability.

Submitted 13 May, 2013; originally announced May 2013.

Comments: 18 pages, 17 figures. This paper has been published in Signal & Image Processing : An International Journal (SIPIJ - AIRCC) Vol.4, No.2, April 2013. http://airccse.org/journal/sipij/current2013.html

Journal ref: Signal & Image Processing : An International Journal (SIPIJ - AIRCC) Vol.4, No.2, April 2013

Showing 1–47 of 47 results for author: Sarkar, M