Search | arXiv e-print repository

Knowledge-enhanced Relation Graph and Task Sampling for Few-shot Molecular Property Prediction

Authors: Zeyu Wang, Tianyi Jiang, Yao Lu, Xiaoze Bao, Shanqing Yu, Bin Wei, Qi Xuan

Abstract: Recently, few-shot molecular property prediction (FSMPP) has garnered increasing attention. Despite impressive breakthroughs achieved by existing methods, they often overlook the inherent many-to-many relationships between molecules and properties, which limits their performance. For instance, similar substructures of molecules can inspire the exploration of new compounds. Additionally, the relationships between properties can be quantified, with highly related properties providing more information for exploring the target property than weakly related ones. To this end, this paper proposes a novel meta-learning FSMPP framework (KRGTS), which comprises the Knowledge-enhanced Relation Graph module and the Task Sampling module. The knowledge-enhanced relation graph module constructs the molecule-property multi-relation graph (MPMRG) to capture the many-to-many relationships between molecules and properties. The task sampling module includes a meta-training task sampler and an auxiliary task sampler, responsible for scheduling the meta-training process and sampling highly related auxiliary tasks, respectively, thereby achieving efficient meta-knowledge learning and reducing noise introduction. Empirically, extensive experiments on five datasets demonstrate the superiority of KRGTS over a variety of state-of-the-art methods. The code is available at https://github.com/Vencent-Won/KRGTS-public.

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2404.14755 [pdf, other]

SkinGEN: an Explainable Dermatology Diagnosis-to-Generation Framework with Interactive Vision-Language Models

Authors: Bo Lin, Yingjing Xu, Xuanwen Bao, Zhou Zhao, Zuyong Zhang, Zhouyang Wang, Jie Zhang, Shuiguang Deng, Jianwei Yin

Abstract: With the continuous advancement of vision-language model (VLM) technology, remarkable research achievements have emerged in the dermatology field, the fourth most prevalent human disease category. However, despite these advancements, VLMs still face "hallucination" in dermatological diagnosis, and due to the inherent complexity of dermatological conditions, existing tools offer relatively limited support for user comprehension. We propose SkinGEN, a diagnosis-to-generation framework that leverages the stable diffusion (SD) method to generate reference demonstrations from diagnosis results provided by the VLM, thereby enhancing the visual explainability for users. Through extensive experiments with Low-Rank Adaptation (LoRA), we identify optimal strategies for skin condition image generation. We conduct a user study with 32 participants evaluating both the system performance and explainability. Results demonstrate that SkinGEN significantly improves users' comprehension of VLM predictions and fosters increased trust in the diagnostic process. This work paves the way for more transparent and user-centric VLM applications in dermatology and beyond.

Submitted 23 April, 2024; originally announced April 2024.

arXiv:2404.05673 [pdf, other]

CoReS: Orchestrating the Dance of Reasoning and Segmentation

Authors: Xiaoyi Bao, Siyang Sun, Shuailei Ma, Kecheng Zheng, Yuxin Guo, Guosheng Zhao, Yun Zheng, Xingang Wang

Abstract: The reasoning segmentation task, which demands a nuanced comprehension of intricate queries to accurately pinpoint object regions, is attracting increasing attention. However, Multi-modal Large Language Models (MLLMs) often find it difficult to accurately localize the objects described in complex reasoning contexts. We believe that the act of reasoning segmentation should mirror the cognitive stages of human visual search, where each step is a progressive refinement of thought toward the final object. Thus we introduce the Chains of Reasoning and Segmenting (CoReS) and find this top-down visual hierarchy indeed enhances the visual search process. Specifically, we propose a dual-chain structure that generates multi-modal, chain-like outputs to aid the segmentation process. Furthermore, to steer the MLLM's outputs into this intended hierarchy, we incorporate in-context inputs as guidance. Extensive experiments demonstrate the superior performance of our CoReS, which surpasses the state-of-the-art method by 6.5% on the ReasonSeg dataset. Project: https://chain-of-reasoning-and-segmentation.github.io/.

Submitted 10 July, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

Comments: Accepted at ECCV 2024

arXiv:2403.06845 [pdf, other]

DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation

Authors: Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Xinze Chen, Guan Huang, Xiaoyi Bao, Xingang Wang

Abstract: World models have demonstrated superiority in autonomous driving, particularly in the generation of multi-view driving videos. However, significant challenges still exist in generating customized driving videos. In this paper, we propose DriveDreamer-2, which builds upon the framework of DriveDreamer and incorporates a Large Language Model (LLM) to generate user-defined driving videos. Specifically, an LLM interface is initially incorporated to convert a user's query into agent trajectories. Subsequently, an HDMap, adhering to traffic regulations, is generated based on the trajectories. Ultimately, we propose the Unified Multi-View Model to enhance temporal and spatial coherence in the generated driving videos. DriveDreamer-2 is the first world model to generate customized driving videos; it can generate uncommon driving videos (e.g., vehicles abruptly cutting in) in a user-friendly manner. Besides, experimental results demonstrate that the generated videos enhance the training of driving perception methods (e.g., 3D detection and tracking). Furthermore, the video generation quality of DriveDreamer-2 surpasses other state-of-the-art methods, showcasing FID and FVD scores of 11.2 and 55.7, representing relative improvements of 30% and 50%.

Submitted 11 April, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

Comments: Project Page: https://drivedreamer2.github.io

arXiv:2403.01203 [pdf, other]

Pseudo-Label Calibration Semi-supervised Multi-Modal Entity Alignment

Authors: Luyao Wang, Pengnian Qi, Xigang Bao, Chunlai Zhou, Biao Qin

Abstract: Multi-modal entity alignment (MMEA) aims to identify equivalent entities between two multi-modal knowledge graphs for integration. Unfortunately, prior work has focused on improving the interaction and fusion of multi-modal information while overlooking the influence of modal-specific noise and the usage of labeled and unlabeled data in semi-supervised settings. In this work, we introduce Pseudo-label Calibration Multi-modal Entity Alignment (PCMEA), a semi-supervised approach. Specifically, in order to generate holistic entity representations, we first devise various embedding modules and attention mechanisms to extract visual, structural, relational, and attribute features. Different from the prior direct fusion methods, we next propose to exploit mutual information maximization to filter the modal-specific noise and to augment modal-invariant commonality. Then, we combine pseudo-label calibration with momentum-based contrastive learning to make full use of the labeled and unlabeled data, which improves the quality of pseudo-labels and pulls aligned entities closer. Finally, extensive experiments on two MMEA datasets demonstrate the effectiveness of our PCMEA, which yields state-of-the-art performance.

Submitted 2 March, 2024; originally announced March 2024.

Comments: Accepted by AAAI 2024

arXiv:2312.11570 [pdf, other]

Understanding the Multi-modal Prompts of the Pre-trained Vision-Language Model

Authors: Shuailei Ma, Chen-Wei Xie, Ying Wei, Siyang Sun, Jiaqi Fan, Xiaoyi Bao, Yuxin Guo, Yun Zheng

Abstract: Prompt learning has emerged as an efficient alternative for fine-tuning foundational models, such as CLIP, for various downstream tasks. However, there is no work that provides a comprehensive explanation for the working mechanism of the multi-modal prompts. In this paper, we conduct a direct analysis of the multi-modal prompts by asking the following questions: (i) How do the learned multi-modal prompts improve the recognition performance? (ii) What do the multi-modal prompts learn? To answer these questions, we begin by isolating the component of the formula where the prompt influences the calculation of self-attention at each layer in two distinct ways, i.e., (1) introducing prompt embeddings makes the [cls] token focus on foreground objects; (2) the prompts learn a bias term during the update of token embeddings, allowing the model to adapt to the target domain. Subsequently, we conduct extensive visualization and statistical experiments on eleven diverse downstream recognition datasets. From the experiments, we reveal that the learned prompts improve the performance mainly through the second way, which acts as the dataset bias to improve the recognition performance of the pre-trained model on the corresponding dataset. Meanwhile, we propose bias tuning to validate our finding. With a deeper understanding of the multi-modal prompt, we hope our work can inspire new and solid research in this direction.

Submitted 11 March, 2024; v1 submitted 17 December, 2023; originally announced December 2023.

Comments: We found that the statistical information in Figure 2 neglected the statistics for tSOS, so we made corrections. Additionally, we changed the statistical samples to those that CLIP misidentifies but prompt tuning identifies correctly. At the same time, we also revised some of the descriptions. The changes to the supplementary materials will be updated shortly. arXiv admin note: text overlap with arXiv:2307.06948 by other authors

arXiv:2312.06474 [pdf, other]

Relevant Intrinsic Feature Enhancement Network for Few-Shot Semantic Segmentation

Authors: Xiaoyi Bao, Jie Qin, Siyang Sun, Yun Zheng, Xingang Wang

Abstract: For few-shot semantic segmentation, the primary task is to extract class-specific intrinsic information from limited labeled data. However, the semantic ambiguity and inter-class similarity of previous methods limit the accuracy of pixel-level foreground-background classification. To alleviate these issues, we propose the Relevant Intrinsic Feature Enhancement Network (RiFeNet). To improve the semantic consistency of foreground instances, we propose an unlabeled branch as an efficient data utilization method, which teaches the model how to extract intrinsic features robust to intra-class differences. Notably, during testing, the proposed unlabeled branch is excluded without extra unlabeled data and computation. Furthermore, we extend the inter-class variability between foreground and background by proposing a novel multi-level prototype generation and interaction module. The different-grained complementarity between global and local prototypes allows for better distinction between similar categories. The qualitative and quantitative performance of RiFeNet surpasses the state-of-the-art methods on PASCAL-5i and COCO benchmarks.

Submitted 11 December, 2023; originally announced December 2023.

Comments: Accepted in AAAI 2024

arXiv:2310.04780 [pdf, other]

IPMix: Label-Preserving Data Augmentation Method for Training Robust Classifiers

Authors: Zhenglin Huang, Xiaoan Bao, Na Zhang, Qingqi Zhang, Xiaomei Tu, Biao Wu, Xi Yang

Abstract: Data augmentation has been proven effective for training high-accuracy convolutional neural network classifiers by preventing overfitting. However, building deep neural networks in real-world scenarios requires not only high accuracy on clean data but also robustness when data distributions shift. While prior methods have proposed that there is a trade-off between accuracy and robustness, we propose IPMix, a simple data augmentation approach to improve robustness without hurting clean accuracy. IPMix integrates three levels of data augmentation (image-level, patch-level, and pixel-level) into a coherent and label-preserving technique to increase the diversity of training data with limited computational overhead. To further improve the robustness, IPMix introduces structural complexity at different levels to generate more diverse images and adopts the random mixing method for multi-scale information fusion. Experiments demonstrate that IPMix outperforms state-of-the-art methods in corruption robustness on CIFAR-C and ImageNet-C. In addition, we show that IPMix also significantly improves the other safety measures, including robustness to adversarial perturbations, calibration, prediction consistency, and anomaly detection, achieving state-of-the-art or comparable results on several benchmarks, including ImageNet-R, ImageNet-A, and ImageNet-O.

Submitted 13 March, 2024; v1 submitted 7 October, 2023; originally announced October 2023.

Comments: NeurIPS 2023

arXiv:2308.12231 [pdf, other]

SPPNet: A Single-Point Prompt Network for Nuclei Image Segmentation

Authors: Qing Xu, Wenwei Kuang, Zeyu Zhang, Xueyao Bao, Haoran Chen, Wenting Duan

Abstract: Image segmentation plays an essential role in nuclei image analysis. Recently, the segment anything model has made a significant breakthrough in such tasks. However, the current model has two major issues for cell segmentation: (1) the image encoder of the segment anything model involves a large number of parameters, so retraining or even fine-tuning the model still requires expensive computational resources; (2) in point prompt mode, points are sampled from the center of the ground truth and more than one set of points is expected to achieve reliable performance, which is not efficient for practical applications. In this paper, a single-point prompt network is proposed for nuclei image segmentation, called SPPNet. We replace the original image encoder with a lightweight vision transformer. Also, an effective convolutional block is added in parallel to extract the low-level semantic information from the image and compensate for the performance degradation due to the small image encoder. We propose a new point-sampling method based on the Gaussian kernel. The proposed model is evaluated on the MoNuSeg-2018 dataset. The results demonstrate that SPPNet outperforms existing U-shape architectures and shows faster convergence in training. Compared to the segment anything model, SPPNet shows roughly 20 times faster inference, with 1/70 of the parameters and computational cost. Particularly, only one set of points is required in both the training and inference phases, which is more reasonable for clinical applications. The code for our work and more technical details can be found at https://github.com/xq141839/SPPNet.

Submitted 23 August, 2023; originally announced August 2023.

arXiv:2308.10155 [pdf, other]

Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation for Anomaly Detection

Authors: Guodong Wang, Yunhong Wang, Jie Qin, Dongming Zhang, Xiuguo Bao, Di Huang

Abstract: Anomaly detection (AD), aiming to find samples that deviate from the training distribution, is essential in safety-critical applications. Though recent self-supervised learning based attempts achieve promising results by creating virtual outliers, their training objectives are less faithful to AD which requires a concentrated inlier distribution as well as a dispersive outlier distribution. In this paper, we propose Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation (UniCon-HA), taking into account both the requirements above. Specifically, we explicitly encourage the concentration of inliers and the dispersion of virtual outliers via supervised and unsupervised contrastive losses, respectively. Considering that standard contrastive data augmentation for generating positive views may induce outliers, we additionally introduce a soft mechanism to re-weight each augmented inlier according to its deviation from the inlier distribution, to ensure a purified concentration. Moreover, to prompt a higher concentration, inspired by curriculum learning, we adopt an easy-to-hard hierarchical augmentation strategy and perform contrastive aggregation at different depths of the network based on the strengths of data augmentation. Our method is evaluated under three AD settings including unlabeled one-class, unlabeled multi-class, and labeled multi-class, demonstrating its consistent superiority over other competitors.

Submitted 20 August, 2023; originally announced August 2023.

Comments: Accepted by ICCV'2023

arXiv:2308.09678 [pdf, other]

PoSynDA: Multi-Hypothesis Pose Synthesis Domain Adaptation for Robust 3D Human Pose Estimation

Authors: Hanbing Liu, Jun-Yan He, Zhi-Qi Cheng, Wangmeng Xiang, Qize Yang, Wenhao Chai, Gaoang Wang, Xu Bao, Bin Luo, Yifeng Geng, Xuansong Xie

Abstract: Existing 3D human pose estimators face challenges in adapting to new datasets due to the lack of 2D-3D pose pairs in training sets. To overcome this issue, we propose the Multi-Hypothesis Pose Synthesis Domain Adaptation (PoSynDA) framework to bridge this data disparity gap in the target domain. Typically, PoSynDA uses a diffusion-inspired structure to simulate 3D pose distribution in the target domain. By incorporating a multi-hypothesis network, PoSynDA generates diverse pose hypotheses and aligns them with the target domain. To do this, it first utilizes target-specific source augmentation to obtain the target domain distribution data from the source domain by decoupling the scale and position parameters. The process is then further refined through the teacher-student paradigm and low-rank adaptation. With extensive comparison on benchmarks such as Human3.6M and MPI-INF-3DHP, PoSynDA demonstrates competitive performance, even comparable to the target-trained MixSTE model. This work paves the way for the practical application of 3D human pose estimation in unseen domains. The code is available at https://github.com/hbing-l/PoSynDA.

Submitted 16 October, 2023; v1 submitted 18 August, 2023; originally announced August 2023.

Comments: Accepted to ACM Multimedia 2023; 10 pages, 4 figures, 8 tables; the code is at https://github.com/hbing-l/PoSynDA

arXiv:2306.08925 [pdf, other]

Opinion Tree Parsing for Aspect-based Sentiment Analysis

Authors: Xiaoyi Bao, Xiaotong Jiang, Zhongqing Wang, Yue Zhang, Guodong Zhou

Abstract: Extracting sentiment elements using pre-trained generative models has recently led to large improvements in aspect-based sentiment analysis benchmarks. However, these models always need large-scale computing resources, and they also ignore explicit modeling of structure between sentiment elements. To address these challenges, we propose an opinion tree parsing model, aiming to parse all the sentiment elements from an opinion tree, which is much faster, and can explicitly reveal a more comprehensive and complete aspect-level sentiment structure. In particular, we first introduce a novel context-free opinion grammar to normalize the opinion tree structure. We then employ a neural chart-based opinion tree parser to fully explore the correlations among sentiment elements and parse them into an opinion tree structure. Extensive experiments show the superiority of our proposed model and the capacity of the opinion tree parser with the proposed context-free opinion grammar. More importantly, the results also prove that our model is much faster than previous models.

Submitted 15 June, 2023; originally announced June 2023.

arXiv:2305.16437 [pdf, other]

KeyPosS: Plug-and-Play Facial Landmark Detection through GPS-Inspired True-Range Multilateration

Authors: Xu Bao, Zhi-Qi Cheng, Jun-Yan He, Chenyang Li, Wangmeng Xiang, Jingdong Sun, Hanbing Liu, Wei Liu, Bin Luo, Yifeng Geng, Xuansong Xie

Abstract: Accurate facial landmark detection is critical for facial analysis tasks, yet prevailing heatmap and coordinate regression methods grapple with prohibitive computational costs and quantization errors. Through comprehensive theoretical analysis and experimentation, we identify and elucidate the limitations of existing techniques. To overcome these challenges, we pioneer the application of True-Range Multilateration, originally devised for GPS localization, to facial landmark detection. We propose KeyPoint Positioning System (KeyPosS) - the first framework to deduce exact landmark coordinates by triangulating distances between points of interest and anchor points predicted by a fully convolutional network. A key advantage of KeyPosS is its plug-and-play nature, enabling flexible integration into diverse decoding pipelines. Extensive experiments on four datasets demonstrate state-of-the-art performance, with KeyPosS outperforming existing methods in low-resolution settings despite minimal computational overhead. By spearheading the integration of Multilateration with facial analysis, KeyPosS marks a paradigm shift in facial landmark detection. The code is available at https://github.com/zhiqic/KeyPosS.

Submitted 23 September, 2023; v1 submitted 25 May, 2023; originally announced May 2023.

Comments: Accepted to ACM Multimedia 2023; 10 pages, 7 figures, 6 tables; the code is at https://github.com/zhiqic/KeyPosS
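
As a rough illustration of the true-range multilateration idea mentioned in the abstract above, the sketch below recovers a 2D point from its distances to known anchors by subtracting the first range equation from the others and solving the resulting linear system in the least-squares sense. This is a generic textbook formulation, not the KeyPosS decoding scheme; the function name `multilaterate`, the anchor layout, and the target point are all assumptions made for the example.

```python
from math import hypot


def multilaterate(anchors, distances):
    """Estimate (x, y) from >= 3 anchor points and their measured distances.

    Hypothetical helper: linearizes the range equations against the first
    anchor and solves the 2x2 normal equations of the least-squares problem.
    """
    (x0, y0), d0 = anchors[0], distances[0]
    rows, rhs = [], []
    for (xi, yi), di in zip(anchors[1:], distances[1:]):
        # (x-xi)^2 + (y-yi)^2 = di^2 minus the same equation for anchor 0
        rows.append((2 * (xi - x0), 2 * (yi - y0)))
        rhs.append(d0**2 - di**2 + xi**2 - x0**2 + yi**2 - y0**2)

    # Normal equations (A^T A) p = A^T b for the two unknowns x and y.
    a11 = sum(r[0] * r[0] for r in rows)
    a12 = sum(r[0] * r[1] for r in rows)
    a22 = sum(r[1] * r[1] for r in rows)
    b1 = sum(r[0] * v for r, v in zip(rows, rhs))
    b2 = sum(r[1] * v for r, v in zip(rows, rhs))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("anchors are collinear; position is not unique")
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


# Illustrative data: four anchors on a grid and exact distances to a target.
anchors = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
target = (1.5, 2.25)
distances = [hypot(target[0] - ax, target[1] - ay) for ax, ay in anchors]
estimate = multilaterate(anchors, distances)
```

With noise-free distances the estimate matches the target up to floating-point error; with noisy distances the same code returns the least-squares fit of the linearized system.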

arXiv:2305.08360 [pdf, other]

Improving ChatGPT Prompt for Code Generation

Authors: Chao Liu, Xuanlin Bao, Hongyu Zhang, Neng Zhang, Haibo Hu, Xiaohong Zhang, Meng Yan

Abstract: Automated code generation can be a powerful technique for software development, significantly reducing developers' efforts and time required to create new code by generating it automatically based on requirements. Recently, OpenAI's language model ChatGPT has emerged as a powerful tool for generating human-like responses to a wide range of textual inputs (i.e., prompts), including those related to code generation. However, the effectiveness of ChatGPT for code generation is not well understood, and the generation performance could be heavily influenced by the choice of prompt. To answer these questions, we conducted experiments using the CodeXGlue dataset to evaluate ChatGPT's capabilities for two code generation tasks, including text-to-code and code-to-code generation. We designed prompts by leveraging the chain-of-thought strategy with multi-step optimizations. Our results showed that by carefully designing prompts to guide ChatGPT, the generation performance can be improved substantially. We also analyzed the factors that influenced the prompt design and provided insights that could guide future research.

Submitted 15 May, 2023; originally announced May 2023.

Comments: 12 pages, 1 figure

arXiv:2211.06239 [pdf, other]

A monitoring framework for deployed machine learning models with supply chain examples

Authors: Bradley Eck, Duygu Kabakci-Zorlu, Yan Chen, France Savard, Xiaowei Bao

Abstract: Actively monitoring machine learning models during production operations helps ensure prediction quality and detection and remediation of unexpected or undesired conditions. Monitoring models already deployed in big data environments brings the additional challenges of adding monitoring in parallel to the existing modelling workflow and controlling resource requirements. In this paper, we describe (1) a framework for monitoring machine learning models; and, (2) its implementation for a big data supply chain application. We use our implementation to study drift in model features, predictions, and performance on three real data sets. We compare hypothesis test and information theoretic approaches to drift detection in features and predictions using the Kolmogorov-Smirnov distance and Bhattacharyya coefficient. Results showed that model performance was stable over the evaluation period. Features and predictions showed statistically significant drifts; however, these drifts were not linked to changes in model performance during the time of our study.

Submitted 11 November, 2022; originally announced November 2022.

Comments: 8 pages, 9 figures, IEEE Big Data 2022
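
The two drift measures named in the abstract above can be sketched in a few lines. The following is a minimal standard-library illustration of the two-sample Kolmogorov-Smirnov distance and a histogram-based Bhattacharyya coefficient; it is not the paper's implementation, and the function names, the equal-width binning, and the default of 10 bins are assumptions chosen for the example.

```python
from bisect import bisect_right
from math import sqrt


def ks_distance(reference, current):
    """Two-sample KS distance: largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    points = sorted(set(ref) | set(cur))
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in points
    )


def bhattacharyya_coefficient(reference, current, bins=10):
    """Overlap of two binned samples: near 1.0 = same histogram, 0.0 = disjoint."""
    lo = min(min(reference), min(current))
    hi = max(max(reference), max(current))
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def histogram(sample):
        counts = [0] * bins
        for value in sample:
            # Clamp so the maximum value lands in the last bin.
            index = min(int((value - lo) / width), bins - 1)
            counts[index] += 1
        return [count / len(sample) for count in counts]

    p, q = histogram(reference), histogram(current)
    return sum(sqrt(a * b) for a, b in zip(p, q))
```

In a monitoring setting, `reference` would typically hold a feature's values from a baseline window and `current` the values from a recent window; a large KS distance or a low Bhattacharyya coefficient suggests the feature's distribution has shifted.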

arXiv:2210.15511 [pdf, other]

ProContEXT: Exploring Progressive Context Transformer for Tracking

Authors: Jin-Peng Lan, Zhi-Qi Cheng, Jun-Yan He, Chenyang Li, Bin Luo, Xu Bao, Wangmeng Xiang, Yifeng Geng, Xuansong Xie

Abstract: Existing Visual Object Tracking (VOT) only takes the target area in the first frame as a template. This causes tracking to inevitably fail in fast-changing and crowded scenes, as it cannot account for changes in object appearance between frames. To this end, we revamped the tracking framework with Progressive Context Encoding Transformer Tracker (ProContEXT), which coherently exploits spatial and temporal contexts to predict object motion trajectories. Specifically, ProContEXT leverages a context-aware self-attention module to encode the spatial and temporal context, refining and updating the multi-scale static and dynamic templates to progressively perform accurate tracking. It explores the complementarity between spatial and temporal context, opening a new pathway to multi-context modeling for transformer-based trackers. In addition, ProContEXT revises the token pruning technique to reduce computational complexity. Extensive experiments on popular benchmark datasets such as GOT-10k and TrackingNet demonstrate that the proposed ProContEXT achieves state-of-the-art performance.

Submitted 30 March, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

Comments: Accepted at ICASSP 2023, source code is at https://github.com/zhiqic/ProContEXT

arXiv:2208.04897 [pdf, other]

Sports Video Analysis on Large-Scale Data

Authors: Dekun Wu, He Zhao, Xingce Bao, Richard P. Wildes

Abstract: This paper investigates the modeling of automated machine description on sports video, which has seen much progress recently. Nevertheless, state-of-the-art approaches fall quite short of capturing how human experts analyze sports scenes. There are several major reasons: (1) The used dataset is collected from non-official providers, which naturally creates a gap between models trained on those datasets and real-world applications; (2) previously proposed methods require extensive annotation efforts (i.e., player and ball segmentation at pixel level) on localizing useful visual features to yield acceptable results; (3) very few public datasets are available. In this paper, we propose a novel large-scale NBA dataset for Sports Video Analysis (NSVA) with a focus on captioning, to address the above challenges. We also design a unified approach to process raw videos into a stack of meaningful features with minimum labelling efforts, showing that cross modeling on such features using a transformer architecture leads to strong performance. In addition, we demonstrate the broad application of NSVA by addressing two additional tasks, namely fine-grained sports action recognition and salient player identification. Code and dataset are available at https://github.com/jackwu502/NSVA.

Submitted 9 August, 2022; originally announced August 2022.

arXiv:2208.04718 [pdf, other]

doi 10.1016/j.compbiomed.2022.106417

Improving COVID-19 CT Classification of CNNs by Learning Parameter-Efficient Representation

Authors: Yujia Xu, Hak-Keung Lam, Guangyu Jia, Jian Jiang, Junkai Liao, Xinqi Bao

Abstract: COVID-19 pandemic continues to spread rapidly over the world and causes a tremendous crisis in global human health and the economy. Its early detection and diagnosis are crucial for controlling the further spread. Many deep learning-based methods have been proposed to assist clinicians in automatic COVID-19 diagnosis based on computed tomography imaging. However, challenges still remain, including low data diversity in existing datasets, and unsatisfied detection resulting from insufficient accuracy and sensitivity of deep learning models. To enhance the data diversity, we design augmentation techniques of incremental levels and apply them to the largest open-access benchmark dataset, COVIDx CT-2A. Meanwhile, similarity regularization (SR) derived from contrastive learning is proposed in this study to enable CNNs to learn more parameter-efficient representations, thus improving the accuracy and sensitivity of CNNs. The results on seven commonly used CNNs demonstrate that CNN performance can be improved stably through applying the designed augmentation and SR techniques. In particular, DenseNet121 with SR achieves an average test accuracy of 99.44% in three trials for three-category classification, including normal, non-COVID-19 pneumonia, and COVID-19 pneumonia. And the achieved precision, sensitivity, and specificity for the COVID-19 pneumonia category are 98.40%, 99.59%, and 99.50%, respectively. These statistics suggest that our method has surpassed the existing state-of-the-art methods on the COVIDx CT-2A dataset.

Submitted 9 August, 2022; originally announced August 2022.

arXiv:2208.03128 [pdf, other]

Time-Frequency Distributions of Heart Sound Signals: A Comparative Study using Convolutional Neural Networks

Authors: Xinqi Bao, Yujia Xu, Hak-Keung Lam, Mohamed Trabelsi, Ines Chihi, Lilia Sidhom, Ernest N. Kamavuako

Abstract: Time-Frequency Distributions (TFDs) support the heart sound characterisation and classification in early cardiac screening. However, despite the frequent use of TFDs in signal analysis, no study comprehensively compared their performances on deep learning for automatic diagnosis. Furthermore, the combination of signal processing methods as inputs for Convolutional Neural Networks (CNNs) has been proved as a practical approach to increasing signal classification performance. Therefore, this study aimed to investigate the optimal use of TFD/ combined TFDs as input for CNNs. The presented results revealed that: 1) The transformation of the heart sound signal into the TF domain achieves higher classification performance than using of raw signals. Among the TFDs, the difference in the performance was slight for all the CNN models (within $1.3\%$ in average accuracy). However, Continuous wavelet transform (CWT) and Chirplet transform (CT) outperformed the rest. 2) The appropriate increase of the CNN capacity and architecture optimisation can improve the performance, while the network architecture should not be overly complicated. Based on the ResNet or SEResNet family results, the increase in the number of parameters and the depth of the structure do not improve the performance apparently. 3) Combining TFDs as CNN inputs did not significantly improve the classification results. The findings of this study provided the knowledge for selecting TFDs as CNN input and designing CNN architecture for heart sound classification.

Submitted 5 August, 2022; originally announced August 2022.

arXiv:2207.10172 [pdf, other]

Video Anomaly Detection by Solving Decoupled Spatio-Temporal Jigsaw Puzzles

Authors: Guodong Wang, Yunhong Wang, Jie Qin, Dongming Zhang, Xiuguo Bao, Di Huang

Abstract: Video Anomaly Detection (VAD) is an important topic in computer vision. Motivated by the recent advances in self-supervised learning, this paper addresses VAD by solving an intuitive yet challenging pretext task, i.e., spatio-temporal jigsaw puzzles, which is cast as a multi-label fine-grained classification problem. Our method exhibits several advantages over existing works: 1) the spatio-temporal jigsaw puzzles are decoupled in terms of spatial and temporal dimensions, responsible for capturing highly discriminative appearance and motion features, respectively; 2) full permutations are used to provide abundant jigsaw puzzles covering various difficulty levels, allowing the network to distinguish subtle spatio-temporal differences between normal and abnormal events; and 3) the pretext task is tackled in an end-to-end manner without relying on any pre-trained models. Our method outperforms state-of-the-art counterparts on three public benchmarks. Especially on ShanghaiTech Campus, the result is superior to reconstruction and prediction-based methods by a large margin.

Submitted 21 July, 2022; v1 submitted 20 July, 2022; originally announced July 2022.

Comments: Accepted by ECCV'2022; Code is available at https://github.com/gdwang08/Jigsaw-VAD

arXiv:2205.03569 [pdf, other]

Representation Learning for Compressed Video Action Recognition via Attentive Cross-modal Interaction with Motion Enhancement

Authors: Bing Li, Jiaxin Chen, Dongming Zhang, Xiuguo Bao, Di Huang

Abstract: Compressed video action recognition has recently drawn growing attention, since it remarkably reduces the storage and computational cost via replacing raw videos by sparsely sampled RGB frames and compressed motion cues (e.g., motion vectors and residuals). However, this task severely suffers from the coarse and noisy dynamics and the insufficient fusion of the heterogeneous RGB and motion modalities. To address the two issues above, this paper proposes a novel framework, namely Attentive Cross-modal Interaction Network with Motion Enhancement (MEACI-Net). It follows the two-stream architecture, i.e. one for the RGB modality and the other for the motion modality. Particularly, the motion stream employs a multi-scale block embedded with a denoising module to enhance representation learning. The interaction between the two streams is then strengthened by introducing the Selective Motion Complement (SMC) and Cross-Modality Augment (CMA) modules, where SMC complements the RGB modality with spatio-temporally attentive local motion features and CMA further combines the two modalities with selective feature augmentation. Extensive experiments on the UCF-101, HMDB-51 and Kinetics-400 benchmarks demonstrate the effectiveness and efficiency of MEACI-Net.

Submitted 15 June, 2022; v1 submitted 7 May, 2022; originally announced May 2022.

Comments: Accepted to IJCAI 2022

arXiv:2204.09783 [pdf, other]

TopoEmbedding, a web tool for the interactive analysis of persistent homology

Authors: Xueyi Bao, Guoxi Liu, Federico Iuricich

Abstract: Software libraries for Topological Data Analysis (TDA) offer limited support for interactive visualization. Most libraries only allow to visualize topological descriptors (e.g., persistence diagrams), and lose the connection with the original domain of data. This makes it challenging for users to interpret the results of a TDA pipeline in an exploratory context. In this paper, we present TopoEmbedding, a web-based tool that simplifies the interactive visualization and analysis of persistence-based descriptors. TopoEmbedding allows non-experts in TDA to explore similarities and differences found by TDA descriptors with simple yet effective visualization techniques.

Submitted 20 April, 2022; originally announced April 2022.

Report number: TDAatSDM/2022/10

arXiv:2203.08406 [pdf, ps, other]

Levenberg-Marquardt Method Based Cooperative Source Localization in SIMO Molecular Communication via Diffusion Systems

Authors: Yuqi Miao, Wence Zhang, Xu Bao

Abstract: Molecular communication underpins nano-scale communications in nanotechnology. The combination of multinanomachines to form nano-networks is one of the main enabling methods. Due to the importance of source localization in establishing nano-networks, this paper proposes a cooperative source localization method for Molecular Communication via Diffusion (MCvD) systems using multiple spherical absorption receivers. Since there is no exact mathematical expression of the channel impulse response for multiple absorbing receivers, we adopt an empirical expression and use Levenberg-Marquardt method to estimate the distance of the transmitter to each receiver, based on which the location of the transmitter is obtained using an iterative scheme where the initial point is obtained using a multi-point localization method. Particle based simulation is carried out to evaluate the performance of the proposed method. Simulation results show that the proposed method can accurately estimate the location of transmitter in short to medium communication ranges.

Submitted 16 March, 2022; originally announced March 2022.

arXiv:2111.05794 [pdf, other]

PIMIP: An Open Source Platform for Pathology Information Management and Integration

Authors: Jialun Wu, Anyu Mao, Xinrui Bao, Haichuan Zhang, Zeyu Gao, Chunbao Wang, Tieliang Gong, Chen Li

Abstract: Digital pathology plays a crucial role in the development of artificial intelligence in the medical field. The digital pathology platform can make the pathological resources digital and networked, and realize the permanent storage of visual data and the synchronous browsing processing without the limitation of time and space. It has been widely used in various fields of pathology. However, there is still a lack of an open and universal digital pathology platform to assist doctors in the management and analysis of digital pathological sections, as well as the management and structured description of relevant patient information. Most platforms cannot integrate image viewing, annotation and analysis, and text information management. To solve the above problems, we propose a comprehensive and extensible platform PIMIP. Our PIMIP has developed the image annotation functions based on the visualization of digital pathological sections. Our annotation functions support multi-user collaborative annotation and multi-device annotation, and realize the automation of some annotation tasks. In the annotation task, we invited a professional pathologist for guidance. We introduce a machine learning module for image analysis. The data we collected included public data from local hospitals and clinical examples. Our platform is more clinical and suitable for clinical use. In addition to image data, we also structured the management and display of text information. So our platform is comprehensive. The platform framework is built in a modular way to support users to add machine learning modules independently, which makes our platform extensible.

Submitted 9 November, 2021; originally announced November 2021.

Comments: BIBM 2021 accepted, including 8 pages, 8 figures

arXiv:2110.13670 [pdf, other]

W-Net: A Two-Stage Convolutional Network for Nucleus Detection in Histopathology Image

Authors: Anyu Mao, Jialun Wu, Xinrui Bao, Zeyu Gao, Tieliang Gong, Chen Li

Abstract: Pathological diagnosis is the gold standard for cancer diagnosis, but it is labor-intensive, in which tasks such as cell detection, classification, and counting are particularly prominent. A common solution for automating these tasks is using nucleus segmentation technology. However, it is hard to train a robust nucleus segmentation model, due to several challenging problems, the nucleus adhesion, stacking, and excessive fusion with the background. Recently, some researchers proposed a series of automatic nucleus segmentation methods based on point annotation, which can significant improve the model performance. Nevertheless, the point annotation needs to be marked by experienced pathologists. In order to take advantage of segmentation methods based on point annotation, further alleviate the manual workload, and make cancer diagnosis more efficient and accurate, it is necessary to develop an automatic nucleus detection algorithm, which can automatically and efficiently locate the position of the nucleus in the pathological image and extract valuable information for pathologists. In this paper, we propose a W-shaped network for automatic nucleus detection. Different from the traditional U-Net based method, mapping the original pathology image to the target mask directly, our proposed method split the detection task into two sub-tasks. The first sub-task maps the original pathology image to the binary mask, then the binary mask is mapped to the density mask in the second sub-task. After the task is split, the task's difficulty is significantly reduced, and the network's overall performance is improved.

Submitted 26 October, 2021; originally announced October 2021.

Comments: BIBM 2021 accepted, including 8 pages, 3 figures

arXiv:2110.13652 [pdf, other]

A Precision Diagnostic Framework of Renal Cell Carcinoma on Whole-Slide Images using Deep Learning

Authors: Jialun Wu, Haichuan Zhang, Zeyu Gao, Xinrui Bao, Tieliang Gong, Chunbao Wang, Chen Li

Abstract: Diagnostic pathology, which is the basis and gold standard of cancer diagnosis, provides essential information on the prognosis of the disease and vital evidence for clinical treatment. Tumor region detection, subtype and grade classification are the fundamental diagnostic indicators for renal cell carcinoma (RCC) in whole-slide images (WSIs). However, pathological diagnosis is subjective, differences in observation and diagnosis between pathologists is common in hospitals with inadequate diagnostic capacity. The main challenge for developing deep learning based RCC diagnostic system is the lack of large-scale datasets with precise annotations. In this work, we proposed a deep learning-based framework for analyzing histopathological images of patients with renal cell carcinoma, which has the potential to achieve pathologist-level accuracy in diagnosis. A deep convolutional neural network (InceptionV3) was trained on the high-quality annotated dataset of The Cancer Genome Atlas (TCGA) whole-slide histopathological image for accurate tumor area detection, classification of RCC subtypes, and ISUP grades classification of clear cell carcinoma subtypes. These results suggest that our framework can help pathologists in the detection of cancer region and classification of subtypes and grades, which could be applied to any cancer type, providing auxiliary diagnosis and promoting clinical consensus.

Submitted 26 October, 2021; originally announced October 2021.

Comments: BIBM 2021 accepted, 9 pages including reference, 3 figures and 1 table

arXiv:2108.07535 [pdf, other]

SPMoE: Generate Multiple Pattern-Aware Outputs with Sparse Pattern Mixture of Experts

Authors: Shaobo Cui, Xintong Bao, Xuming Lin, Zhongzhou Zhao, Ji Zhang, Wei Zhou, Haiqing Chen

Abstract: Many generation tasks follow a one-to-many mapping relationship: each input could be associated with multiple outputs. Existing methods like Conditional Variational AutoEncoder(CVAE) employ a latent variable to model this one-to-many relationship. However, this high-dimensional and dense latent variable lacks explainability and usually leads to poor and uncontrollable generations. In this paper, we innovatively introduce the linguistic concept of pattern to decompose the one-to-many mapping into multiple one-to-one mappings and further propose a model named Sparse Pattern Mixture of Experts(SPMoE). Each one-to-one mapping is associated with a conditional generation pattern and is modeled with an expert in SPMoE. To ensure each language pattern can be exclusively handled with an expert model for better explainability and diversity, a sparse mechanism is employed to coordinate all the expert models in SPMoE. We assess the performance of our SPMoE on the paraphrase generation task and the experiment results prove that SPMoE can achieve a good balance in terms of quality, pattern-level diversity, and corpus-level diversity.

Submitted 17 August, 2021; v1 submitted 17 August, 2021; originally announced August 2021.

arXiv:2108.02768 [pdf, other]

Learning to Elect

Authors: Cem Anil, Xuchan Bao

Abstract: Voting systems have a wide range of applications including recommender systems, web search, product design and elections. Limited by the lack of general-purpose analytical tools, it is difficult to hand-engineer desirable voting rules for each use case. For this reason, it is appealing to automatically discover voting rules geared towards each scenario. In this paper, we show that set-input neural network architectures such as Set Transformers, fully-connected graph networks and DeepSets are both theoretically and empirically well-suited for learning voting rules. In particular, we show that these network models can not only mimic a number of existing voting rules to compelling accuracy -- both position-based (such as Plurality and Borda) and comparison-based (such as Kemeny, Copeland and Maximin) -- but also discover near-optimal voting rules that maximize different social welfare functions. Furthermore, the learned voting rules generalize well to different voter utility distributions and election sizes unseen during training.

Submitted 1 October, 2021; v1 submitted 5 August, 2021; originally announced August 2021.

arXiv:2102.12128 [pdf, other]

OneStop QAMaker: Extract Question-Answer Pairs from Text in a One-Stop Approach

Authors: Shaobo Cui, Xintong Bao, Xinxing Zu, Yangyang Guo, Zhongzhou Zhao, Ji Zhang, Haiqing Chen

Abstract: Large-scale question-answer (QA) pairs are critical for advancing research areas like machine reading comprehension and question answering. To construct QA pairs from documents requires determining how to ask a question and what is the corresponding answer. Existing methods for QA pair generation usually follow a pipeline approach. Namely, they first choose the most likely candidate answer span and then generate the answer-specific question. This pipeline approach, however, is undesired in mining the most appropriate QA pairs from documents since it ignores the connection between question generation and answer extraction, which may lead to incompatible QA pair generation, i.e., the selected answer span is inappropriate for question generation. However, for human annotators, we take the whole QA pair into account and consider the compatibility between question and answer. Inspired by such motivation, instead of the conventional pipeline approach, we propose a model named OneStop generate QA pairs from documents in a one-stop approach. Specifically, questions and their corresponding answer span is extracted simultaneously and the process of question generation and answer extraction mutually affect each other. Additionally, OneStop is much more efficient to be trained and deployed in industrial scenarios since it involves only one model to solve the complex QA generation task. We conduct comprehensive experiments on three large-scale machine reading comprehension datasets: SQuAD, NewsQA, and DuReader. The experimental results demonstrate that our OneStop model outperforms the baselines significantly regarding the quality of generated questions, quality of generated question-answer pairs, and model efficiency.

Submitted 24 February, 2021; originally announced February 2021.

Comments: 8 pages

arXiv:2009.11359 [pdf, other]

A Unified Analysis of First-Order Methods for Smooth Games via Integral Quadratic Constraints

Authors: Guodong Zhang, Xuchan Bao, Laurent Lessard, Roger Grosse

Abstract: The theory of integral quadratic constraints (IQCs) allows the certification of exponential convergence of interconnected systems containing nonlinear or uncertain elements. In this work, we adapt the IQC theory to study first-order methods for smooth and strongly-monotone games and show how to design tailored quadratic constraints to get tight upper bounds of convergence rates. Using this framework, we recover the existing bound for the gradient method~(GD), derive sharper bounds for the proximal point method~(PPM) and optimistic gradient method~(OG), and provide \emph{for the first time} a global convergence rate for the negative momentum method~(NM) with an iteration complexity $\mathcal{O}(κ^{1.5})$, which matches its known lower bound. In addition, for time-varying systems, we prove that the gradient method with optimal step size achieves the fastest provable worst-case convergence rate with quadratic Lyapunov functions. Finally, we further extend our analysis to stochastic games and study the impact of multiplicative noise on different algorithms. We show that it is impossible for an algorithm with one step of memory to achieve acceleration if it only queries the gradient once per batch (in contrast with the stochastic strongly-convex optimization setting, where such acceleration has been demonstrated). However, we exhibit an algorithm which achieves acceleration with two gradient queries per batch.

Submitted 26 April, 2021; v1 submitted 23 September, 2020; originally announced September 2020.

Comments: Journal of Machine Learning Research

arXiv:2007.06731 [pdf, other]

Regularized linear autoencoders recover the principal components, eventually

Authors: Xuchan Bao, James Lucas, Sushant Sachdeva, Roger Grosse

Abstract: Our understanding of learning input-output relationships with neural nets has improved rapidly in recent years, but little is known about the convergence of the underlying representations, even in the simple case of linear autoencoders (LAEs). We show that when trained with proper regularization, LAEs can directly learn the optimal representation -- ordered, axis-aligned principal components. We analyze two such regularization schemes: non-uniform $\ell_2$ regularization and a deterministic variant of nested dropout [Rippel et al, ICML' 2014]. Though both regularization schemes converge to the optimal representation, we show that this convergence is slow due to ill-conditioning that worsens with increasing latent dimension. We show that the inefficiency of learning the optimal representation is not inevitable -- we present a simple modification to the gradient descent update that greatly speeds up convergence empirically.

Submitted 1 October, 2021; v1 submitted 13 July, 2020; originally announced July 2020.

Journal ref: Advances in Neural Information Processing Systems 33 (NeurIPS 2020)

arXiv:2007.04298 [pdf, other]

Building Interpretable Interaction Trees for Deep NLP Models

Authors: Die Zhang, Huilin Zhou, Hao Zhang, Xiaoyi Bao, Da Huo, Ruizhao Chen, Xu Cheng, Mengyue Wu, Quanshi Zhang

Abstract: This paper proposes a method to disentangle and quantify interactions among words that are encoded inside a DNN for natural language processing. We construct a tree to encode salient interactions extracted by the DNN. Six metrics are proposed to analyze properties of interactions between constituents in a sentence. The interaction is defined based on Shapley values of words, which are considered as an unbiased estimation of word contributions to the network prediction. Our method is used to quantify word interactions encoded inside the BERT, ELMo, LSTM, CNN, and Transformer networks. Experimental results have provided a new perspective to understand these DNNs, and have demonstrated the effectiveness of our method.

Submitted 16 January, 2021; v1 submitted 29 June, 2020; originally announced July 2020.

arXiv:1907.02057 [pdf, other]

Benchmarking Model-Based Reinforcement Learning

Authors: Tingwu Wang, Xuchan Bao, Ignasi Clavera, Jerrick Hoang, Yeming Wen, Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, Jimmy Ba

Abstract: Model-based reinforcement learning (MBRL) is widely seen as having the potential to be significantly more sample efficient than model-free RL. However, research in model-based RL has not been very standardized. It is fairly common for authors to experiment with self-designed environments, and there are several separate lines of research, which are sometimes closed-source or not reproducible. Accordingly, it is an open question how these various existing MBRL algorithms perform relative to each other. To facilitate research in MBRL, in this paper we gather a wide collection of MBRL algorithms and propose over 18 benchmarking environments specially designed for MBRL. We benchmark these algorithms with unified problem settings, including noisy environments. Beyond cataloguing performance, we explore and unify the underlying algorithmic differences across MBRL algorithms. We characterize three key research challenges for future MBRL research: the dynamics bottleneck, the planning horizon dilemma, and the early-termination dilemma. Finally, to maximally facilitate future research on MBRL, we open-source our benchmark at http://www.cs.toronto.edu/~tingwuwang/mbrl.html.

Submitted 3 July, 2019; originally announced July 2019.

Comments: 8 main pages, 8 figures; 14 appendix pages, 25 figures

arXiv:1811.09620 [pdf, other]

TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer

Authors: Sicong Huang, Qiyang Li, Cem Anil, Xuchan Bao, Sageev Oore, Roger B. Grosse

Abstract: In this work, we address the problem of musical timbre transfer, where the goal is to manipulate the timbre of a sound sample from one instrument to match another instrument while preserving other musical content, such as pitch, rhythm, and loudness. In principle, one could apply image-based style transfer techniques to a time-frequency representation of an audio signal, but this depends on having a representation that allows independent manipulation of timbre as well as high-quality waveform generation. We introduce TimbreTron, a method for musical timbre transfer which applies "image" domain style transfer to a time-frequency representation of the audio signal, and then produces a high-quality waveform using a conditional WaveNet synthesizer. We show that the Constant Q Transform (CQT) representation is particularly well-suited to convolutional architectures due to its approximate pitch equivariance. Based on human perceptual evaluations, we confirmed that TimbreTron recognizably transferred the timbre while otherwise preserving the musical content, for both monophonic and polyphonic samples.

Submitted 22 October, 2023; v1 submitted 22 November, 2018; originally announced November 2018.

Comments: 17 pages, published as a conference paper at ICLR 2019

Journal ref: ICLR 2019

arXiv:1810.08534 [pdf, other]

MsCGAN: Multi-scale Conditional Generative Adversarial Networks for Person Image Generation

Authors: Wei Tang, Gui Li, Xinyuan Bao, Teng Li

Abstract: Synthesizing high-quality person images with arbitrary poses is challenging. In this paper, we propose a novel Multi-scale Conditional Generative Adversarial Network (MsCGAN), aiming to convert the input conditional person image to a synthetic image of any given target pose, whose appearance and texture are consistent with the input image. MsCGAN is a multi-scale adversarial network consisting of two generators and two discriminators. One generator transforms the conditional person image into a coarse image of the target pose globally, and the other enhances the detailed quality of the synthetic person image through a local reinforcement network. The outputs of the two generators are then merged into a synthetic, discriminant and high-resolution image. In addition, the synthetic image is downsampled to multiple resolutions as the input to multi-scale discriminator networks. The proposed multi-scale generators and discriminators, which handle different levels of visual features, benefit the synthesis of high-resolution person images with realistic appearance and texture. Experiments are conducted on the Market-1501 and DeepFashion datasets to evaluate the proposed model, and both qualitative and quantitative results demonstrate the superior performance of the proposed MsCGAN.

Submitted 5 March, 2020; v1 submitted 19 October, 2018; originally announced October 2018.

arXiv:1704.03118 [pdf, other]

PIANO: Proximity-based User Authentication on Voice-Powered Internet-of-Things Devices

Authors: Neil Zhenqiang Gong, Altay Ozen, Yu Wu, Xiaoyu Cao, Richard Shin, Dawn Song, Hongxia Jin, Xuan Bao

Abstract: Voice is envisioned to be a popular way for humans to interact with Internet-of-Things (IoT) devices. We propose a proximity-based user authentication method (called PIANO) for access control on such voice-powered IoT devices. PIANO leverages the built-in speaker, microphone, and Bluetooth that voice-powered IoT devices often already have. Specifically, we assume that a user carries a personal voice-powered device (e.g., smartphone, smartwatch, or smartglass), which serves as the user's identity. When another voice-powered IoT device of the user requires authentication, PIANO estimates the distance between the two devices by playing and detecting certain acoustic signals; PIANO grants access if the estimated distance is no larger than a user-selected threshold. We implemented a proof-of-concept prototype of PIANO. Through theoretical and empirical evaluations, we find that PIANO is secure, reliable, personalizable, and efficient.

Submitted 10 April, 2017; originally announced April 2017.

Comments: To appear in ICDCS'17
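
The access rule in the PIANO abstract above (grant access when the estimated distance is no larger than a user-selected threshold) can be illustrated with a minimal sketch. The abstract does not say how the acoustic signals are turned into a distance; the one-way time-of-flight conversion below, and the names `estimate_distance_m` and `grant_access`, are assumptions made for illustration and are not taken from the paper.

```python
# Illustrative sketch only: the time-of-flight model is an assumption,
# not the ranging method described in the PIANO paper.
SPEED_OF_SOUND_M_PER_S = 343.0  # approximate value in dry air at about 20 degrees C


def estimate_distance_m(time_of_flight_s: float) -> float:
    """Convert a one-way acoustic time of flight (seconds) to metres."""
    if time_of_flight_s < 0:
        raise ValueError("time of flight must be non-negative")
    return SPEED_OF_SOUND_M_PER_S * time_of_flight_s


def grant_access(estimated_distance_m: float, threshold_m: float) -> bool:
    """Grant access when the estimate is no larger than the threshold."""
    return estimated_distance_m <= threshold_m


near = estimate_distance_m(0.005)  # 5 ms of travel time
far = estimate_distance_m(0.010)   # 10 ms of travel time
near_ok = grant_access(near, threshold_m=2.0)
far_ok = grant_access(far, threshold_m=2.0)
```

With a 2 m threshold, the 5 ms measurement is accepted and the 10 ms measurement is rejected. The comparison is inclusive, matching the abstract's "no larger than" wording.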

arXiv:1702.08703 [pdf, ps, other]

Widely-Linear Precoding for Large-Scale MIMO with IQI: Algorithms and Performance Analysis

Authors: Wence Zhang, Rodrigo C. de Lamare, Cunhua Pan, Ming Chen, Jianxin Dai, Bingyang Wu, Xu Bao

Abstract: In this paper we study widely-linear precoding techniques to mitigate in-phase/quadrature-phase (IQ) imbalance (IQI) in the downlink of large-scale multiple-input multiple-output (MIMO) systems. We adopt a real-valued signal model which takes into account the IQI at the transmitter and then develop widely-linear zero-forcing (WL-ZF), widely-linear matched filter (WL-MF), widely-linear minimum mean-squared error (WL-MMSE) and widely-linear block-diagonalization (WL-BD) type precoding algorithms for both single- and multiple-antenna users. We also present a performance analysis of WL-ZF and WL-BD. It is proved that without IQI, WL-ZF has exactly the same multiplexing gain and power offset as ZF, while when IQI exists, WL-ZF achieves the same multiplexing gain as ZF with ideal IQ branches, but with a minor power loss which is related to the system scale and the IQ parameters. We also compare the performance of WL-BD with BD. The analysis shows that with ideal IQ branches, WL-BD has the same data rate as BD, while when IQI exists, WL-BD achieves the same multiplexing gain as BD without IQ imbalance. Numerical results verify the analysis and show that the proposed widely-linear type precoding methods significantly outperform their conventional counterparts with IQI and approach those with ideal IQ branches.

Submitted 28 February, 2017; originally announced February 2017.

Comments: Accepted in IEEE TWC

arXiv:1610.06283 [pdf, other]

Deep Neural Networks for Improved, Impromptu Trajectory Tracking of Quadrotors

Authors: Qiyang Li, Jingxing Qian, Zining Zhu, Xuchan Bao, Mohamed K. Helwa, Angela P. Schoellig

Abstract: Trajectory tracking control for quadrotors is important for applications ranging from surveying and inspection to film making. However, designing and tuning classical controllers, such as proportional-integral-derivative (PID) controllers, to achieve high tracking precision can be time-consuming and difficult, due to hidden dynamics and other non-idealities. The Deep Neural Network (DNN), with its superior capability of approximating abstract, nonlinear functions, offers a novel approach for enhancing trajectory tracking control. This paper presents a DNN-based algorithm as an add-on module that improves the tracking performance of a classical feedback controller. Given a desired trajectory, the DNNs provide a tailored reference input to the controller based on their gained experience. The input aims to achieve a unity map between the desired and the output trajectory. The motivation for this work is an interactive "fly-as-you-draw" application, in which a user draws a trajectory on a mobile device, and a quadrotor instantly flies that trajectory with the DNN-enhanced control system. Experimental results demonstrate that the proposed approach improves the tracking precision for user-drawn trajectories after the DNNs are trained on selected periodic trajectories, suggesting the method's potential in real-world applications. Tracking errors are reduced by around 40-50% for both training and testing trajectories from users, highlighting the DNNs' capability of generalizing knowledge.

Submitted 19 July, 2017; v1 submitted 20 October, 2016; originally announced October 2016.

Comments: 7 pages, 8 figures. Accepted final version. To appear in the proc. of the 2017 IEEE International Conference on Robotics and Automation

arXiv:1608.07188 [pdf, ps, other]

doi 10.1109/LSP.2016.2636319

Root Sparse Bayesian Learning for Off-Grid DOA Estimation

Authors: Jisheng Dai, Xu Bao, Weichao Xu, Chunqi Chang

Abstract: The performance of the existing sparse Bayesian learning (SBL) methods for off-grid DOA estimation is dependent on the trade-off between the accuracy and the computational workload. To speed up the off-grid SBL method while maintaining reasonable accuracy, this letter describes a computationally efficient root SBL method for off-grid DOA estimation, where a coarse refinable grid, whose sampled locations are viewed as the adjustable parameters, is adopted. We utilize an expectation-maximization (EM) algorithm to iteratively refine this coarse grid, and illustrate that each updated grid point can be simply achieved by the root of a certain polynomial. Simulation results demonstrate that the computational complexity is significantly reduced and the modeling error can be almost eliminated.

Submitted 4 December, 2016; v1 submitted 25 August, 2016; originally announced August 2016.

Comments: 4 pages, 4 figures

arXiv:1511.01804 [pdf]

Wood Species Recognition Based on SIFT Keypoint Histogram

Authors: Shuaiqi Hu, Ke Li, Xudong Bao

Abstract: Traditionally, only experts who are equipped with professional knowledge and rich experience are able to recognize different species of wood. Applying image processing techniques for wood species recognition can not only reduce the expense to train qualified identifiers, but also increase the recognition accuracy. In this paper, a wood species recognition technique based on the Scale Invariant Feature Transform (SIFT) keypoint histogram is proposed. We first use the SIFT algorithm to extract keypoints from wood cross section images, and then k-means and k-means++ algorithms are used for clustering. Using the clustering results, a SIFT keypoint histogram is calculated for each wood image. Furthermore, several classification models, including Artificial Neural Networks (ANN), Support Vector Machine (SVM) and K-Nearest Neighbor (KNN) are used to verify the performance of the method. Finally, through comparing with other prevalent wood recognition methods such as GLCM and LBP, results show that our scheme achieves higher accuracy.

Submitted 15 December, 2015; v1 submitted 5 November, 2015; originally announced November 2015.

Comments: CISP 2015
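
The clustering and histogram steps named in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: random vectors stand in for real 128-dimensional SIFT descriptors, plain Lloyd's k-means with random initialization replaces the k-means++ seeding the abstract mentions, and the vocabulary size of 16 and all function names are hypothetical.

```python
import numpy as np


def assign(points, centroids):
    """Return the index of the nearest centroid for each point."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means with random initialization; returns (k, d) centroids."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(points), size=k, replace=False)
    centroids = points[start].astype(float)
    for _ in range(iters):
        labels = assign(points, centroids)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def keypoint_histogram(descriptors, centroids):
    """Normalized count of an image's descriptors per cluster (visual word)."""
    labels = assign(descriptors, centroids)
    counts = np.bincount(labels, minlength=len(centroids)).astype(float)
    return counts / counts.sum()


rng = np.random.default_rng(1)
# Stand-in for SIFT descriptors pooled from all training images.
training_descriptors = rng.random((500, 128))
vocabulary = kmeans(training_descriptors, k=16)

# Stand-in for the descriptors extracted from one wood image.
image_descriptors = rng.random((80, 128))
hist = keypoint_histogram(image_descriptors, vocabulary)
```

The resulting fixed-length vector `hist` is what a classifier such as an SVM or KNN would consume, regardless of how many keypoints each image produced.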

arXiv:1401.3582 [pdf, ps, other]

The equivalent identities of the MacWilliams identity for linear codes

Authors: Xiaomin Bao

Abstract: We use derivatives to prove the equivalences between the MacWilliams identity and its four equivalent forms, and present new interpretations for the four equivalent forms. Our results explicitly give the relationships between the MacWilliams identity and its four equivalent forms.

Submitted 8 February, 2014; v1 submitted 23 December, 2013; originally announced January 2014.
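
For orientation, the MacWilliams identity referred to in the abstract above can be written out in its standard form. The statement below assumes the binary case and the homogeneous weight enumerator; the abstract does not list the four equivalent forms, so they are not reproduced here.

```latex
% Assumption: C is a binary linear code of length n, C^\perp is its dual,
% and wt(c) is the Hamming weight of the codeword c.
\[
  W_C(x, y) = \sum_{c \in C} x^{\,n - \mathrm{wt}(c)} \, y^{\,\mathrm{wt}(c)},
  \qquad
  W_{C^\perp}(x, y) = \frac{1}{|C|} \, W_C(x + y,\; x - y).
\]
% Check with the length-3 repetition code C = \{000, 111\}:
\[
  W_C(x, y) = x^3 + y^3,
  \qquad
  \frac{1}{2}\left[(x + y)^3 + (x - y)^3\right] = x^3 + 3 x y^2 ,
\]
% which is the weight enumerator of the dual code \{000, 011, 101, 110\}.
```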

arXiv:1106.5568 [pdf]

Opportunistic Content Search of Smartphone Photos

Authors: Ardalan Amiri Sani, Wolfgang Richter, Xuan Bao, Trevor Narayan, Mahadev Satyanarayanan, Lin Zhong, Romit Roy Choudhury

Abstract: Photos taken by smartphone users can accidentally contain content that is timely and valuable to others, often in real-time. We report the system design and evaluation of a distributed search system, Theia, for crowd-sourced real-time content search of smartphone photos. Because smartphones are resource-constrained, Theia incorporates two key innovations to control search cost and improve search efficiency. Incremental Search expands search scope incrementally and exploits user feedback. Partitioned Search leverages the cloud to reduce the energy consumption of search in smartphones. Through user studies, measurement studies, and field studies, we show that Theia reduces the cost per relevant photo by an average of 59%. It reduces the energy consumption of search by up to 55% and 81% compared to alternative strategies of executing entirely locally or entirely in the cloud. Search results from smartphones are obtained in seconds. Our experiments also suggest approaches to further improve these results.

Submitted 28 June, 2011; originally announced June 2011.

Report number: Technical Report TR0627-2011, Rice University

arXiv:1002.3629 [pdf, ps, other]

Generalized Adaptive Network Coded Cooperation (GANCC): A Unified Framework for Network Coding and Channel Coding

Authors: Xingkai Bao, Jing Li

Abstract: This paper considers distributed coding for multi-source single-sink data collection wireless networks. A unified framework for network coding and channel coding, termed "generalized adaptive network coded cooperation" (GANCC), is proposed. Key ingredients of GANCC include: matching code graphs with the dynamic network graphs on-the-fly, and integrating channel coding with network coding through circulant low-density parity-check codes. Several code construction methods and several families of sparse-graph codes are proposed, and information theoretical analysis is performed. It is shown that GANCC is simple to operate, adaptive in real time, distributed in nature, and capable of providing remarkable coding gains even with a very limited number of cooperating users.

Submitted 18 February, 2010; originally announced February 2010.

arXiv:1002.3602 [pdf, ps, other]

Mobile Wireless Localization through Cooperation

Authors: Xingkai Bao, Jing Li

Abstract: This paper considers N mobile nodes that move together in the vicinity of each other, whose initial poses as well as subsequent movements must be accurately tracked in real time with the assistance of M (>= 3) reference nodes. By engaging the neighboring mobile nodes in a simple but effective cooperation, and by exploiting both the time-of-arrival (TOA) information (between mobile nodes and reference nodes) and the received-signal-strength (RSS) information (between mobile nodes), an effective new localization strategy, termed cooperative TOA and RSS (COTAR), is developed. An optimal maximum likelihood detector is first formulated, followed by the derivation of a low-complexity iterative approach that can practically achieve the Cramer-Rao lower bound. Instead of using simplified channel models as in many previous studies, a sophisticated and realistic channel model is used, which can effectively account for the critical fact that the direct path is not necessarily the strongest path. Extensive simulations are conducted in static and mobile settings, and various practical issues and system parameters are evaluated. It is shown that COTAR significantly outperforms the existing strategies, achieving a localization accuracy of only a few tenths of a meter in clear environments and a couple of meters in heavily obstructed environments.

Submitted 3 August, 2011; v1 submitted 18 February, 2010; originally announced February 2010.

Showing 1–44 of 44 results for author: Bao, X