Search | arXiv e-print repository

Global-Local Convolution with Spiking Neural Networks for Energy-efficient Keyword Spotting

Authors: Shuai Wang, Dehao Zhang, Kexin Shi, Yuchen Wang, Wenjie Wei, Jibin Wu, Malu Zhang

Abstract: Thanks to Deep Neural Networks (DNNs), the accuracy of Keyword Spotting (KWS) has made substantial progress. However, as KWS systems are usually implemented on edge devices, energy efficiency becomes a critical requirement besides performance. Here, we take advantage of spiking neural networks' energy efficiency and propose an end-to-end lightweight KWS model. The model consists of two innovative… ▽ More Thanks to Deep Neural Networks (DNNs), the accuracy of Keyword Spotting (KWS) has made substantial progress. However, as KWS systems are usually implemented on edge devices, energy efficiency becomes a critical requirement besides performance. Here, we take advantage of spiking neural networks' energy efficiency and propose an end-to-end lightweight KWS model. The model consists of two innovative modules: 1) Global-Local Spiking Convolution (GLSC) module and 2) Bottleneck-PLIF module. Compared to the hand-crafted feature extraction methods, the GLSC module achieves speech feature extraction that is sparser, more energy-efficient, and yields better performance. The Bottleneck-PLIF module further processes the signals from GLSC with the aim to achieve higher accuracy with fewer parameters. Extensive experiments are conducted on the Google Speech Commands Dataset (V1 and V2). The results show our method achieves competitive performance among SNN-based KWS models with fewer parameters. △ Less

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.07835 [pdf, other]

SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature

Authors: David Wadden, Kejian Shi, Jacob Morrison, Aakanksha Naik, Shruti Singh, Nitzan Barzilay, Kyle Lo, Tom Hope, Luca Soldaini, Shannon Zejiang Shen, Doug Downey, Hannaneh Hajishirzi, Arman Cohan

Abstract: We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks covering five essential scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification. SciRIFF demonstrations are notable for their long input contexts, detailed t… ▽ More We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks covering five essential scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification. SciRIFF demonstrations are notable for their long input contexts, detailed task specifications, and complex structured outputs. While instruction-following resources are available in specific domains such as clinical medicine and chemistry, SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields. To demonstrate the utility of SciRIFF, we develop a sample-efficient strategy to adapt a general instruction-following model for science by performing additional finetuning on a mix of general-domain and SciRIFF demonstrations. In evaluations on nine held-out scientific tasks, our model -- called SciTulu -- improves over a strong LLM baseline by 28.1% and 6.5% at the 7B and 70B scales respectively, while maintaining general instruction-following performance within 2% of the baseline. We are optimistic that SciRIFF will facilitate the development and evaluation of LLMs to help researchers navigate the ever-growing body of scientific literature. We release our dataset, model checkpoints, and data processing and evaluation code to enable further research. △ Less

Submitted 18 June, 2024; v1 submitted 10 June, 2024; originally announced June 2024.

Comments: Submitted to NeurIPS Datasets and Benchmarks 2024

arXiv:2405.20215 [pdf, other]

TS-Align: A Teacher-Student Collaborative Framework for Scalable Iterative Finetuning of Large Language Models

Authors: Chen Zhang, Chengguang Tang, Dading Chong, Ke Shi, Guohua Tang, Feng Jiang, Haizhou Li

Abstract: Mainstream approaches to aligning large language models (LLMs) heavily rely on human preference data, particularly when models require periodic updates. The standard process for iterative alignment of LLMs involves collecting new human feedback for each update. However, the data collection process is costly and challenging to scale. To address this issue, we introduce the "TS-Align" framework, whi… ▽ More Mainstream approaches to aligning large language models (LLMs) heavily rely on human preference data, particularly when models require periodic updates. The standard process for iterative alignment of LLMs involves collecting new human feedback for each update. However, the data collection process is costly and challenging to scale. To address this issue, we introduce the "TS-Align" framework, which fine-tunes a policy model using pairwise feedback data automatically mined from its outputs. This automatic mining process is efficiently accomplished through the collaboration between a large-scale teacher model and a small-scale student model. The policy fine-tuning process can be iteratively repeated using on-policy generations within our proposed teacher-student collaborative framework. Through extensive experiments, we demonstrate that our final aligned policy outperforms the base policy model with an average win rate of 69.7% across seven conversational or instruction-following datasets. Furthermore, we show that the ranking capability of the teacher is effectively distilled into the student through our pipeline, resulting in a small-scale yet effective reward model for policy model alignment. △ Less

Submitted 14 June, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.19299 [pdf, other]

Expert-Guided Extinction of Toxic Tokens for Debiased Generation

Authors: Xueyao Sun, Kaize Shi, Haoran Tang, Guandong Xu, Qing Li

Abstract: Large language models (LLMs) can elicit social bias during generations, especially when inference with toxic prompts. Controlling the sensitive attributes in generation encounters challenges in data distribution, generalizability, and efficiency. Specifically, fine-tuning and retrieval demand extensive unbiased corpus, while direct prompting requires meticulously curated instructions for correctin… ▽ More Large language models (LLMs) can elicit social bias during generations, especially when inference with toxic prompts. Controlling the sensitive attributes in generation encounters challenges in data distribution, generalizability, and efficiency. Specifically, fine-tuning and retrieval demand extensive unbiased corpus, while direct prompting requires meticulously curated instructions for correcting the output in multiple rounds of thoughts but poses challenges on memory and inference latency. In this work, we propose the Expert-Guided Extinction of Toxic Tokens for Debiased Generation (EXPOSED) to eliminate the undesired harmful outputs for LLMs without the aforementioned requirements. EXPOSED constructs a debiasing expert based on the abundant toxic corpus to expose and elicit the potentially dangerous tokens. It then processes the output to the LLMs and constructs a fair distribution by suppressing and attenuating the toxic tokens. EXPOSED is evaluated on fairness benchmarks over three LLM families. Extensive experiments demonstrate that compared with other baselines, the proposed EXPOSED significantly reduces the potential social bias while balancing fairness and generation performance. △ Less

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.17659 [pdf, other]

Enhancing Global Sensitivity and Uncertainty Quantification in Medical Image Reconstruction with Monte Carlo Arbitrary-Masked Mamba

Authors: Jiahao Huang, Liutao Yang, Fanwen Wang, Yang Nan, Weiwen Wu, Chengyan Wang, Kuangyu Shi, Angelica I. Aviles-Rivero, Carola-Bibiane Schönlieb, Daoqiang Zhang, Guang Yang

Abstract: Deep learning has been extensively applied in medical image reconstruction, where Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) represent the predominant paradigms, each possessing distinct advantages and inherent limitations: CNNs exhibit linear complexity with local sensitivity, whereas ViTs demonstrate quadratic complexity with global sensitivity. The emerging Mamba has sh… ▽ More Deep learning has been extensively applied in medical image reconstruction, where Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) represent the predominant paradigms, each possessing distinct advantages and inherent limitations: CNNs exhibit linear complexity with local sensitivity, whereas ViTs demonstrate quadratic complexity with global sensitivity. The emerging Mamba has shown superiority in learning visual representation, which combines the advantages of linear scalability and global sensitivity. In this study, we introduce MambaMIR, an Arbitrary-Masked Mamba-based model with wavelet decomposition for joint medical image reconstruction and uncertainty estimation. A novel Arbitrary Scan Masking (ASM) mechanism "masks out" redundant information to introduce randomness for further uncertainty estimation. Compared to the commonly used Monte Carlo (MC) dropout, our proposed MC-ASM provides an uncertainty map without the need for hyperparameter tuning and mitigates the performance drop typically observed when applying dropout to low-level tasks. For further texture preservation and better perceptual quality, we employ the wavelet transformation into MambaMIR and explore its variant based on the Generative Adversarial Network, namely MambaMIR-GAN. Comprehensive experiments have been conducted for multiple representative medical image reconstruction tasks, demonstrating that the proposed MambaMIR and MambaMIR-GAN outperform other baseline and state-of-the-art methods in different reconstruction tasks, where MambaMIR achieves the best reconstruction fidelity and MambaMIR-GAN has the best perceptual quality. In addition, our MC-ASM provides uncertainty maps as an additional tool for clinicians, while mitigating the typical performance drop caused by the commonly used dropout. △ Less

Submitted 25 June, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.03085 [pdf, other]

Compressing Long Context for Enhancing RAG with AMR-based Concept Distillation

Authors: Kaize Shi, Xueyao Sun, Qing Li, Guandong Xu

Abstract: Large Language Models (LLMs) have made significant strides in information acquisition. However, their overreliance on potentially flawed parametric knowledge leads to hallucinations and inaccuracies, particularly when handling long-tail, domain-specific queries. Retrieval Augmented Generation (RAG) addresses this limitation by incorporating external, non-parametric knowledge. Nevertheless, the ret… ▽ More Large Language Models (LLMs) have made significant strides in information acquisition. However, their overreliance on potentially flawed parametric knowledge leads to hallucinations and inaccuracies, particularly when handling long-tail, domain-specific queries. Retrieval Augmented Generation (RAG) addresses this limitation by incorporating external, non-parametric knowledge. Nevertheless, the retrieved long-context documents often contain noisy, irrelevant information alongside vital knowledge, negatively diluting LLMs' attention. Inspired by the supportive role of essential concepts in individuals' reading comprehension, we propose a novel concept-based RAG framework with the Abstract Meaning Representation (AMR)-based concept distillation algorithm. The proposed algorithm compresses the cluttered raw retrieved documents into a compact set of crucial concepts distilled from the informative nodes of AMR by referring to reliable linguistic features. The concepts explicitly constrain LLMs to focus solely on vital information in the inference process. We conduct extensive experiments on open-domain question-answering datasets to empirically evaluate the proposed method's effectiveness. The results indicate that the concept-based RAG framework outperforms other baseline methods, particularly as the number of supporting documents increases, while also exhibiting robustness across various backbone LLMs. This emphasizes the distilled concepts are informative for augmenting the RAG process by filtering out interference information. To the best of our knowledge, this is the first work introducing AMR to enhance the RAG, presenting a potential solution to augment inference performance with semantic-based context compression. △ Less

Submitted 5 May, 2024; originally announced May 2024.

arXiv:2405.01109 [pdf, other]

Hypergraph $p$-Laplacian regularization on point clouds for data interpolation

Authors: Kehan Shi, Martin Burger

Abstract: As a generalization of graphs, hypergraphs are widely used to model higher-order relations in data. This paper explores the benefit of the hypergraph structure for the interpolation of point cloud data that contain no explicit structural information. We define the $\varepsilon_n$-ball hypergraph and the $k_n$-nearest neighbor hypergraph on a point cloud and study the $p$-Laplacian regularization o… ▽ More As a generalization of graphs, hypergraphs are widely used to model higher-order relations in data. This paper explores the benefit of the hypergraph structure for the interpolation of point cloud data that contain no explicit structural information. We define the $\varepsilon_n$-ball hypergraph and the $k_n$-nearest neighbor hypergraph on a point cloud and study the $p$-Laplacian regularization on the hypergraphs. We prove the variational consistency between the hypergraph $p$-Laplacian regularization and the continuum $p$-Laplacian regularization in a semisupervised setting when the number of points $n$ goes to infinity while the number of labeled points remains fixed. A key improvement compared to the graph case is that the results rely on weaker assumptions on the upper bound of $\varepsilon_n$ and $k_n$. To solve the convex but non-differentiable large-scale optimization problem, we utilize the stochastic primal-dual hybrid gradient algorithm. Numerical experiments on data interpolation verify that the hypergraph $p$-Laplacian regularization outperforms the graph $p$-Laplacian regularization in preventing the development of spikes at the labeled points. △ Less

Submitted 2 May, 2024; originally announced May 2024.

Comments: 33 pages

MSC Class: 49J55; 35J20; 65N12

arXiv:2404.19689 [pdf, ps, other]

Continuum limit of $p$-biharmonic equations on graphs

Authors: Kehan Shi, Martin Burger

Abstract: This paper studies the $p$-biharmonic equation on graphs, which arises in point cloud processing and can be interpreted as a natural extension of the graph $p$-Laplacian from the perspective of hypergraph. The asymptotic behavior of the solution is investigated when the random geometric graph is considered and the number of data points goes to infinity. We show that the continuum limit is an appro… ▽ More This paper studies the $p$-biharmonic equation on graphs, which arises in point cloud processing and can be interpreted as a natural extension of the graph $p$-Laplacian from the perspective of hypergraph. The asymptotic behavior of the solution is investigated when the random geometric graph is considered and the number of data points goes to infinity. We show that the continuum limit is an appropriately weighted $p$-biharmonic equation with homogeneous Neumann boundary conditions. The result relies on the uniform $L^p$ estimates for solutions and gradients of nonlocal and graph Poisson equations. The $L^\infty$ estimates of solutions are also obtained as a byproduct. △ Less

Submitted 30 April, 2024; originally announced April 2024.

Comments: 20 pages

MSC Class: 35R02; 35J30; 65N12

arXiv:2404.14662 [pdf, other]

NExT: Teaching Large Language Models to Reason about Code Execution

Authors: Ansong Ni, Miltiadis Allamanis, Arman Cohan, Yinlin Deng, Kensen Shi, Charles Sutton, Pengcheng Yin

Abstract: A fundamental skill among human developers is the ability to understand and reason about program execution. As an example, a programmer can mentally simulate code execution in natural language to debug and repair code (aka. rubber duck debugging). However, large language models (LLMs) of code are typically trained on the surface textual form of programs, thus may lack a semantic understanding of h… ▽ More A fundamental skill among human developers is the ability to understand and reason about program execution. As an example, a programmer can mentally simulate code execution in natural language to debug and repair code (aka. rubber duck debugging). However, large language models (LLMs) of code are typically trained on the surface textual form of programs, thus may lack a semantic understanding of how programs execute at run-time. To address this issue, we propose NExT, a method to teach LLMs to inspect the execution traces of programs (variable states of executed lines) and reason about their run-time behavior through chain-of-thought (CoT) rationales. Specifically, NExT uses self-training to bootstrap a synthetic training set of execution-aware rationales that lead to correct task solutions (e.g., fixed programs) without laborious manual annotation. Experiments on program repair tasks based on MBPP and HumanEval demonstrate that NExT improves the fix rate of a PaLM 2 model, by 26.1% and 14.3% absolute, respectively, with significantly improved rationale quality as verified by automated metrics and human raters. Our model can also generalize to scenarios where program traces are absent at test-time. △ Less

Submitted 22 April, 2024; originally announced April 2024.

Comments: 35 pages

arXiv:2404.06851 [pdf, other]

UDiFF: Generating Conditional Unsigned Distance Fields with Optimal Wavelet Diffusion

Authors: Junsheng Zhou, Weiqi Zhang, Baorui Ma, Kanle Shi, Yu-Shen Liu, Zhizhong Han

Abstract: Diffusion models have shown remarkable results for image generation, editing and inpainting. Recent works explore diffusion models for 3D shape generation with neural implicit functions, i.e., signed distance function and occupancy function. However, they are limited to shapes with closed surfaces, which prevents them from generating diverse 3D real-world contents containing open surfaces. In this… ▽ More Diffusion models have shown remarkable results for image generation, editing and inpainting. Recent works explore diffusion models for 3D shape generation with neural implicit functions, i.e., signed distance function and occupancy function. However, they are limited to shapes with closed surfaces, which prevents them from generating diverse 3D real-world contents containing open surfaces. In this work, we present UDiFF, a 3D diffusion model for unsigned distance fields (UDFs) which is capable to generate textured 3D shapes with open surfaces from text conditions or unconditionally. Our key idea is to generate UDFs in spatial-frequency domain with an optimal wavelet transformation, which produces a compact representation space for UDF generation. Specifically, instead of selecting an appropriate wavelet transformation which requires expensive manual efforts and still leads to large information loss, we propose a data-driven approach to learn the optimal wavelet transformation for UDFs. We evaluate UDiFF to show our advantages by numerical and visual comparisons with the latest methods on widely used benchmarks. Page: https://weiqi-zhang.github.io/UDiFF. △ Less

Submitted 10 April, 2024; originally announced April 2024.

Comments: To appear at CVPR2024. Project page: https://weiqi-zhang.github.io/UDiFF

arXiv:2403.19276 [pdf, ps, other]

Enhanced Bayesian Personalized Ranking for Robust Hard Negative Sampling in Recommender Systems

Authors: Kexin Shi, Jing Zhang, Linjiajie Fang, Wenjia Wang, Bingyi Jing

Abstract: In implicit collaborative filtering, hard negative mining techniques are developed to accelerate and enhance the recommendation model learning. However, the inadvertent selection of false negatives remains a major concern in hard negative sampling, as these false negatives can provide incorrect information and mislead the model learning. To date, only a small number of studies have been committed… ▽ More In implicit collaborative filtering, hard negative mining techniques are developed to accelerate and enhance the recommendation model learning. However, the inadvertent selection of false negatives remains a major concern in hard negative sampling, as these false negatives can provide incorrect information and mislead the model learning. To date, only a small number of studies have been committed to solve the false negative problem, primarily focusing on designing sophisticated sampling algorithms to filter false negatives. In contrast, this paper shifts its focus to refining the loss function. We find that the original Bayesian Personalized Ranking (BPR), initially designed for uniform negative sampling, is inadequate in adapting to hard sampling scenarios. Hence, we introduce an enhanced Bayesian Personalized Ranking objective, named as Hard-BPR, which is specifically crafted for dynamic hard negative sampling to mitigate the influence of false negatives. This method is simple yet efficient for real-world deployment. Extensive experiments conducted on three real-world datasets demonstrate the effectiveness and robustness of our approach, along with the enhanced ability to distinguish false negatives. △ Less

Submitted 28 March, 2024; originally announced March 2024.

Comments: 9 pages

arXiv:2403.03346 [pdf, other]

Enhancing Vision-Language Pre-training with Rich Supervisions

Authors: Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto

Abstract: We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering. Using web screenshots unlocks a treasure trove of visual and textual cues that are not present in using image-text pairs. In S4, we leverage the inherent tree-structured hierarchy of HTML elements and the spatial localiza… ▽ More We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering. Using web screenshots unlocks a treasure trove of visual and textual cues that are not present in using image-text pairs. In S4, we leverage the inherent tree-structured hierarchy of HTML elements and the spatial localization to carefully design 10 pre-training tasks with large scale annotated data. These tasks resemble downstream tasks across different domains and the annotations are cheap to obtain. We demonstrate that, compared to current screenshot pre-training objectives, our innovative pre-training method significantly enhances performance of image-to-text model in nine varied and popular downstream tasks - up to 76.1% improvements on Table Detection, and at least 1% on Widget Captioning. △ Less

Submitted 5 March, 2024; originally announced March 2024.

Comments: Accepted to CVPR 2024

arXiv:2403.02249 [pdf, other]

Non-autoregressive Sequence-to-Sequence Vision-Language Models

Authors: Kunyu Shi, Qi Dong, Luis Goncalves, Zhuowen Tu, Stefano Soatto

Abstract: Sequence-to-sequence vision-language models are showing promise, but their applicability is limited by their inference latency due to their autoregressive way of generating predictions. We propose a parallel decoding sequence-to-sequence vision-language model, trained with a Query-CTC loss, that marginalizes over multiple inference paths in the decoder. This allows us to model the joint distributi… ▽ More Sequence-to-sequence vision-language models are showing promise, but their applicability is limited by their inference latency due to their autoregressive way of generating predictions. We propose a parallel decoding sequence-to-sequence vision-language model, trained with a Query-CTC loss, that marginalizes over multiple inference paths in the decoder. This allows us to model the joint distribution of tokens, rather than restricting to conditional distribution as in an autoregressive model. The resulting model, NARVL, achieves performance on-par with its state-of-the-art autoregressive counterpart, but is faster at inference time, reducing from the linear complexity associated with the sequential generation of tokens to a paradigm of constant time joint inference. △ Less

Submitted 4 March, 2024; originally announced March 2024.

Comments: Accepted to CVPR 2024

arXiv:2402.15134 [pdf, other]

Deep Coupling Network For Multivariate Time Series Forecasting

Authors: Kun Yi, Qi Zhang, Hui He, Kaize Shi, Liang Hu, Ning An, Zhendong Niu

Abstract: Multivariate time series (MTS) forecasting is crucial in many real-world applications. To achieve accurate MTS forecasting, it is essential to simultaneously consider both intra- and inter-series relationships among time series data. However, previous work has typically modeled intra- and inter-series relationships separately and has disregarded multi-order interactions present within and between… ▽ More Multivariate time series (MTS) forecasting is crucial in many real-world applications. To achieve accurate MTS forecasting, it is essential to simultaneously consider both intra- and inter-series relationships among time series data. However, previous work has typically modeled intra- and inter-series relationships separately and has disregarded multi-order interactions present within and between time series data, which can seriously degrade forecasting accuracy. In this paper, we reexamine intra- and inter-series relationships from the perspective of mutual information and accordingly construct a comprehensive relationship learning mechanism tailored to simultaneously capture the intricate multi-order intra- and inter-series couplings. Based on the mechanism, we propose a novel deep coupling network for MTS forecasting, named DeepCN, which consists of a coupling mechanism dedicated to explicitly exploring the multi-order intra- and inter-series relationships among time series data concurrently, a coupled variable representation module aimed at encoding diverse variable patterns, and an inference module facilitating predictions through one forward step. Extensive experiments conducted on seven real-world datasets demonstrate that our proposed DeepCN achieves superior performance compared with the state-of-the-art baselines. △ Less

Submitted 23 February, 2024; originally announced February 2024.

arXiv:2402.11558 [pdf, other]

A Temporally Disentangled Contrastive Diffusion Model for Spatiotemporal Imputation

Authors: Yakun Chen, Kaize Shi, Zhangkai Wu, Juan Chen, Xianzhi Wang, Julian McAuley, Guandong Xu, Shui Yu

Abstract: Spatiotemporal data analysis is pivotal across various domains, such as transportation, meteorology, and healthcare. The data collected in real-world scenarios are often incomplete due to device malfunctions and network errors. Spatiotemporal imputation aims to predict missing values by exploiting the spatial and temporal dependencies in the observed data. Traditional imputation approaches based o… ▽ More Spatiotemporal data analysis is pivotal across various domains, such as transportation, meteorology, and healthcare. The data collected in real-world scenarios are often incomplete due to device malfunctions and network errors. Spatiotemporal imputation aims to predict missing values by exploiting the spatial and temporal dependencies in the observed data. Traditional imputation approaches based on statistical and machine learning techniques require the data to conform to their distributional assumptions, while graph and recurrent neural networks are prone to error accumulation problems due to their recurrent structures. Generative models, especially diffusion models, can potentially circumvent the reliance on inaccurate, previously imputed values for future predictions; However, diffusion models still face challenges in generating stable results. We propose to address these challenges by designing conditional information to guide the generative process and expedite the training process. We introduce a conditional diffusion framework called C$^2$TSD, which incorporates disentangled temporal (trend and seasonality) representations as conditional information and employs contrastive learning to improve generalizability. Our extensive experiments on three real-world datasets demonstrate the superior performance of our approach compared to a number of state-of-the-art baselines. △ Less

Submitted 22 March, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

arXiv:2402.08073 [pdf, other]

Grounding Data Science Code Generation with Input-Output Specifications

Authors: Yeming Wen, Pengcheng Yin, Kensen Shi, Henryk Michalewski, Swarat Chaudhuri, Alex Polozov

Abstract: Large language models (LLMs) have recently demonstrated a remarkable ability to generate code from natural language (NL) prompts. However, in the real world, NL is often too ambiguous to capture the true intent behind programming problems, requiring additional input-output (I/O) specifications. Unfortunately, LLMs can have difficulty aligning their outputs with both the NL prompt and the I/O speci… ▽ More Large language models (LLMs) have recently demonstrated a remarkable ability to generate code from natural language (NL) prompts. However, in the real world, NL is often too ambiguous to capture the true intent behind programming problems, requiring additional input-output (I/O) specifications. Unfortunately, LLMs can have difficulty aligning their outputs with both the NL prompt and the I/O specification. In this paper, we give a way to mitigate this issue in the context of data science programming, where tasks require explicit I/O specifications for clarity. Specifically, we propose GIFT4Code, a novel approach for the instruction fine-tuning of LLMs with respect to I/O specifications. Our method leverages synthetic data produced by the LLM itself and utilizes execution-derived feedback as a key learning signal. This feedback, in the form of program I/O specifications, is provided to the LLM to facilitate instruction fine-tuning. We evaluated our approach on two challenging data science benchmarks, Arcade and DS-1000. The results demonstrate a significant improvement in the LLM's ability to generate code that is not only executable but also accurately aligned with user specifications, substantially improving the quality of code generation for complex data science tasks. △ Less

Submitted 14 March, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

arXiv:2401.08224 [pdf, other]

Privacy Preserving Adaptive Experiment Design

Authors: Jiachun Li, Kaining Shi, David Simchi-Levi

Abstract: Adaptive experiment is widely adopted to estimate conditional average treatment effect (CATE) in clinical trials and many other scenarios. While the primary goal in experiment is to maximize estimation accuracy, due to the imperative of social welfare, it's also crucial to provide treatment with superior outcomes to patients, which is measured by regret in contextual bandit framework. These two ob… ▽ More Adaptive experiment is widely adopted to estimate conditional average treatment effect (CATE) in clinical trials and many other scenarios. While the primary goal in experiment is to maximize estimation accuracy, due to the imperative of social welfare, it's also crucial to provide treatment with superior outcomes to patients, which is measured by regret in contextual bandit framework. These two objectives often lead to contrast optimal allocation mechanism. Furthermore, privacy concerns arise in clinical scenarios containing sensitive data like patients health records. Therefore, it's essential for the treatment allocation mechanism to incorporate robust privacy protection measures. In this paper, we investigate the tradeoff between loss of social welfare and statistical power in contextual bandit experiment. We propose a matched upper and lower bound for the multi-objective optimization problem, and then adopt the concept of Pareto optimality to mathematically characterize the optimality condition. Furthermore, we propose differentially private algorithms which still matches the lower bound, showing that privacy is "almost free". Additionally, we derive the asymptotic normality of the estimator, which is essential in statistical inference and hypothesis testing. △ Less

Submitted 5 February, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

Comments: Add a table

arXiv:2401.07402 [pdf, other]

Improved Implicit Neural Representation with Fourier Bases Reparameterized Training

Authors: Kexuan Shi, Xingyu Zhou, Shuhang Gu

Abstract: Implicit Neural Representation (INR) as a mighty representation paradigm has achieved success in various computer vision tasks recently. Due to the low-frequency bias issue of vanilla multi-layer perceptron (MLP), existing methods have investigated advanced techniques, such as positional encoding and periodic activation function, to improve the accuracy of INR. In this paper, we connect the networ… ▽ More Implicit Neural Representation (INR) as a mighty representation paradigm has achieved success in various computer vision tasks recently. Due to the low-frequency bias issue of vanilla multi-layer perceptron (MLP), existing methods have investigated advanced techniques, such as positional encoding and periodic activation function, to improve the accuracy of INR. In this paper, we connect the network training bias with the reparameterization technique and theoretically prove that weight reparameterization could provide us a chance to alleviate the spectral bias of MLP. Based on our theoretical analysis, we propose a Fourier reparameterization method which learns coefficient matrix of fixed Fourier bases to compose the weights of MLP. We evaluate the proposed Fourier reparameterization method on different INR tasks with various MLP architectures, including vanilla MLP, MLP with positional encoding and MLP with advanced activation function, etc. The superiority approximation results on different MLP architectures clearly validate the advantage of our proposed method. Armed with our Fourier reparameterization method, better INR with more textures and less artifacts can be learned from the training data. △ Less

Submitted 5 February, 2024; v1 submitted 14 January, 2024; originally announced January 2024.

arXiv:2401.06827 [pdf, other]

APLe: Token-Wise Adaptive for Multi-Modal Prompt Learning

Authors: Guiming Cao, Kaize Shi, Hong Fu, Huaiwen Zhang, Guandong Xu

Abstract: Pre-trained Vision-Language (V-L) models set the benchmark for generalization to downstream tasks among the noteworthy contenders. Many characteristics of the V-L model have been explored in existing research including the challenge of the sensitivity to text input and the tuning process across multi-modal prompts. With the advanced utilization of the V-L model like CLIP, recent approaches deploy… ▽ More Pre-trained Vision-Language (V-L) models set the benchmark for generalization to downstream tasks among the noteworthy contenders. Many characteristics of the V-L model have been explored in existing research including the challenge of the sensitivity to text input and the tuning process across multi-modal prompts. With the advanced utilization of the V-L model like CLIP, recent approaches deploy learnable prompts instead of hand-craft prompts to boost the generalization performance and address the aforementioned challenges. Inspired by layer-wise training, which is wildly used in image fusion, we note that using a sequential training process to adapt different modalities branches of CLIP efficiently facilitates the improvement of generalization. In the context of addressing the multi-modal prompting challenge, we propose Token-wise Adaptive for Multi-modal Prompt Learning (APLe) for tuning both modalities prompts, vision and language, as tokens in a sequential manner. APLe addresses the challenges in V-L models to promote prompt learning across both modalities, which indicates a competitive generalization performance in line with the state-of-the-art. Preeminently, APLe shows robustness and favourable performance in prompt-length experiments with an absolute advantage in adopting the V-L models. △ Less

Submitted 23 January, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

Comments: 7 pages,3 figures

arXiv:2311.16592 [pdf, other]

RGBGrasp: Image-based Object Grasping by Capturing Multiple Views during Robot Arm Movement with Neural Radiance Fields

Authors: Chang Liu, Kejian Shi, Kaichen Zhou, Haoxiao Wang, Jiyao Zhang, Hao Dong

Abstract: Robotic research encounters a significant hurdle when it comes to the intricate task of grasping objects that come in various shapes, materials, and textures. Unlike many prior investigations that heavily leaned on specialized point-cloud cameras or abundant RGB visual data to gather 3D insights for object-grasping missions, this paper introduces a pioneering approach called RGBGrasp. This method… ▽ More Robotic research encounters a significant hurdle when it comes to the intricate task of grasping objects that come in various shapes, materials, and textures. Unlike many prior investigations that heavily leaned on specialized point-cloud cameras or abundant RGB visual data to gather 3D insights for object-grasping missions, this paper introduces a pioneering approach called RGBGrasp. This method depends on a limited set of RGB views to perceive the 3D surroundings containing transparent and specular objects and achieve accurate grasping. Our method utilizes pre-trained depth prediction models to establish geometry constraints, enabling precise 3D structure estimation, even under limited view conditions. Finally, we integrate hash encoding and a proposal sampler strategy to significantly accelerate the 3D reconstruction process. These innovations significantly enhance the adaptability and effectiveness of our algorithm in real-world scenarios. Through comprehensive experimental validations, we demonstrate that RGBGrasp achieves remarkable success across a wide spectrum of object-grasping scenarios, establishing it as a promising solution for real-world robotic manipulation tasks. The demonstrations of our method can be found on: https://sites.google.com/view/rgbgrasp △ Less

Submitted 14 March, 2024; v1 submitted 28 November, 2023; originally announced November 2023.

arXiv:2311.16542 [pdf, other]

Agents meet OKR: An Object and Key Results Driven Agent System with Hierarchical Self-Collaboration and Self-Evaluation

Authors: Yi Zheng, Chongyang Ma, Kanle Shi, Haibin Huang

Abstract: In this study, we introduce the concept of OKR-Agent designed to enhance the capabilities of Large Language Models (LLMs) in task-solving. Our approach utilizes both self-collaboration and self-correction mechanism, facilitated by hierarchical agents, to address the inherent complexities in task-solving. Our key observations are two-fold: first, effective task-solving demands in-depth domain knowl… ▽ More In this study, we introduce the concept of OKR-Agent designed to enhance the capabilities of Large Language Models (LLMs) in task-solving. Our approach utilizes both self-collaboration and self-correction mechanism, facilitated by hierarchical agents, to address the inherent complexities in task-solving. Our key observations are two-fold: first, effective task-solving demands in-depth domain knowledge and intricate reasoning, for which deploying specialized agents for individual sub-tasks can markedly enhance LLM performance. Second, task-solving intrinsically adheres to a hierarchical execution structure, comprising both high-level strategic planning and detailed task execution. Towards this end, our OKR-Agent paradigm aligns closely with this hierarchical structure, promising enhanced efficacy and adaptability across a range of scenarios. Specifically, our framework includes two novel modules: hierarchical Objects and Key Results generation and multi-level evaluation, each contributing to more efficient and robust task-solving. In practical, hierarchical OKR generation decomposes Objects into multiple sub-Objects and assigns new agents based on key results and agent responsibilities. These agents subsequently elaborate on their designated tasks and may further decompose them as necessary. Such generation operates recursively and hierarchically, culminating in a comprehensive set of detailed solutions. The multi-level evaluation module of OKR-Agent refines solution by leveraging feedback from all associated agents, optimizing each step of the process. This ensures solution is accurate, practical, and effectively address intricate task requirements, enhancing the overall reliability and quality of the outcome. Experimental results also show our method outperforms the previous methods on several tasks. Code and demo are available at https://okr-agent.github.io/ △ Less

Submitted 28 November, 2023; originally announced November 2023.

arXiv:2311.00389 [pdf, other]

NeuralGF: Unsupervised Point Normal Estimation by Learning Neural Gradient Function

Authors: Qing Li, Huifang Feng, Kanle Shi, Yue Gao, Yi Fang, Yu-Shen Liu, Zhizhong Han

Abstract: Normal estimation for 3D point clouds is a fundamental task in 3D geometry processing. The state-of-the-art methods rely on priors of fitting local surfaces learned from normal supervision. However, normal supervision in benchmarks comes from synthetic shapes and is usually not available from real scans, thereby limiting the learned priors of these methods. In addition, normal orientation consiste… ▽ More Normal estimation for 3D point clouds is a fundamental task in 3D geometry processing. The state-of-the-art methods rely on priors of fitting local surfaces learned from normal supervision. However, normal supervision in benchmarks comes from synthetic shapes and is usually not available from real scans, thereby limiting the learned priors of these methods. In addition, normal orientation consistency across shapes remains difficult to achieve without a separate post-processing procedure. To resolve these issues, we propose a novel method for estimating oriented normals directly from point clouds without using ground truth normals as supervision. We achieve this by introducing a new paradigm for learning neural gradient functions, which encourages the neural network to fit the input point clouds and yield unit-norm gradients at the points. Specifically, we introduce loss functions to facilitate query points to iteratively reach the moving targets and aggregate onto the approximated surface, thereby learning a global surface representation of the data. Meanwhile, we incorporate gradients into the surface approximation to measure the minimum signed deviation of queries, resulting in a consistent gradient field associated with the surface. These techniques lead to our deep unsupervised oriented normal estimator that is robust to noise, outliers and density variations. Our excellent results on widely used benchmarks demonstrate that our method can learn more accurate normals for both unoriented and oriented normal estimation tasks than the latest methods. The source code and pre-trained model are publicly available at https://github.com/LeoQLi/NeuralGF. △ Less

Submitted 1 November, 2023; originally announced November 2023.

Comments: Accepted by NeurIPS 2023

arXiv:2310.11191 [pdf, other]

Medical Text Simplification: Optimizing for Readability with Unlikelihood Training and Reranked Beam Search Decoding

Authors: Lorenzo Jaime Yu Flores, Heyuan Huang, Kejian Shi, Sophie Chheang, Arman Cohan

Abstract: Text simplification has emerged as an increasingly useful application of AI for bridging the communication gap in specialized fields such as medicine, where the lexicon is often dominated by technical jargon and complex constructs. Despite notable progress, methods in medical simplification sometimes result in the generated text having lower quality and diversity. In this work, we explore ways to… ▽ More Text simplification has emerged as an increasingly useful application of AI for bridging the communication gap in specialized fields such as medicine, where the lexicon is often dominated by technical jargon and complex constructs. Despite notable progress, methods in medical simplification sometimes result in the generated text having lower quality and diversity. In this work, we explore ways to further improve the readability of text simplification in the medical domain. We propose (1) a new unlikelihood loss that encourages generation of simpler terms and (2) a reranked beam search decoding method that optimizes for simplicity, which achieve better performance on readability metrics on three datasets. This study's findings offer promising avenues for improving text simplification in the medical field. △ Less

Submitted 25 October, 2023; v1 submitted 17 October, 2023; originally announced October 2023.

Comments: EMNLP 2023 Findings

arXiv:2310.08958 [pdf, other]

xDial-Eval: A Multilingual Open-Domain Dialogue Evaluation Benchmark

Authors: Chen Zhang, Luis Fernando D'Haro, Chengguang Tang, Ke Shi, Guohua Tang, Haizhou Li

Abstract: Recent advancements in reference-free learned metrics for open-domain dialogue evaluation have been driven by the progress in pre-trained language models and the availability of dialogue data with high-quality human annotations. However, current studies predominantly concentrate on English dialogues, and the generalization of these metrics to other languages has not been fully examined. This is la… ▽ More Recent advancements in reference-free learned metrics for open-domain dialogue evaluation have been driven by the progress in pre-trained language models and the availability of dialogue data with high-quality human annotations. However, current studies predominantly concentrate on English dialogues, and the generalization of these metrics to other languages has not been fully examined. This is largely due to the absence of a multilingual dialogue evaluation benchmark. To address the issue, we introduce xDial-Eval, built on top of open-source English dialogue evaluation datasets. xDial-Eval includes 12 turn-level and 6 dialogue-level English datasets, comprising 14930 annotated turns and 8691 annotated dialogues respectively. The English dialogue data are extended to nine other languages with commercial machine translation systems. On xDial-Eval, we conduct comprehensive analyses of previous BERT-based metrics and the recently-emerged large language models. Lastly, we establish strong self-supervised and multilingual baselines. In terms of average Pearson correlations over all datasets and languages, the best baseline outperforms OpenAI's ChatGPT by absolute improvements of 6.5% and 4.6% at the turn and dialogue levels respectively, albeit with much fewer parameters. The data and code are publicly available at https://github.com/e0397123/xDial-Eval. △ Less

Submitted 13 October, 2023; originally announced October 2023.

Comments: Accepted to EMNLP-2023 Findings

arXiv:2309.14341 [pdf, other]

Extreme Parkour with Legged Robots

Authors: Xuxin Cheng, Kexin Shi, Ananye Agarwal, Deepak Pathak

Abstract: Humans can perform parkour by traversing obstacles in a highly dynamic fashion requiring precise eye-muscle coordination and movement. Getting robots to do the same task requires overcoming similar challenges. Classically, this is done by independently engineering perception, actuation, and control systems to very low tolerances. This restricts them to tightly controlled settings such as a predete… ▽ More Humans can perform parkour by traversing obstacles in a highly dynamic fashion requiring precise eye-muscle coordination and movement. Getting robots to do the same task requires overcoming similar challenges. Classically, this is done by independently engineering perception, actuation, and control systems to very low tolerances. This restricts them to tightly controlled settings such as a predetermined obstacle course in labs. In contrast, humans are able to learn parkour through practice without significantly changing their underlying biology. In this paper, we take a similar approach to developing robot parkour on a small low-cost robot with imprecise actuation and a single front-facing depth camera for perception which is low-frequency, jittery, and prone to artifacts. We show how a single neural net policy operating directly from a camera image, trained in simulation with large-scale RL, can overcome imprecise sensing and actuation to output highly precise control behavior end-to-end. We show our robot can perform a high jump on obstacles 2x its height, long jump across gaps 2x its length, do a handstand and run across tilted ramps, and generalize to novel obstacle courses with different physical properties. Parkour videos at https://extreme-parkour.github.io/ △ Less

Submitted 25 September, 2023; originally announced September 2023.

Comments: Website and videos at https://extreme-parkour.github.io/

arXiv:2309.09211 [pdf, other]

Neural Gradient Learning and Optimization for Oriented Point Normal Estimation

Authors: Qing Li, Huifang Feng, Kanle Shi, Yi Fang, Yu-Shen Liu, Zhizhong Han

Abstract: We propose Neural Gradient Learning (NGL), a deep learning approach to learn gradient vectors with consistent orientation from 3D point clouds for normal estimation. It has excellent gradient approximation properties for the underlying geometry of the data. We utilize a simple neural network to parameterize the objective function to produce gradients at points using a global implicit representatio… ▽ More We propose Neural Gradient Learning (NGL), a deep learning approach to learn gradient vectors with consistent orientation from 3D point clouds for normal estimation. It has excellent gradient approximation properties for the underlying geometry of the data. We utilize a simple neural network to parameterize the objective function to produce gradients at points using a global implicit representation. However, the derived gradients usually drift away from the ground-truth oriented normals due to the lack of local detail descriptions. Therefore, we introduce Gradient Vector Optimization (GVO) to learn an angular distance field based on local plane geometry to refine the coarse gradient vectors. Finally, we formulate our method with a two-phase pipeline of coarse estimation followed by refinement. Moreover, we integrate two weighting functions, i.e., anisotropic kernel and inlier score, into the optimization to improve the robust and detail-preserving performance. Our method efficiently conducts global gradient approximation while achieving better accuracy and generalization ability of local feature description. This leads to a state-of-the-art normal estimator that is robust to noise, outliers and point density variations. Extensive evaluations show that our method outperforms previous works in both unoriented and oriented normal estimation on widely used benchmarks. The source code and pre-trained models are available at https://github.com/LeoQLi/NGLO. △ Less

Submitted 17 September, 2023; originally announced September 2023.

Comments: accepted by SIGGRAPH Asia 2023

arXiv:2309.08960 [pdf, other]

ODSum: New Benchmarks for Open Domain Multi-Document Summarization

Authors: Yijie Zhou, Kejian Shi, Wencai Zhang, Yixin Liu, Yilun Zhao, Arman Cohan

Abstract: Open-domain Multi-Document Summarization (ODMDS) is a critical tool for condensing vast arrays of documents into coherent, concise summaries. With a more inter-related document set, there does not necessarily exist a correct answer for the retrieval, making it hard to measure the retrieving performance. We propose a rule-based method to process query-based document summarization datasets into ODMD… ▽ More Open-domain Multi-Document Summarization (ODMDS) is a critical tool for condensing vast arrays of documents into coherent, concise summaries. With a more inter-related document set, there does not necessarily exist a correct answer for the retrieval, making it hard to measure the retrieving performance. We propose a rule-based method to process query-based document summarization datasets into ODMDS datasets. Based on this method, we introduce a novel dataset, ODSum, a sophisticated case with its document index interdependent and often interrelated. We tackle ODMDS with the \textit{retrieve-then-summarize} method, and the performance of a list of retrievers and summarizers is investigated. Through extensive experiments, we identify variances in evaluation metrics and provide insights into their reliability. We also found that LLMs suffer great performance loss from retrieving errors. We further experimented methods to improve the performance as well as investigate their robustness against imperfect retrieval. We will release our data and code at https://github.com/yale-nlp/ODSum. △ Less

Submitted 16 September, 2023; originally announced September 2023.

arXiv:2308.04913 [pdf, other]

LLaMA-E: Empowering E-commerce Authoring with Object-Interleaved Instruction Following

Authors: Kaize Shi, Xueyao Sun, Dingxian Wang, Yinlin Fu, Guandong Xu, Qing Li

Abstract: E-commerce authoring entails creating engaging, diverse, and targeted content to enhance preference elicitation and retrieval experience. While Large Language Models (LLMs) have revolutionized content generation, they often fall short in e-commerce applications due to their limited memorization of domain-specific features. This paper proposes LLaMA-E, the unified e-commerce authoring models that a… ▽ More E-commerce authoring entails creating engaging, diverse, and targeted content to enhance preference elicitation and retrieval experience. While Large Language Models (LLMs) have revolutionized content generation, they often fall short in e-commerce applications due to their limited memorization of domain-specific features. This paper proposes LLaMA-E, the unified e-commerce authoring models that address the contextual preferences of customers, sellers, and platforms, the essential objects in e-commerce operation. We design the instruction set derived from tasks of ads generation, query-enhanced product title rewriting, product classification, purchase intent speculation, and general e-commerce Q&A. The instruction formulation ensures the interleaved cover of the presented and required object features, allowing the alignment of base models to parameterise e-commerce knowledge comprehensively. The proposed LLaMA-E models achieve state-of-the-art evaluation performance and exhibit the advantage in zero-shot practical applications. To our knowledge, this is the first LLM tailored to empower authoring applications with comprehensive scenario understanding by integrating features focused on participated objects. △ Less

Submitted 10 June, 2024; v1 submitted 9 August, 2023; originally announced August 2023.

arXiv:2307.13883 [pdf, other]

ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis

Authors: Kensen Shi, Joey Hong, Yinlin Deng, Pengcheng Yin, Manzil Zaheer, Charles Sutton

Abstract: When writing programs, people have the ability to tackle a new complex task by decomposing it into smaller and more familiar subtasks. While it is difficult to measure whether neural program synthesis methods have similar capabilities, we can measure whether they compositionally generalize, that is, whether a model that has been trained on the simpler subtasks is subsequently able to solve more co… ▽ More When writing programs, people have the ability to tackle a new complex task by decomposing it into smaller and more familiar subtasks. While it is difficult to measure whether neural program synthesis methods have similar capabilities, we can measure whether they compositionally generalize, that is, whether a model that has been trained on the simpler subtasks is subsequently able to solve more complex tasks. In this paper, we characterize several different forms of compositional generalization that are desirable in program synthesis, forming a meta-benchmark which we use to create generalization tasks for two popular datasets, RobustFill and DeepCoder. We then propose ExeDec, a novel decomposition-based synthesis strategy that predicts execution subgoals to solve problems step-by-step informed by program execution at each step. When used with Transformer models trained from scratch, ExeDec has better synthesis performance and greatly improved compositional generalization ability compared to baselines. Finally, we use our benchmarks to demonstrate that LLMs struggle to compositionally generalize when asked to do programming-by-example in a few-shot setting, but an ExeDec-style prompting approach can improve the generalization ability and overall performance. △ Less

Submitted 6 May, 2024; v1 submitted 25 July, 2023; originally announced July 2023.

Comments: ICLR 2024

arXiv:2307.12187 [pdf, other]

Monadic Deep Learning

Authors: Bo Yang, Zhihao Zhang Kirisame Marisa, Kai Shi

Abstract: The Java and Scala community has built a very successful big data ecosystem. However, most of neural networks running on it are modeled in dynamically typed programming languages. These dynamically typed deep learning frameworks treat neural networks as differentiable expressions that contain many trainable variable, and perform automatic differentiation on those expressions when training them.… ▽ More The Java and Scala community has built a very successful big data ecosystem. However, most of neural networks running on it are modeled in dynamically typed programming languages. These dynamically typed deep learning frameworks treat neural networks as differentiable expressions that contain many trainable variable, and perform automatic differentiation on those expressions when training them. Until 2019, none of the learning frameworks in statically typed languages provided the expressive power of traditional frameworks. Their users are not able to use custom algorithms unless creating plenty of boilerplate code for hard-coded back-propagation. We solved this problem in DeepLearning.scala 2. Our contributions are: 1. We discovered a novel approach to perform automatic differentiation in reverse mode for statically typed functions that contain multiple trainable variable, and can interoperate freely with the metalanguage. 2. We designed a set of monads and monad transformers, which allow users to create monadic expressions that represent dynamic neural networks. 3. Along with these monads, we provide some applicative functors, to perform multiple calculations in parallel. With these features, users of DeepLearning.scala were able to create complex neural networks in an intuitive and concise way, and still maintain type safety. △ Less

Submitted 22 July, 2023; originally announced July 2023.

Comments: 27 pages, 7 figures, 3 tables

arXiv:2306.13729 [pdf, other]

doi 10.62056/a0qj89n4e

On the Two-sided Permutation Inversion Problem

Authors: Gorjan Alagic, Chen Bai, Alexander Poremba, Kaiyan Shi

Abstract: In the permutation inversion problem, the task is to find the preimage of some challenge value, given oracle access to the permutation. This is a fundamental problem in query complexity, and appears in many contexts, particularly cryptography. In this work, we examine the setting in which the oracle allows for quantum queries to both the forward and the inverse direction of the permutation -- exce… ▽ More In the permutation inversion problem, the task is to find the preimage of some challenge value, given oracle access to the permutation. This is a fundamental problem in query complexity, and appears in many contexts, particularly cryptography. In this work, we examine the setting in which the oracle allows for quantum queries to both the forward and the inverse direction of the permutation -- except that the challenge value cannot be submitted to the latter. Within that setting, we consider two options for the inversion algorithm: whether it can get quantum advice about the permutation, and whether it must produce the entire preimage (search) or only the first bit (decision). We prove several theorems connecting the hardness of the resulting variations of the inversion problem, and establish a number of lower bounds. Our results indicate that, perhaps surprisingly, the inversion problem does not become significantly easier when the adversary is granted oracle access to the inverse, provided it cannot query the challenge itself. △ Less

Submitted 21 April, 2024; v1 submitted 23 June, 2023; originally announced June 2023.

Comments: 32 pages. Published in Communications in Cryptology

Journal ref: IACR Communications in Cryptology, Vol. 1, no. 1, Apr 09, 2024

arXiv:2306.12794 [pdf, other]

Overview of Robust and Multilingual Automatic Evaluation Metrics for Open-Domain Dialogue Systems at DSTC 11 Track 4

Authors: Mario Rodríguez-Cantelar, Chen Zhang, Chengguang Tang, Ke Shi, Sarik Ghazarian, João Sedoc, Luis Fernando D'Haro, Alexander Rudnicky

Abstract: The advent and fast development of neural networks have revolutionized the research on dialogue systems and subsequently have triggered various challenges regarding their automatic evaluation. Automatic evaluation of open-domain dialogue systems as an open challenge has been the center of the attention of many researchers. Despite the consistent efforts to improve automatic metrics' correlations w… ▽ More The advent and fast development of neural networks have revolutionized the research on dialogue systems and subsequently have triggered various challenges regarding their automatic evaluation. Automatic evaluation of open-domain dialogue systems as an open challenge has been the center of the attention of many researchers. Despite the consistent efforts to improve automatic metrics' correlations with human evaluation, there have been very few attempts to assess their robustness over multiple domains and dimensions. Also, their focus is mainly on the English language. All of these challenges prompt the development of automatic evaluation metrics that are reliable in various domains, dimensions, and languages. This track in the 11th Dialogue System Technology Challenge (DSTC11) is part of the ongoing effort to promote robust and multilingual automatic evaluation metrics. This article describes the datasets and baselines provided to participants and discusses the submission and result details of the two proposed subtasks. △ Less

Submitted 13 September, 2023; v1 submitted 22 June, 2023; originally announced June 2023.

arXiv:2306.02049 [pdf, other]

LambdaBeam: Neural Program Search with Higher-Order Functions and Lambdas

Authors: Kensen Shi, Hanjun Dai, Wen-Ding Li, Kevin Ellis, Charles Sutton

Abstract: Search is an important technique in program synthesis that allows for adaptive strategies such as focusing on particular search directions based on execution results. Several prior works have demonstrated that neural models are effective at guiding program synthesis searches. However, a common drawback of those approaches is the inability to handle iterative loops, higher-order functions, or lambd… ▽ More Search is an important technique in program synthesis that allows for adaptive strategies such as focusing on particular search directions based on execution results. Several prior works have demonstrated that neural models are effective at guiding program synthesis searches. However, a common drawback of those approaches is the inability to handle iterative loops, higher-order functions, or lambda functions, thus limiting prior neural searches from synthesizing longer and more general programs. We address this gap by designing a search algorithm called LambdaBeam that can construct arbitrary lambda functions that compose operations within a given DSL. We create semantic vector representations of the execution behavior of the lambda functions and train a neural policy network to choose which lambdas to construct during search, and pass them as arguments to higher-order functions to perform looping computations. Our experiments show that LambdaBeam outperforms neural, symbolic, and LLM-based techniques in an integer list manipulation domain. △ Less

Submitted 28 October, 2023; v1 submitted 3 June, 2023; originally announced June 2023.

arXiv:2305.14239 [pdf, other]

On Learning to Summarize with Large Language Models as References

Authors: Yixin Liu, Kejian Shi, Katherine S He, Longtian Ye, Alexander R. Fabbri, Pengfei Liu, Dragomir Radev, Arman Cohan

Abstract: Recent studies have found that summaries generated by large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets. Therefore, we investigate a new learning setting of text summarization models that considers the LLMs as the reference or the gold-standard oracle on these datasets. To examine the standard practices that a… ▽ More Recent studies have found that summaries generated by large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets. Therefore, we investigate a new learning setting of text summarization models that considers the LLMs as the reference or the gold-standard oracle on these datasets. To examine the standard practices that are aligned with this new learning setting, we investigate two LLM-based summary quality evaluation methods for model training and adopt a contrastive learning training method to leverage the LLM-guided learning signals. Our experiments on the CNN/DailyMail and XSum datasets demonstrate that smaller summarization models can achieve similar performance as LLMs under LLM-based evaluation. However, we found that the smaller models can not yet reach LLM-level performance under human evaluation despite promising improvements brought by our proposed training methods. Meanwhile, we perform a meta-analysis on this new learning setting that reveals a discrepancy between human and LLM-based evaluation, highlighting the benefits and risks of this LLM-as-reference setting we investigated. △ Less

Submitted 16 November, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

Comments: GitHub Repo: https://github.com/yixinL7/SumLLM

arXiv:2305.10569 [pdf, other]

Self-Supervised Learning for Physiologically-Based Pharmacokinetic Modeling in Dynamic PET

Authors: Francesca De Benetti, Walter Simson, Magdalini Paschali, Hasan Sari, Axel Romiger, Kuangyu Shi, Nassir Navab, Thomas Wendler

Abstract: Dynamic positron emission tomography imaging (dPET) provides temporally resolved images of a tracer enabling a quantitative measure of physiological processes. Voxel-wise physiologically-based pharmacokinetic (PBPK) modeling of the time activity curves (TAC) can provide relevant diagnostic information for clinical workflow. Conventional fitting strategies for TACs are slow and ignore the spatial r… ▽ More Dynamic positron emission tomography imaging (dPET) provides temporally resolved images of a tracer enabling a quantitative measure of physiological processes. Voxel-wise physiologically-based pharmacokinetic (PBPK) modeling of the time activity curves (TAC) can provide relevant diagnostic information for clinical workflow. Conventional fitting strategies for TACs are slow and ignore the spatial relation between neighboring voxels. We train a spatio-temporal UNet to estimate the kinetic parameters given TAC from F-18-fluorodeoxyglucose (FDG) dPET. This work introduces a self-supervised loss formulation to enforce the similarity between the measured TAC and those generated with the learned kinetic parameters. Our method provides quantitatively comparable results at organ-level to the significantly slower conventional approaches, while generating pixel-wise parametric images which are consistent with expected physiology. To the best of our knowledge, this is the first self-supervised network that allows voxel-wise computation of kinetic parameters consistent with a non-linear kinetic model. The code will become publicly available upon acceptance. △ Less

Submitted 17 May, 2023; originally announced May 2023.

arXiv:2305.07019 [pdf, other]

Musketeer: Joint Training for Multi-task Vision Language Model with Task Explanation Prompts

Authors: Zhaoyang Zhang, Yantao Shen, Kunyu Shi, Zhaowei Cai, Jun Fang, Siqi Deng, Hao Yang, Davide Modolo, Zhuowen Tu, Stefano Soatto

Abstract: We present a vision-language model whose parameters are jointly trained on all tasks and fully shared among multiple heterogeneous tasks which may interfere with each other, resulting in a single model which we named Musketeer. The integration of knowledge across heterogeneous tasks is enabled by a novel feature called Task Explanation Prompt (TEP). With rich and structured information such as tas… ▽ More We present a vision-language model whose parameters are jointly trained on all tasks and fully shared among multiple heterogeneous tasks which may interfere with each other, resulting in a single model which we named Musketeer. The integration of knowledge across heterogeneous tasks is enabled by a novel feature called Task Explanation Prompt (TEP). With rich and structured information such as task input/output format, TEP reduces interference among tasks, allowing the model to focus on their shared structure. With a single model, Musketeer achieves results comparable to or better than strong baselines trained on single tasks, almost uniformly across multiple tasks. △ Less

Submitted 14 March, 2024; v1 submitted 11 May, 2023; originally announced May 2023.

arXiv:2305.05873 [pdf, other]

SHS-Net: Learning Signed Hyper Surfaces for Oriented Normal Estimation of Point Clouds

Authors: Qing Li, Huifang Feng, Kanle Shi, Yue Gao, Yi Fang, Yu-Shen Liu, Zhizhong Han

Abstract: We propose a novel method called SHS-Net for oriented normal estimation of point clouds by learning signed hyper surfaces, which can accurately predict normals with global consistent orientation from various point clouds. Almost all existing methods estimate oriented normals through a two-stage pipeline, i.e., unoriented normal estimation and normal orientation, and each step is implemented by a s… ▽ More We propose a novel method called SHS-Net for oriented normal estimation of point clouds by learning signed hyper surfaces, which can accurately predict normals with global consistent orientation from various point clouds. Almost all existing methods estimate oriented normals through a two-stage pipeline, i.e., unoriented normal estimation and normal orientation, and each step is implemented by a separate algorithm. However, previous methods are sensitive to parameter settings, resulting in poor results from point clouds with noise, density variations and complex geometries. In this work, we introduce signed hyper surfaces (SHS), which are parameterized by multi-layer perceptron (MLP) layers, to learn to estimate oriented normals from point clouds in an end-to-end manner. The signed hyper surfaces are implicitly learned in a high-dimensional feature space where the local and global information is aggregated. Specifically, we introduce a patch encoding module and a shape encoding module to encode a 3D point cloud into a local latent code and a global latent code, respectively. Then, an attention-weighted normal prediction module is proposed as a decoder, which takes the local and global latent codes as input to predict oriented normals. Experimental results show that our SHS-Net outperforms the state-of-the-art methods in both unoriented and oriented normal estimation on the widely used benchmarks. The code, data and pretrained models are publicly available. △ Less

Submitted 9 May, 2023; originally announced May 2023.

Comments: Accepted by CVPR 2023

arXiv:2304.12035 [pdf, other]

GRIG: Few-Shot Generative Residual Image Inpainting

Authors: Wanglong Lu, Xianta Jiang, Xiaogang Jin, Yong-Liang Yang, Minglun Gong, Tao Wang, Kaijie Shi, Hanli Zhao

Abstract: Image inpainting is the task of filling in missing or masked region of an image with semantically meaningful contents. Recent methods have shown significant improvement in dealing with large-scale missing regions. However, these methods usually require large training datasets to achieve satisfactory results and there has been limited research into training these models on a small number of samples… ▽ More Image inpainting is the task of filling in missing or masked region of an image with semantically meaningful contents. Recent methods have shown significant improvement in dealing with large-scale missing regions. However, these methods usually require large training datasets to achieve satisfactory results and there has been limited research into training these models on a small number of samples. To address this, we present a novel few-shot generative residual image inpainting method that produces high-quality inpainting results. The core idea is to propose an iterative residual reasoning method that incorporates Convolutional Neural Networks (CNNs) for feature extraction and Transformers for global reasoning within generative adversarial networks, along with image-level and patch-level discriminators. We also propose a novel forgery-patch adversarial training strategy to create faithful textures and detailed appearances. Extensive evaluations show that our method outperforms previous methods on the few-shot image inpainting task, both quantitatively and qualitatively. △ Less

Submitted 24 April, 2023; originally announced April 2023.

Comments: There are 12 pages and 10 figures in this paper

ACM Class: I.4.4; I.4.5; I.4.9

arXiv:2304.00570 [pdf, other]

FedFTN: Personalized Federated Learning with Deep Feature Transformation Network for Multi-institutional Low-count PET Denoising

Authors: Bo Zhou, Huidong Xie, Qiong Liu, Xiongchao Chen, Xueqi Guo, Zhicheng Feng, Jun Hou, S. Kevin Zhou, Biao Li, Axel Rominger, Kuangyu Shi, James S. Duncan, Chi Liu

Abstract: Low-count PET is an efficient way to reduce radiation exposure and acquisition time, but the reconstructed images often suffer from low signal-to-noise ratio (SNR), thus affecting diagnosis and other downstream tasks. Recent advances in deep learning have shown great potential in improving low-count PET image quality, but acquiring a large, centralized, and diverse dataset from multiple institutio… ▽ More Low-count PET is an efficient way to reduce radiation exposure and acquisition time, but the reconstructed images often suffer from low signal-to-noise ratio (SNR), thus affecting diagnosis and other downstream tasks. Recent advances in deep learning have shown great potential in improving low-count PET image quality, but acquiring a large, centralized, and diverse dataset from multiple institutions for training a robust model is difficult due to privacy and security concerns of patient data. Moreover, low-count PET data at different institutions may have different data distribution, thus requiring personalized models. While previous federated learning (FL) algorithms enable multi-institution collaborative training without the need of aggregating local data, addressing the large domain shift in the application of multi-institutional low-count PET denoising remains a challenge and is still highly under-explored. In this work, we propose FedFTN, a personalized federated learning strategy that addresses these challenges. FedFTN uses a local deep feature transformation network (FTN) to modulate the feature outputs of a globally shared denoising network, enabling personalized low-count PET denoising for each institution. During the federated learning process, only the denoising network's weights are communicated and aggregated, while the FTN remains at the local institutions for feature transformation. We evaluated our method using a large-scale dataset of multi-institutional low-count PET imaging data from three medical centers located across three continents, and showed that FedFTN provides high-quality low-count PET images, outperforming previous baseline FL reconstruction methods across all low-count levels at all three institutions. △ Less

Submitted 6 October, 2023; v1 submitted 2 April, 2023; originally announced April 2023.

Comments: 13 pages, 6 figures, Accepted at Medical Image Analysis Journal (MedIA)

arXiv:2302.08582 [pdf, other]

Pretraining Language Models with Human Preferences

Authors: Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Bhalerao, Christopher L. Buckley, Jason Phang, Samuel R. Bowman, Ethan Perez

Abstract: Language models (LMs) are pretrained to imitate internet text, including content that would violate human preferences if generated by an LM: falsehoods, offensive comments, personally identifiable information, low-quality or buggy code, and more. Here, we explore alternative objectives for pretraining LMs in a way that also guides them to generate text aligned with human preferences. We benchmark… ▽ More Language models (LMs) are pretrained to imitate internet text, including content that would violate human preferences if generated by an LM: falsehoods, offensive comments, personally identifiable information, low-quality or buggy code, and more. Here, we explore alternative objectives for pretraining LMs in a way that also guides them to generate text aligned with human preferences. We benchmark five objectives for pretraining with human feedback across three tasks and study how they affect the trade-off between alignment and capabilities of pretrained LMs. We find a Pareto-optimal and simple approach among those we explored: conditional training, or learning distribution over tokens conditional on their human preference scores given by a reward model. Conditional training reduces the rate of undesirable content by up to an order of magnitude, both when generating without a prompt and with an adversarially-chosen prompt. Moreover, conditional training maintains the downstream task performance of standard LM pretraining, both before and after task-specific finetuning. Pretraining with human feedback results in much better preference satisfaction than standard LM pretraining followed by finetuning with feedback, i.e., learning and then unlearning undesirable behavior. Our results suggest that we should move beyond imitation learning when pretraining LMs and incorporate human preferences from the start of training. △ Less

Submitted 14 June, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

Comments: ICML 2023

arXiv:2302.04260 [pdf, other]

The Test of Tests: A Framework For Differentially Private Hypothesis Testing

Authors: Zeki Kazan, Kaiyan Shi, Adam Groce, Andrew Bray

Abstract: We present a generic framework for creating differentially private versions of any hypothesis test in a black-box way. We analyze the resulting tests analytically and experimentally. Most crucially, we show good practical performance for small data sets, showing that at epsilon = 1 we only need 5-6 times as much data as in the fully public setting. We compare our work to the one existing framework… ▽ More We present a generic framework for creating differentially private versions of any hypothesis test in a black-box way. We analyze the resulting tests analytically and experimentally. Most crucially, we show good practical performance for small data sets, showing that at epsilon = 1 we only need 5-6 times as much data as in the fully public setting. We compare our work to the one existing framework of this type, as well as to several individually-designed private hypothesis tests. Our framework is higher power than other generic solutions and at least competitive with (and often better than) individually-designed tests. △ Less

Submitted 8 February, 2023; originally announced February 2023.

Comments: The main text is 14 pages and 4 figures. Appendices are 10 pages and 12 figures

arXiv:2301.07527 [pdf, other]

Evaluating Permissioned Blockchain Using Stochastic Modeling and Chaos Engineering

Authors: Shiv Sondhi, Sherif Saad, Kevin Shi, Mohammad Mamun, Issa Traore

Abstract: Blockchain and distributed ledger technologies rely on distributed consensus algorithms. In recent years many consensus algorithms and protocols have been proposed; most of them are for permissioned blockchain networks. However, the performance of these algorithms is not well understood. This paper introduces an approach to evaluating consensus algorithms and blockchain platforms in a hostile netw… ▽ More Blockchain and distributed ledger technologies rely on distributed consensus algorithms. In recent years many consensus algorithms and protocols have been proposed; most of them are for permissioned blockchain networks. However, the performance of these algorithms is not well understood. This paper introduces an approach to evaluating consensus algorithms and blockchain platforms in a hostile network environment with the presence of byzantine and other network failures. The approach starts by using stochastic modeling to model the behaviors of consensus algorithms under different typical and faulty operational scenarios. Next, we implemented a blockchain application using different consensus protocols and tested their performance using chaos engineering techniques. To demonstrate our generic evaluation approach, we analyze the performance of four permissioned blockchain platforms and their consensus protocols. Our results showed that stochastic modeling is an inexpensive and efficient technique for analyzing consensus protocols. But they do not represent the actual performance of the consensus protocols in a production environment. Moreover, an experiment with chaos engineering indicates that if two different blockchain platforms use the same blockchain algorithm or protocol, we should not assume they will have similar performance. Therefore, It is also essential to consider the role of platform architecture and how the protocols are engineered in a given platform. △ Less

Submitted 14 January, 2023; originally announced January 2023.

Comments: 21. arXiv admin note: text overlap with arXiv:2108.08441

arXiv:2212.09248 [pdf, other]

Natural Language to Code Generation in Interactive Data Science Notebooks

Authors: Pengcheng Yin, Wen-Ding Li, Kefan Xiao, Abhishek Rao, Yeming Wen, Kensen Shi, Joshua Howland, Paige Bailey, Michele Catasta, Henryk Michalewski, Alex Polozov, Charles Sutton

Abstract: Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1082 code generation problems using… ▽ More Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1082 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems from the same notebook. It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies to elicit better code with step-by-step decomposition and NL explanation, showing the potential to improve the diversity and explainability of model predictions. △ Less

Submitted 19 December, 2022; originally announced December 2022.

Comments: 46 pages. 32 figures

arXiv:2211.13912 [pdf, ps, other]

Enhancing Recommender Systems: A Strategy to Mitigate False Negative Impact

Authors: Kexin Shi, Yun Zhang, Bingyi Jing, Wenjia Wang

Abstract: In implicit collaborative filtering (CF) task of recommender systems, recent works mainly focus on model structure design with promising techniques like graph neural networks (GNNs). Effective and efficient negative sampling methods that suit these models, however, remain underdeveloped. One challenge is that existing hard negative samplers tend to suffer from severer over-fitting in model trainin… ▽ More In implicit collaborative filtering (CF) task of recommender systems, recent works mainly focus on model structure design with promising techniques like graph neural networks (GNNs). Effective and efficient negative sampling methods that suit these models, however, remain underdeveloped. One challenge is that existing hard negative samplers tend to suffer from severer over-fitting in model training. In this work, we first study the reason behind the over-fitting, and illustrate it with the incorrect selection of false negative instances with the support of experiments. In addition, we empirically observe a counter-intuitive phenomenon, that is, polluting hard negative samples' embeddings with a quite large proportional of positive samples' embeddings will lead to remarkable performance gains for prediction accuracy. On top of this finding, we present a novel negative sampling strategy, i.e., positive-dominated negative synthesizing (PDNS). Moreover, we provide theoretical analysis and derive a simple equivalent algorithm of PDNS, where only a soft factor is added in the loss function. Comprehensive experiments on three real-world datasets demonstrate the superiority of our proposed method in terms of both effectiveness and robustness. △ Less

Submitted 28 March, 2024; v1 submitted 25 November, 2022; originally announced November 2022.

Comments: 9 pages, 16 figures

arXiv:2211.06598 [pdf, ps, other]

Enhancing Resource Utilization of Non-terrestrial Networks Using Temporal Graph-based Deterministic Routing

Authors: Keyi Shi, Jingchao Wang, Hongyan Li, Kan Wang

Abstract: Deterministic routing has emerged as a promising technology for future non-terrestrial networks (NTNs), offering the potential to enhance service performance and optimize resource utilization. However, the dynamic nature of network topology and resources poses challenges in establishing deterministic routing. These challenges encompass the intricacy of jointly scheduling transmission links and cyc… ▽ More Deterministic routing has emerged as a promising technology for future non-terrestrial networks (NTNs), offering the potential to enhance service performance and optimize resource utilization. However, the dynamic nature of network topology and resources poses challenges in establishing deterministic routing. These challenges encompass the intricacy of jointly scheduling transmission links and cycles, as well as the difficulty of maintaining stable end-to-end (E2E) routing paths. To tackle these challenges, our work introduces an efficient temporal graph-based deterministic routing strategy. Initially, we utilize a time-expanded graph (TEG) to represent the heterogeneous resources of an NTN in a time-slotted manner. With TEG, we meticulously define each necessary constraint and formulate the deterministic routing problem. Subsequently, we transform this nonlinear problem equivalently into solvable integer linear programming (ILP), providing a robust yet time-consuming performance upper bound. To address the considered problem with reduced complexity, we extend TEG by introducing virtual nodes and edges. This extension facilitates a uniform representation of heterogeneous network resources and traffic transmission requirements. Consequently, we propose a polynomial-time complexity algorithm, enabling the dynamic selection of optimal transmission links and cycles on a hop-by-hop basis. Simulation results validate that the proposed algorithm yields significant performance gains in traffic acceptance, justifying its additional complexity compared to existing routing strategies. △ Less

Submitted 22 January, 2024; v1 submitted 12 November, 2022; originally announced November 2022.

arXiv:2211.03885 [pdf, other]

Learned Smartphone ISP on Mobile GPUs with Deep Learning, Mobile AI & AIM 2022 Challenge: Report

Authors: Andrey Ignatov, Radu Timofte, Shuai Liu, Chaoyu Feng, Furui Bai, Xiaotao Wang, Lei Lei, Ziyao Yi, Yan Xiang, Zibin Liu, Shaoqing Li, Keming Shi, Dehui Kong, Ke Xu, Minsu Kwon, Yaqi Wu, Jiesi Zheng, Zhihao Fan, Xun Wu, Feng Zhang, Albert No, Minhyeok Cho, Zewen Chen, Xiaze Zhang, Ran Li , et al. (13 additional authors not shown)

Abstract: The role of mobile cameras increased dramatically over the past few years, leading to more and more research in automatic image quality enhancement and RAW photo processing. In this Mobile AI challenge, the target was to develop an efficient end-to-end AI-based image signal processing (ISP) pipeline replacing the standard mobile ISPs that can run on modern smartphone GPUs using TensorFlow Lite. Th… ▽ More The role of mobile cameras increased dramatically over the past few years, leading to more and more research in automatic image quality enhancement and RAW photo processing. In this Mobile AI challenge, the target was to develop an efficient end-to-end AI-based image signal processing (ISP) pipeline replacing the standard mobile ISPs that can run on modern smartphone GPUs using TensorFlow Lite. The participants were provided with a large-scale Fujifilm UltraISP dataset consisting of thousands of paired photos captured with a normal mobile camera sensor and a professional 102MP medium-format FujiFilm GFX100 camera. The runtime of the resulting models was evaluated on the Snapdragon's 8 Gen 1 GPU that provides excellent acceleration results for the majority of common deep learning ops. The proposed solutions are compatible with all recent mobile GPUs, being able to process Full HD photos in less than 20-50 milliseconds while achieving high fidelity results. A detailed description of all models developed in this challenge is provided in this paper. △ Less

Submitted 7 November, 2022; originally announced November 2022.

arXiv:2211.00312 [pdf, other]

HDNet: Hierarchical Dynamic Network for Gait Recognition using Millimeter-Wave Radar

Authors: Yanyan Huang, Yong Wang, Kun Shi, Chaojie Gu, Yu Fu, Cheng Zhuo, Zhiguo Shi

Abstract: Gait recognition is widely used in diversified practical applications. Currently, the most prevalent approach is to recognize human gait from RGB images, owing to the progress of computer vision technologies. Nevertheless, the perception capability of RGB cameras deteriorates in rough circumstances, and visual surveillance may cause privacy invasion. Due to the robustness and non-invasive feature… ▽ More Gait recognition is widely used in diversified practical applications. Currently, the most prevalent approach is to recognize human gait from RGB images, owing to the progress of computer vision technologies. Nevertheless, the perception capability of RGB cameras deteriorates in rough circumstances, and visual surveillance may cause privacy invasion. Due to the robustness and non-invasive feature of millimeter wave (mmWave) radar, radar-based gait recognition has attracted increasing attention in recent years. In this research, we propose a Hierarchical Dynamic Network (HDNet) for gait recognition using mmWave radar. In order to explore more dynamic information, we propose point flow as a novel point clouds descriptor. We also devise a dynamic frame sampling module to promote the efficiency of computation without deteriorating performance noticeably. To prove the superiority of our methods, we perform extensive experiments on two public mmWave radar-based gait recognition datasets, and the results demonstrate that our model is superior to existing state-of-the-art methods. △ Less

Submitted 1 November, 2022; originally announced November 2022.

arXiv:2210.01841 [pdf, other]

Learning Perception-Aware Agile Flight in Cluttered Environments

Authors: Yunlong Song, Kexin Shi, Robert Penicka, Davide Scaramuzza

Abstract: Recently, neural control policies have outperformed existing model-based planning-and-control methods for autonomously navigating quadrotors through cluttered environments in minimum time. However, they are not perception aware, a crucial requirement in vision-based navigation due to the camera's limited field of view and the underactuated nature of a quadrotor. We propose a learning-based system… ▽ More Recently, neural control policies have outperformed existing model-based planning-and-control methods for autonomously navigating quadrotors through cluttered environments in minimum time. However, they are not perception aware, a crucial requirement in vision-based navigation due to the camera's limited field of view and the underactuated nature of a quadrotor. We propose a learning-based system that achieves perception-aware, agile flight in cluttered environments. Our method combines imitation learning with reinforcement learning (RL) by leveraging a privileged learning-by-cheating framework. Using RL, we first train a perception-aware teacher policy with full-state information to fly in minimum time through cluttered environments. Then, we use imitation learning to distill its knowledge into a vision-based student policy that only perceives the environment via a camera. Our approach tightly couples perception and control, showing a significant advantage in computation speed (10 times faster) and success rate. We demonstrate the closed-loop control performance using hardware-in-the-loop simulation. △ Less

Submitted 3 March, 2023; v1 submitted 4 October, 2022; originally announced October 2022.

Journal ref: 2023 IEEE International Conference on Robotics and Automation (ICRA)

arXiv:2210.01346 [pdf, other]

doi 10.1109/ICRA48891.2023.10161428

ImmFusion: Robust mmWave-RGB Fusion for 3D Human Body Reconstruction in All Weather Conditions

Authors: Anjun Chen, Xiangyu Wang, Kun Shi, Shaohao Zhu, Bin Fang, Yingfeng Chen, Jiming Chen, Yuchi Huo, Qi Ye

Abstract: 3D human reconstruction from RGB images achieves decent results in good weather conditions but degrades dramatically in rough weather. Complementary, mmWave radars have been employed to reconstruct 3D human joints and meshes in rough weather. However, combining RGB and mmWave signals for robust all-weather 3D human reconstruction is still an open challenge, given the sparse nature of mmWave and th… ▽ More 3D human reconstruction from RGB images achieves decent results in good weather conditions but degrades dramatically in rough weather. Complementary, mmWave radars have been employed to reconstruct 3D human joints and meshes in rough weather. However, combining RGB and mmWave signals for robust all-weather 3D human reconstruction is still an open challenge, given the sparse nature of mmWave and the vulnerability of RGB images. In this paper, we present ImmFusion, the first mmWave-RGB fusion solution to reconstruct 3D human bodies in all weather conditions robustly. Specifically, our ImmFusion consists of image and point backbones for token feature extraction and a Transformer module for token fusion. The image and point backbones refine global and local features from original data, and the Fusion Transformer Module aims for effective information fusion of two modalities by dynamically selecting informative tokens. Extensive experiments on a large-scale dataset, mmBody, captured in various environments demonstrate that ImmFusion can efficiently utilize the information of two modalities to achieve a robust 3D human body reconstruction in all weather conditions. In addition, our method's accuracy is significantly superior to that of state-of-the-art Transformer-based LiDAR-camera fusion methods. △ Less

Submitted 20 September, 2023; v1 submitted 3 October, 2022; originally announced October 2022.

Comments: Accepted to ICRA2023, Project Page: https://chen3110.github.io/ImmFusion/index.html

arXiv:2210.00227

Attention Augmented ConvNeXt UNet For Rectal Tumour Segmentation

Authors: Hongwei Wu, Junlin Wang, Xin Wang, Hui Nan, Yaxin Wang, Haonan Jing, Kaixuan Shi

Abstract: It is a challenge to segment the location and size of rectal cancer tumours through deep learning. In this paper, in order to improve the ability of extracting suffi-cient feature information in rectal tumour segmentation, attention enlarged ConvNeXt UNet (AACN-UNet), is proposed. The network mainly includes two improvements: 1) the encoder stage of UNet is changed to ConvNeXt structure for encodi… ▽ More It is a challenge to segment the location and size of rectal cancer tumours through deep learning. In this paper, in order to improve the ability of extracting suffi-cient feature information in rectal tumour segmentation, attention enlarged ConvNeXt UNet (AACN-UNet), is proposed. The network mainly includes two improvements: 1) the encoder stage of UNet is changed to ConvNeXt structure for encoding operation, which can not only integrate multi-scale semantic information on a large scale, but al-so reduce information loss and extract more feature information from CT images; 2) CBAM attention mechanism is added to improve the connection of each feature in channel and space, which is conducive to extracting the effective feature of the target and improving the segmentation accuracy.The experiment with UNet and its variant network shows that AACN-UNet is 0.9% ,1.1% and 1.4% higher than the current best results in P, F1 and Miou.Compared with the training time, the number of parameters in UNet network is less. This shows that our proposed AACN-UNet has achieved ex-cellent results in CT image segmentation of rectal cancer. △ Less

Submitted 26 October, 2022; v1 submitted 1 October, 2022; originally announced October 2022.

Comments: I plan to replace this article, and supplement and confirm the structure and experimental content of this article

Showing 1–50 of 97 results for author: Shi, K