Search | arXiv e-print repository

The Llama 3 Herd of Models

Authors: Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, et al. (508 additional authors not shown)

Abstract: Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.

Submitted 31 July, 2024; originally announced July 2024.

arXiv:2407.05740 [pdf, other]

Do Multilingual Large Language Models Mitigate Stereotype Bias?

Authors: Shangrui Nie, Michael Fromm, Charles Welch, Rebekka Görge, Akbar Karimi, Joan Plepi, Nazia Afsan Mowmita, Nicolas Flores-Herr, Mehdi Ali, Lucie Flek

Abstract: While preliminary findings indicate that multilingual LLMs exhibit reduced bias compared to monolingual ones, a comprehensive understanding of the effect of multilingual training on bias mitigation, is lacking. This study addresses this gap by systematically training six LLMs of identical size (2.6B parameters) and architecture: five monolingual models (English, German, French, Italian, and Spanish) and one multilingual model trained on an equal distribution of data across these languages, all using publicly available data. To ensure robust evaluation, standard bias benchmarks were automatically translated into the five target languages and verified for both translation quality and bias preservation by human annotators. Our results consistently demonstrate that multilingual training effectively mitigates bias. Moreover, we observe that multilingual models achieve not only lower bias but also superior prediction accuracy when compared to monolingual models with the same amount of training data, model architecture, and size.

Submitted 9 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

Comments: 19 pages, 8 figures, C3NLP 2024

arXiv:2406.03736 [pdf, other]

Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data

Authors: Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, Chongxuan Li

Abstract: Discrete diffusion models with absorbing processes have shown promise in language modeling. The key quantities to be estimated are the ratios between the marginal probabilities of two transitive states at all timesteps, called the concrete score. In this paper, we reveal that the concrete score in absorbing diffusion can be expressed as conditional probabilities of clean data, multiplied by a time-dependent scalar in an analytic form. Motivated by this finding, we propose reparameterized absorbing discrete diffusion (RADD), a dedicated diffusion model without time-condition that characterizes the time-independent conditional probabilities. Besides its simplicity, RADD can reduce the number of function evaluations (NFEs) by caching the output of the time-independent network when the noisy sample remains unchanged in a sampling interval. Empirically, RADD is up to 3.5 times faster while achieving similar performance with the strongest baseline. Built upon the new perspective of conditional distributions, we further unify absorbing discrete diffusion and any-order autoregressive models (AO-ARMs), showing that the upper bound on the negative log-likelihood for the diffusion model can be interpreted as an expected negative log-likelihood for AO-ARMs. Further, our RADD models achieve SOTA performance among diffusion models on 5 zero-shot language modeling benchmarks (measured by perplexity) at the GPT-2 scale. Our code is available at https://github.com/ML-GSAI/RADD.

Submitted 6 July, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

arXiv:2405.05741 [pdf, ps, other]

Can large language models understand uncommon meanings of common words?

Authors: Jinyang Wu, Feihu Che, Xinxin Zheng, Shuai Zhang, Ruihan Jin, Shuai Nie, Pengpeng Shao, Jianhua Tao

Abstract: Large language models (LLMs) like ChatGPT have shown significant advancements across diverse natural language understanding (NLU) tasks, including intelligent dialogue and autonomous agents. Yet, lacking widely acknowledged testing mechanisms, answering `whether LLMs are stochastic parrots or genuinely comprehend the world' remains unclear, fostering numerous studies and sparking heated debates. Prevailing research mainly focuses on surface-level NLU, neglecting fine-grained explorations. However, such explorations are crucial for understanding their unique comprehension mechanisms, aligning with human cognition, and finally enhancing LLMs' general NLU capacities. To address this gap, our study delves into LLMs' nuanced semantic comprehension capabilities, particularly regarding common words with uncommon meanings. The idea stems from foundational principles of human communication within psychology, which underscore accurate shared understandings of word semantics. Specifically, this paper presents the innovative construction of a Lexical Semantic Comprehension (LeSC) dataset with novel evaluation metrics, the first benchmark encompassing both fine-grained and cross-lingual dimensions. Introducing models of both open-source and closed-source, varied scales and architectures, our extensive empirical experiments demonstrate the inferior performance of existing models in this basic lexical-meaning understanding task. Notably, even the state-of-the-art LLMs GPT-4 and GPT-3.5 lag behind 16-year-old humans by 3.9% and 22.3%, respectively. Additionally, multiple advanced prompting techniques and retrieval-augmented generation are also introduced to help alleviate this trouble, yet limitations persist. By highlighting the above critical shortcomings, this research motivates further investigation and offers novel insights for developing more intelligent LLMs.

Submitted 9 May, 2024; originally announced May 2024.

arXiv:2404.15766 [pdf, other]

Unifying Bayesian Flow Networks and Diffusion Models through Stochastic Differential Equations

Authors: Kaiwen Xue, Yuhao Zhou, Shen Nie, Xu Min, Xiaolu Zhang, Jun Zhou, Chongxuan Li

Abstract: Bayesian flow networks (BFNs) iteratively refine the parameters, instead of the samples in diffusion models (DMs), of distributions at various noise levels through Bayesian inference. Owing to its differentiable nature, BFNs are promising in modeling both continuous and discrete data, while simultaneously maintaining fast sampling capabilities. This paper aims to understand and enhance BFNs by connecting them with DMs through stochastic differential equations (SDEs). We identify the linear SDEs corresponding to the noise-addition processes in BFNs, demonstrate that BFN's regression losses are aligned with denoise score matching, and validate the sampler in BFN as a first-order solver for the respective reverse-time SDE. Based on these findings and existing recipes of fast sampling in DMs, we propose specialized solvers for BFNs that markedly surpass the original BFN sampler in terms of sample quality with a limited number of function evaluations (e.g., 10) on both image and text datasets. Notably, our best sampler achieves an increase in speed of 5~20 times for free. Our code is available at https://github.com/ML-GSAI/BFN-Solver.

Submitted 2 June, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

Comments: Published as a conference paper at ICML 2024

arXiv:2404.15660 [pdf, other]

KS-LLM: Knowledge Selection of Large Language Models with Evidence Document for Question Answering

Authors: Xinxin Zheng, Feihu Che, Jinyang Wu, Shuai Zhang, Shuai Nie, Kang Liu, Jianhua Tao

Abstract: Large language models (LLMs) suffer from the hallucination problem and face significant challenges when applied to knowledge-intensive tasks. A promising approach is to leverage evidence documents as extra supporting knowledge, which can be obtained through retrieval or generation. However, existing methods directly leverage the entire contents of the evidence document, which may introduce noise information and impair the performance of large language models. To tackle this problem, we propose a novel Knowledge Selection of Large Language Models (KS-LLM) method, aiming to identify valuable information from evidence documents. The KS-LLM approach utilizes triples to effectively select knowledge snippets from evidence documents that are beneficial to answering questions. Specifically, we first generate triples based on the input question, then select the evidence sentences most similar to triples from the evidence document, and finally combine the evidence sentences and triples to assist large language models in generating answers. Experimental comparisons on several question answering datasets, such as TriviaQA, WebQ, and NQ, demonstrate that the proposed method surpasses the baselines and achieves the best results.

Submitted 24 April, 2024; originally announced April 2024.

arXiv:2404.12980 [pdf, other]

Ring-a-Pose: A Ring for Continuous Hand Pose Tracking

Authors: Tianhong Catherine Yu, Guilin Hu, Ruidong Zhang, Hyunchul Lim, Saif Mahmud, Chi-Jung Lee, Ke Li, Devansh Agarwal, Shuyang Nie, Jinseok Oh, François Guimbretière, Cheng Zhang

Abstract: We present Ring-a-Pose, a single untethered ring that tracks continuous 3D hand poses. Located in the center of the hand, the ring emits an inaudible acoustic signal that each hand pose reflects differently. Ring-a-Pose imposes minimal obtrusions on the hand, unlike multi-ring or glove systems. It is not affected by the choice of clothing that may cover wrist-worn systems. In a series of three user studies with a total of 30 participants, we evaluate Ring-a-Pose's performance on pose tracking and micro-finger gesture recognition. Without collecting any training data from a user, Ring-a-Pose tracks continuous hand poses with a joint error of 14.1mm. The joint error decreases to 10.3mm for fine-tuned user-dependent models. Ring-a-Pose recognizes 7-class micro-gestures with a 90.60% and 99.27% accuracy for user-independent and user-dependent models, respectively. Furthermore, the ring exhibits promising performance when worn on any finger. Ring-a-Pose enables the future of smart rings to track and recognize hand poses using relatively low-power acoustic sensing.

Submitted 19 April, 2024; originally announced April 2024.

arXiv:2401.11161 [pdf, other]

BinaryAI: Binary Software Composition Analysis via Intelligent Binary Source Code Matching

Authors: Ling Jiang, Junwen An, Huihui Huang, Qiyi Tang, Sen Nie, Shi Wu, Yuqun Zhang

Abstract: While third-party libraries are extensively reused to enhance productivity during software development, they can also introduce potential security risks such as vulnerability propagation. Software composition analysis, proposed to identify reused TPLs for reducing such risks, has become an essential procedure within modern DevSecOps. As one of the mainstream SCA techniques, binary-to-source SCA identifies the third-party source projects contained in binary files via binary source code matching, which is a major challenge in reverse engineering since binary and source code exhibit substantial disparities after compilation. The existing binary-to-source SCA techniques leverage basic syntactic features that suffer from redundancy and lack robustness in the large-scale TPL dataset, leading to inevitable false positives and compromised recall. To mitigate these limitations, we introduce BinaryAI, a novel binary-to-source SCA technique with two-phase binary source code matching to capture both syntactic and semantic code features. First, BinaryAI trains a transformer-based model to produce function-level embeddings and obtain similar source functions for each binary function accordingly. Then by applying the link-time locality to facilitate function matching, BinaryAI detects the reused TPLs based on the ratio of matched source functions. Our experimental results demonstrate the superior performance of BinaryAI in terms of binary source code matching and the downstream SCA task. Specifically, our embedding model outperforms the state-of-the-art model CodeCMR, i.e., achieving 22.54% recall@1 and 0.34 MRR compared with 10.75% and 0.17 respectively. Additionally, BinaryAI outperforms all existing binary-to-source SCA tools in TPL detection, increasing the precision from 73.36% to 85.84% and recall from 59.81% to 64.98% compared with the well-recognized commercial SCA product Black Duck.

Submitted 23 January, 2024; v1 submitted 20 January, 2024; originally announced January 2024.

Comments: In Proceedings of the 46th International Conference on Software Engineering (ICSE'24)

arXiv:2311.01410 [pdf, other]

The Blessing of Randomness: SDE Beats ODE in General Diffusion-based Image Editing

Authors: Shen Nie, Hanzhong Allan Guo, Cheng Lu, Yuhao Zhou, Chenyu Zheng, Chongxuan Li

Abstract: We present a unified probabilistic formulation for diffusion-based image editing, where a latent variable is edited in a task-specific manner and generally deviates from the corresponding marginal distribution induced by the original stochastic or ordinary differential equation (SDE or ODE). Instead, it defines a corresponding SDE or ODE for editing. In the formulation, we prove that the Kullback-Leibler divergence between the marginal distributions of the two SDEs gradually decreases while that for the ODEs remains as the time approaches zero, which shows the promise of SDE in image editing. Inspired by it, we provide the SDE counterparts for widely used ODE baselines in various tasks including inpainting and image-to-image translation, where SDE shows a consistent and substantial improvement. Moreover, we propose SDE-Drag -- a simple yet effective method built upon the SDE formulation for point-based content dragging. We build a challenging benchmark (termed DragBench) with open-set natural, art, and AI-generated images for evaluation. A user study on DragBench indicates that SDE-Drag significantly outperforms our ODE baseline, existing diffusion-based methods, and the renowned DragGAN. Our results demonstrate the superiority and versatility of SDE in image editing and push the boundary of diffusion-based editing methods.

Submitted 29 February, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

arXiv:2310.11738 [pdf, other]

Unleashing the Power of Clippy in Real-World Rust Projects

Authors: Chunmiao Li, Yijun Yu, Haitao Wu, Luca Carlig, Shijie Nie, Lingxiao Jiang

Abstract: Clippy lints are considered as essential tools for Rust developers, as they can be configured as gate-keeping rules for a Rust project during continuous integration. Despite their availability, little was known about practical application and cost-effectiveness of the lints in reducing code quality issues. In this study, we embark on a comprehensive analysis to unveil the true impact of Clippy lints in the Rust development landscape. The study is structured around three interrelated components, each contributing to the overall effectiveness of Clippy. Firstly, we conduct a comprehensive analysis of Clippy lints in all idiomatic crates-io Rust projects with an average warning density of 21/KLOC. The analysis identifies the most cost-effective lint fixes, offering valuable opportunities for optimizing code quality. Secondly, we actively engage Rust developers through a user survey to garner invaluable feedback on their experiences with Clippy. User insights shed light on two crucial concerns: the prevalence of false positives in warnings and the need for auto-fix support for most warnings. Thirdly, building upon these findings, we engineer three innovative automated refactoring techniques to effectively fix the four most frequent Clippy lints. As a result, the warning density in Rosetta benchmarks has significantly decreased from 195/KLOC to an impressive 18/KLOC, already lower than the average density of the crates-io Rust projects. These results demonstrate tangible benefit and impact of our efforts in enhancing the overall code quality and maintainability for Rust developers.

Submitted 18 October, 2023; originally announced October 2023.

arXiv:2310.06530 [pdf, other]

Refining Decompiled C Code with Large Language Models

Authors: Wai Kin Wong, Huaijin Wang, Zongjie Li, Zhibo Liu, Shuai Wang, Qiyi Tang, Sen Nie, Shi Wu

Abstract: A C decompiler converts an executable into source code. The recovered C source code, once re-compiled, is expected to produce an executable with the same functionality as the original executable. With over twenty years of development, C decompilers have been widely used in production to support reverse engineering applications. Despite the prosperous development of C decompilers, it is widely acknowledged that decompiler outputs are mainly used for human consumption, and are not suitable for automatic recompilation. Often, a substantial amount of manual effort is required to fix the decompiler outputs before they can be recompiled and executed properly. This paper is motived by the recent success of large language models (LLMs) in comprehending dense corpus of natural language. To alleviate the tedious, costly and often error-prone manual effort in fixing decompiler outputs, we investigate the feasibility of using LLMs to augment decompiler outputs, thus delivering recompilable decompilation. Note that different from previous efforts that focus on augmenting decompiler outputs with higher readability (e.g., recovering type/variable names), we focus on augmenting decompiler outputs with recompilability, meaning to generate code that can be recompiled into an executable with the same functionality as the original executable. We conduct a pilot study to characterize the obstacles in recompiling the outputs of the de facto commercial C decompiler -- IDA-Pro. We then propose a two-step, hybrid approach to augmenting decompiler outputs with LLMs. We evaluate our approach on a set of popular C test cases, and show that our approach can deliver a high recompilation success rate to over 75% with moderate effort, whereas none of the IDA-Pro's original outputs can be recompiled. We conclude with a discussion on the limitations of our approach and promising future research directions.

Submitted 28 November, 2023; v1 submitted 10 October, 2023; originally announced October 2023.

arXiv:2310.00183 [pdf, other]

On the Equivalence of Graph Convolution and Mixup

Authors: Xiaotian Han, Hanqing Zeng, Yu Chen, Shaoliang Nie, Jingzhou Liu, Kanika Narang, Zahra Shakeri, Karthik Abinav Sankararaman, Song Jiang, Madian Khabsa, Qifan Wang, Xia Hu

Abstract: This paper investigates the relationship between graph convolution and Mixup techniques. Graph convolution in a graph neural network involves aggregating features from neighboring samples to learn representative features for a specific node or sample. On the other hand, Mixup is a data augmentation technique that generates new examples by averaging features and one-hot labels from multiple samples. One commonality between these techniques is their utilization of information from multiple samples to derive feature representation. This study aims to explore whether a connection exists between these two approaches. Our investigation reveals that, under two mild conditions, graph convolution can be viewed as a specialized form of Mixup that is applied during both the training and testing phases. The two conditions are: 1) Homophily Relabel - assigning the target node's label to all its neighbors, and 2) Test-Time Mixup - Mixup the feature during the test time. We establish this equivalence mathematically by demonstrating that graph convolution networks (GCN) and simplified graph convolution (SGC) can be expressed as a form of Mixup. We also empirically verify the equivalence by training an MLP using the two conditions to achieve comparable performance.

Submitted 29 September, 2023; originally announced October 2023.

arXiv:2305.13774 [pdf, other]

ADD 2023: the Second Audio Deepfake Detection Challenge

Authors: Jiangyan Yi, Jianhua Tao, Ruibo Fu, Xinrui Yan, Chenglong Wang, Tao Wang, Chu Yuan Zhang, Xiaohui Zhang, Yan Zhao, Yong Ren, Le Xu, Junzuo Zhou, Hao Gu, Zhengqi Wen, Shan Liang, Zheng Lian, Shuai Nie, Haizhou Li

Abstract: Audio deepfake detection is an emerging topic in the artificial intelligence community. The second Audio Deepfake Detection Challenge (ADD 2023) aims to spur researchers around the world to build new innovative technologies that can further accelerate and foster research on detecting and analyzing deepfake speech utterances. Different from previous challenges (e.g. ADD 2022), ADD 2023 focuses on surpassing the constraints of binary real/fake classification, and actually localizing the manipulated intervals in a partially fake speech as well as pinpointing the source responsible for generating any fake audio. Furthermore, ADD 2023 includes more rounds of evaluation for the fake audio game sub-challenge. The ADD 2023 challenge includes three subchallenges: audio fake game (FG), manipulation region location (RL) and deepfake algorithm recognition (AR). This paper describes the datasets, evaluation metrics, and protocols. Some findings are also reported in audio deepfake detection tasks.

Submitted 23 May, 2023; originally announced May 2023.

arXiv:2305.07095 [pdf, other]

Are Machine Rationales (Not) Useful to Humans? Measuring and Improving Human Utility of Free-Text Rationales

Authors: Brihi Joshi, Ziyi Liu, Sahana Ramnath, Aaron Chan, Zhewei Tong, Shaoliang Nie, Qifan Wang, Yejin Choi, Xiang Ren

Abstract: Among the remarkable emergent capabilities of large language models (LMs) is free-text rationalization; beyond a certain scale, large LMs are capable of generating seemingly useful rationalizations, which in turn, can dramatically enhance their performances on leaderboards. This phenomenon raises a question: can machine generated rationales also be useful for humans, especially when lay humans try to answer questions based on those machine rationales? We observe that human utility of existing rationales is far from satisfactory, and expensive to estimate with human studies. Existing metrics like task performance of the LM generating the rationales, or similarity between generated and gold rationales are not good indicators of their human utility. While we observe that certain properties of rationales like conciseness and novelty are correlated with their human utility, estimating them without human involvement is challenging. We show that, by estimating a rationale's helpfulness in answering similar unseen instances, we can measure its human utility to a better extent. We also translate this finding into an automated score, GEN-U, that we propose, which can help improve LMs' ability to generate rationales with better human utility, while maintaining most of its task performance. Lastly, we release all code and collected data with this project.

Submitted 11 May, 2023; originally announced May 2023.

Comments: Accepted at ACL 2023

arXiv:2304.02838 [pdf, other]

TBDetector: Transformer-Based Detector for Advanced Persistent Threats with Provenance Graph

Authors: Nan Wang, Xuezhi Wen, Dalin Zhang, Xibin Zhao, Jiahui Ma, Mengxia Luo, Sen Nie, Shi Wu, Jiqiang Liu

Abstract: APT detection is difficult to detect due to the long-term latency, covert and slow multistage attack patterns of Advanced Persistent Threat (APT). To tackle these issues, we propose TBDetector, a transformer-based advanced persistent threat detection method for APT attack detection. Considering that provenance graphs provide rich historical information and have the powerful attacks historic correlation ability to identify anomalous activities, TBDetector employs provenance analysis for APT detection, which summarizes long-running system execution with space efficiency and utilizes transformer with self-attention based encoder-decoder to extract long-term contextual features of system states to detect slow-acting attacks. Furthermore, we further introduce anomaly scores to investigate the anomaly of different system states, where each state is calculated with an anomaly score corresponding to its similarity score and isolation score. To evaluate the effectiveness of the proposed method, we have conducted experiments on five public datasets, i.e., streamspot, cadets, shellshock, clearscope, and wget_baseline. Experimental results and comparisons with state-of-the-art methods have exhibited better performance of our proposed method.

Submitted 5 April, 2023; originally announced April 2023.

Comments: 10 pages, 7 figures

arXiv:2303.06555 [pdf, other]

One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale

Authors: Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu

Abstract: This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is -- learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e. timesteps) can be different for different modalities. Inspired by the unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model -- perturbs data in all modalities instead of a single modality, inputs individual timesteps in different modalities, and predicts the noise of all modalities instead of a single modality. UniDiffuser is parameterized by a transformer for diffusion models to handle input types of different modalities. Implemented on large-scale paired image-text data, UniDiffuser is able to perform image, text, text-to-image, image-to-text, and image-text pair generation by setting proper timesteps without additional overhead. In particular, UniDiffuser is able to produce perceptually realistic samples in all tasks and its quantitative results (e.g., the FID and CLIP score) are not only superior to existing general-purpose models but also comparable to the bespoke models (e.g., Stable Diffusion and DALL-E 2) in representative tasks (e.g., text-to-image generation).

Submitted 30 May, 2023; v1 submitted 11 March, 2023; originally announced March 2023.

Comments: Accepted to ICML2023

arXiv:2302.12247 [pdf, other]

Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework

Authors: Paul Pu Liang, Yun Cheng, Xiang Fan, Chun Kai Ling, Suzanne Nie, Richard Chen, Zihao Deng, Nicholas Allen, Randy Auerbach, Faisal Mahmood, Ruslan Salakhutdinov, Louis-Philippe Morency

Abstract: The recent explosion of interest in multimodal applications has resulted in a wide selection of datasets and methods for representing and integrating information from different modalities. Despite these empirical advances, there remain fundamental research questions: How can we quantify the interactions that are necessary to solve a multimodal task? Subsequently, what are the most suitable multimodal models to capture these interactions? To answer these questions, we propose an information-theoretic approach to quantify the degree of redundancy, uniqueness, and synergy relating input modalities with an output task. We term these three measures as the PID statistics of a multimodal distribution (or PID for short), and introduce two new estimators for these PID statistics that scale to high-dimensional distributions. To validate PID estimation, we conduct extensive experiments on both synthetic datasets where the PID is known and on large-scale multimodal benchmarks where PID estimations are compared with human annotations. Finally, we demonstrate their usefulness in (1) quantifying interactions within multimodal datasets, (2) quantifying interactions captured by multimodal models, (3) principled approaches for model selection, and (4) three real-world case studies engaging with domain experts in pathology, mood prediction, and robotic perception where our framework helps to recommend strong multimodal models for each application.

Submitted 10 December, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

Comments: NeurIPS 2023. Code available at: https://github.com/pliang279/PID

arXiv:2210.15500 [pdf, other]

COFFEE: Counterfactual Fairness for Personalized Text Generation in Explainable Recommendation

Authors: Nan Wang, Qifan Wang, Yi-Chia Wang, Maziar Sanjabi, Jingzhou Liu, Hamed Firooz, Hongning Wang, Shaoliang Nie

Abstract: As language models become increasingly integrated into our digital lives, Personalized Text Generation (PTG) has emerged as a pivotal component with a wide range of applications. However, the bias inherent in user written text, often used for PTG model training, can inadvertently associate different levels of linguistic quality with users' protected attributes. The model can inherit the bias and perpetuate inequality in generating text w.r.t. users' protected attributes, leading to unfair treatment when serving users. In this work, we investigate fairness of PTG in the context of personalized explanation generation for recommendations. We first discuss the biases in generated explanations and their fairness implications. To promote fairness, we introduce a general framework to achieve measure-specific counterfactual fairness in explanation generation. Extensive experiments and human evaluations demonstrate the effectiveness of our method.

Submitted 22 October, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

Comments: This is a long paper accepted by the Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)

arXiv:2210.15159 [pdf, other]

Comparing One with Many -- Solving Binary2source Function Matching Under Function Inlining

Authors: Ang Jia, Ming Fan, Xi Xu, Wuxia Jin, Haijun Wang, Qiyi Tang, Sen Nie, Shi Wu, Ting Liu

Abstract: Binary2source function matching is a fundamental task for many security applications, including Software Component Analysis (SCA). The "1-to-1" mechanism has been applied in existing binary2source matching works, in which one binary function is matched against one source function. However, we discovered that such mapping could be "1-to-n" (one query binary function maps to multiple source functions), due to the existence of function inlining. To help conduct binary2source function matching under function inlining, we propose a method named O2NMatcher to generate Source Function Sets (SFSs) as the matching target for binary functions with inlining. We first propose a model named ECOCCJ48 for inlined call site prediction. To train this model, we leverage the compilable OSS to generate a dataset with labeled call sites (inlined or not), extract several features from the call sites, and design a compiler-opt-based multi-label classifier by inspecting the inlining correlations between different compilations. Then, we use this model to predict the labels of call sites in the uncompilable OSS projects without compilation and obtain the labeled function call graphs of these projects. Next, we regard the construction of SFSs as a sub-tree generation problem and design root node selection and edge extension rules to construct SFSs automatically. Finally, these SFSs will be added to the corpus of source functions and compared with binary functions with inlining. We conduct several experiments to evaluate the effectiveness of O2NMatcher and results show our method increases the performance of existing works by 6% and exceeds all the state-of-the-art works.

Submitted 26 October, 2022; originally announced October 2022.

arXiv:2210.05883 [pdf, other]

AD-DROP: Attribution-Driven Dropout for Robust Language Model Fine-Tuning

Authors: Tao Yang, Jinghao Deng, Xiaojun Quan, Qifan Wang, Shaoliang Nie

Abstract: Fine-tuning large pre-trained language models on downstream tasks is apt to suffer from overfitting when limited training data is available. While dropout proves to be an effective antidote by randomly dropping a proportion of units, existing research has not examined its effect on the self-attention mechanism. In this paper, we investigate this problem through self-attention attribution and find that dropping attention positions with low attribution scores can accelerate training and increase the risk of overfitting. Motivated by this observation, we propose Attribution-Driven Dropout (AD-DROP), which randomly discards some high-attribution positions to encourage the model to make predictions by relying more on low-attribution positions to reduce overfitting. We also develop a cross-tuning strategy to alternate fine-tuning and AD-DROP to avoid dropping high-attribution positions excessively. Extensive experiments on various benchmarks show that AD-DROP yields consistent improvements over baselines. Analysis further confirms that AD-DROP serves as a strategic regularizer to prevent overfitting during fine-tuning.

Submitted 11 October, 2022; originally announced October 2022.

Comments: Accepted to NeurIPS 2022

arXiv:2209.12152 [pdf, other]

All are Worth Words: A ViT Backbone for Diffusion Models

Authors: Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, Jun Zhu

Abstract: Vision transformers (ViT) have shown promise in various vision tasks while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. We design a simple and general ViT-based architecture (named U-ViT) for image generation with diffusion models. U-ViT is characterized by treating all inputs including the time, condition and noisy image patches as tokens and employing long skip connections between shallow and deep layers. We evaluate U-ViT in unconditional and class-conditional image generation, as well as text-to-image generation tasks, where U-ViT is comparable if not superior to a CNN-based U-Net of a similar size. In particular, latent diffusion models with U-ViT achieve record-breaking FID scores of 2.29 in class-conditional image generation on ImageNet 256x256, and 5.48 in text-to-image generation on MS-COCO, among methods without accessing large external datasets during the training of generative models. Our results suggest that, for diffusion-based image modeling, the long skip connection is crucial while the down-sampling and up-sampling operators in CNN-based U-Net are not always necessary. We believe that U-ViT can provide insights for future research on backbones in diffusion models and benefit generative modeling on large scale cross-modality datasets.

Submitted 25 March, 2023; v1 submitted 25 September, 2022; originally announced September 2022.

Comments: Accepted to CVPR 2023

arXiv:2207.00779 [pdf, other]

FRAME: Evaluating Rationale-Label Consistency Metrics for Free-Text Rationales

Authors: Aaron Chan, Shaoliang Nie, Liang Tan, Xiaochang Peng, Hamed Firooz, Maziar Sanjabi, Xiang Ren

Abstract: Following how humans communicate, free-text rationales aim to use natural language to explain neural language model (LM) behavior. However, free-text rationales' unconstrained nature makes them prone to hallucination, so it is important to have metrics for free-text rationale quality. Existing free-text rationale metrics measure how consistent the rationale is with the LM's predicted label, but there is no protocol for assessing such metrics' reliability. Thus, we propose FRAME, a framework for evaluating rationale-label consistency (RLC) metrics for free-text rationales. FRAME is based on three axioms: (1) good metrics should yield highest scores for reference rationales, which maximize RLC by construction; (2) good metrics should be appropriately sensitive to semantic perturbation of rationales; and (3) good metrics should be robust to variation in the LM's task performance. Across three text classification datasets, we show that existing RLC metrics cannot satisfy all three FRAME axioms, since they are implemented via model pretraining which muddles the metric's signal. Then, we introduce a non-pretraining RLC metric that greatly outperforms baselines on (1) and (3), while performing competitively on (2). Finally, we discuss the limitations of using RLC to evaluate free-text rationales.

Submitted 2 December, 2022; v1 submitted 2 July, 2022; originally announced July 2022.

Comments: BlackboxNLP Workshop at EMNLP 2022

arXiv:2205.12542 [pdf, other]

ER-Test: Evaluating Explanation Regularization Methods for Language Models

Authors: Brihi Joshi, Aaron Chan, Ziyi Liu, Shaoliang Nie, Maziar Sanjabi, Hamed Firooz, Xiang Ren

Abstract: By explaining how humans would solve a given task, human rationales can provide strong learning signal for neural language models (LMs). Explanation regularization (ER) aims to improve LM generalization by pushing the LM's machine rationales (Which input tokens did the LM focus on?) to align with human rationales (Which input tokens would humans focus on?). Though prior works primarily study ER via in-distribution (ID) evaluation, out-of-distribution (OOD) generalization is often more critical in real-world scenarios, yet ER's effect on OOD generalization has been underexplored. In this paper, we introduce ER-Test, a framework for evaluating ER models' OOD generalization along three dimensions: unseen dataset tests, contrast set tests, and functional tests. Using ER-Test, we extensively analyze how ER models' OOD generalization varies with different ER design choices. Across two tasks and six datasets, ER-Test shows that ER has little impact on ID performance but can yield large OOD performance gains. Also, we find that ER can improve OOD performance even with limited rationale supervision. ER-Test's results help demonstrate ER's utility and establish best practices for using ER effectively.

Submitted 27 February, 2023; v1 submitted 25 May, 2022; originally announced May 2022.

Comments: Findings of EMNLP 2022

arXiv:2204.09191 [pdf, other]

Unleashing the Power of Compiler Intermediate Representation to Enhance Neural Program Embeddings

Authors: Zongjie Li, Pingchuan Ma, Huaijin Wang, Shuai Wang, Qiyi Tang, Sen Nie, Shi Wu

Abstract: Neural program embeddings have demonstrated considerable promise in a range of program analysis tasks, including clone identification, program repair, code completion, and program synthesis. However, most existing methods generate neural program embeddings directly from the program source codes, by learning from features such as tokens, abstract syntax trees, and control flow graphs. This paper takes a fresh look at how to improve program embeddings by leveraging compiler intermediate representation (IR). We first demonstrate simple yet highly effective methods for enhancing embedding quality by training embedding models alongside source code and LLVM IR generated by default optimization levels (e.g., -O2). We then introduce IRGen, a framework based on genetic algorithms (GA), to identify (near-)optimal sequences of optimization flags that can significantly improve embedding quality.

Submitted 19 April, 2022; originally announced April 2022.

arXiv:2204.05990 [pdf, other]

Detection, Disambiguation, Re-ranking: Autoregressive Entity Linking as a Multi-Task Problem

Authors: Khalil Mrini, Shaoliang Nie, Jiatao Gu, Sinong Wang, Maziar Sanjabi, Hamed Firooz

Abstract: We propose an autoregressive entity linking model, that is trained with two auxiliary tasks, and learns to re-rank generated samples at inference time. Our proposed novelties address two weaknesses in the literature. First, a recent method proposes to learn mention detection and then entity candidate selection, but relies on predefined sets of candidates. We use encoder-decoder autoregressive entity linking in order to bypass this need, and propose to train mention detection as an auxiliary task instead. Second, previous work suggests that re-ranking could help correct prediction errors. We add a new, auxiliary task, match prediction, to learn re-ranking. Without the use of a knowledge base or candidate sets, our model sets a new state of the art in two benchmark datasets of entity linking: COMETA in the biomedical domain, and AIDA-CoNLL in the news domain. We show through ablation studies that each of the two auxiliary tasks increases performance, and that re-ranking is an important factor to the increase. Finally, our low-resource experimental results suggest that performance on the main task benefits from the knowledge learned by the auxiliary tasks, and not just from the additional training data.

Submitted 12 April, 2022; originally announced April 2022.

Comments: Long paper accepted to ACL 2022 Findings

arXiv:2202.08433 [pdf, ps, other]

ADD 2022: the First Audio Deep Synthesis Detection Challenge

Authors: Jiangyan Yi, Ruibo Fu, Jianhua Tao, Shuai Nie, Haoxin Ma, Chenglong Wang, Tao Wang, Zhengkun Tian, Xiaohui Zhang, Ye Bai, Cunhang Fan, Shan Liang, Shiming Wang, Shuai Zhang, Xinrui Yan, Le Xu, Zhengqi Wen, Haizhou Li, Zheng Lian, Bin Liu

Abstract: Audio deepfake detection is an emerging topic, which was included in the ASVspoof 2021. However, the recent shared tasks have not covered many real-life and challenging scenarios. The first Audio Deep synthesis Detection challenge (ADD) was motivated to fill in the gap. The ADD 2022 includes three tracks: low-quality fake audio detection (LF), partially fake audio detection (PF) and audio fake game (FG). The LF track focuses on dealing with bona fide and fully fake utterances with various real-world noises etc. The PF track aims to distinguish the partially fake audio from the real. The FG track is a rivalry game, which includes two tasks: an audio generation task and an audio fake detection task. In this paper, we describe the datasets, evaluation metrics, and protocols. We also report major findings that reflect the recent advances in audio deepfake detection tasks.

Submitted 2 July, 2024; v1 submitted 16 February, 2022; originally announced February 2022.

Comments: Accepted by ICASSP 2022

arXiv:2201.00072 [pdf, other]

BARACK: Partially Supervised Group Robustness With Guarantees

Authors: Nimit S. Sohoni, Maziar Sanjabi, Nicolas Ballas, Aditya Grover, Shaoliang Nie, Hamed Firooz, Christopher Ré

Abstract: While neural networks have shown remarkable success on classification tasks in terms of average-case performance, they often fail to perform well on certain groups of the data. Such group information may be expensive to obtain; thus, recent works in robustness and fairness have proposed ways to improve worst-group performance even when group labels are unavailable for the training data. However, these methods generally underperform methods that utilize group information at training time. In this work, we assume access to a small number of group labels alongside a larger dataset without group labels. We propose BARACK, a simple two-step framework to utilize this partial group information to improve worst-group performance: train a model to predict the missing group labels for the training data, and then use these predicted group labels in a robust optimization objective. Theoretically, we provide generalization bounds for our approach in terms of the worst-group performance, which scale with respect to both the total number of training points and the number of training points with group labels. Empirically, our method outperforms the baselines that do not use group information, even when only 1-33% of points have group labels. We provide ablation studies to support the robustness and extensibility of our framework.

Submitted 10 April, 2022; v1 submitted 31 December, 2021; originally announced January 2022.

Comments: 26 pages

arXiv:2112.12928 [pdf, other]

1-to-1 or 1-to-n? Investigating the effect of function inlining on binary similarity analysis

Authors: Ang Jia, Ming Fan, Wuxia Jin, Xi Xu, Zhaohui Zhou, Qiyi Tang, Sen Nie, Shi Wu, Ting Liu

Abstract: Binary similarity analysis is critical to many code-reuse-related issues and "1-to-1" mechanism is widely applied, where one function in a binary file is matched against one function in a source file or binary file. However, we discover that function mapping is a more complex problem of "1-to-n" or even "n-to-n" due to the existence of function inlining. In this paper, we investigate the effect of function inlining on binary similarity analysis. We first construct 4 inlining-oriented datasets for four similarity analysis tasks, including code search, OSS reuse detection, vulnerability detection, and patch presence test. Then, we further study the extent of function inlining, the performance of existing works under function inlining, and the effectiveness of existing inlining-simulation strategies. Results show that the proportion of function inlining can reach nearly 70%, while most existing works neglect it and use "1-to-1" mechanism. The mismatches cause a 30% loss in performance during code search and a 40% loss during vulnerability detection. Moreover, two existing inlining-simulation strategies can only recover 60% of the inlined functions. We discover that inlining is usually cumulative when optimization increases. Conditional inlining and incremental inlining are suggested to design low-cost and high-coverage inlining-simulation strategies.

Submitted 5 May, 2022; v1 submitted 23 December, 2021; originally announced December 2021.

arXiv:2112.08802 [pdf, other]

UNIREX: A Unified Learning Framework for Language Model Rationale Extraction

Authors: Aaron Chan, Maziar Sanjabi, Lambert Mathias, Liang Tan, Shaoliang Nie, Xiaochang Peng, Xiang Ren, Hamed Firooz

Abstract: An extractive rationale explains a language model's (LM's) prediction on a given task instance by highlighting the text inputs that most influenced the prediction. Ideally, rationale extraction should be faithful (reflective of LM's actual behavior) and plausible (convincing to humans), without compromising the LM's (i.e., task model's) task performance. Although attribution algorithms and select-predict pipelines are commonly used in rationale extraction, they both rely on certain heuristics that hinder them from satisfying all three desiderata. In light of this, we propose UNIREX, a flexible learning framework that generalizes rationale extractor optimization as follows: (1) specify architecture for a learned rationale extractor; (2) select explainability objectives (i.e., faithfulness and plausibility criteria); and (3) jointly train the task model and rationale extractor on the task using the selected objectives. UNIREX enables replacing prior works' heuristic design choices with a generic learned rationale extractor in (1) and optimizing it for all three desiderata in (2)-(3). To facilitate comparison between methods with respect to multiple desiderata, we introduce the Normalized Relative Gain (NRG) metric. Across five text classification datasets, our best UNIREX configuration outperforms baselines by an average of 32.9% NRG. Plus, we find that UNIREX-trained rationale extractors can even generalize to unseen datasets and tasks.

Submitted 26 February, 2023; v1 submitted 16 December, 2021; originally announced December 2021.

Comments: ICML 2022

arXiv:2104.00226 [pdf, other]

DF^2AM: Dual-level Feature Fusion and Affinity Modeling for RGB-Infrared Cross-modality Person Re-identification

Authors: Junhui Yin, Zhanyu Ma, Jiyang Xie, Shibo Nie, Kongming Liang, Jun Guo

Abstract: RGB-infrared person re-identification is a challenging task due to the intra-class variations and cross-modality discrepancy. Existing works mainly focus on learning modality-shared global representations by aligning image styles or feature distributions across modalities, while local features from body parts and relationships between person images are largely neglected. In this paper, we propose a Dual-level (i.e., local and global) Feature Fusion (DF^2) module that learns attention for discriminative features in a local-to-global manner. In particular, the attention for a local feature is determined locally, i.e., by applying a learned transformation function on itself. Meanwhile, to further mine the relationships between global features from person images, we propose an Affinities Modeling (AM) module to obtain the optimal intra- and inter-modality image matching. Specifically, AM employs intra-class compactness and inter-class separability in the sample similarities as supervised information to model the affinities between intra- and inter-modality samples. Experimental results show that our proposed method outperforms state-of-the-art methods by large margins on two widely used cross-modality re-ID datasets, SYSU-MM01 and RegDB.

Submitted 31 March, 2021; originally announced April 2021.

arXiv:2101.01881 [pdf, other]

MSD: Saliency-aware Knowledge Distillation for Multimodal Understanding

Authors: Woojeong Jin, Maziar Sanjabi, Shaoliang Nie, Liang Tan, Xiang Ren, Hamed Firooz

Abstract: To reduce a model size but retain performance, we often rely on knowledge distillation (KD) which transfers knowledge from a large "teacher" model to a smaller "student" model. However, KD on multimodal datasets such as vision-language tasks is relatively unexplored, and digesting multimodal information is challenging since different modalities present different types of information. In this paper, we perform a large-scale empirical study to investigate the importance and effects of each modality in knowledge distillation. Furthermore, we introduce a multimodal knowledge distillation framework, modality-specific distillation (MSD), to transfer knowledge from a teacher on multimodal tasks by learning the teacher's behavior within each modality. The idea aims at mimicking a teacher's modality-specific predictions by introducing auxiliary loss terms for each modality. Furthermore, because each modality has different saliency for predictions, we define saliency scores for each modality and investigate saliency-based weighting schemes for the auxiliary losses. We further study a weight learning approach to learn the optimal weights on these loss terms. In our empirical analysis, we examine the saliency of each modality in KD, demonstrate the effectiveness of the weighting scheme in MSD, and show that it achieves better performance than KD on four multimodal datasets.

Submitted 21 October, 2021; v1 submitted 6 January, 2021; originally announced January 2021.

Comments: Accepted to EMNLP 2021 Findings

arXiv:2008.07742 [pdf, other]

UDC 2020 Challenge on Image Restoration of Under-Display Camera: Methods and Results

Authors: Yuqian Zhou, Michael Kwan, Kyle Tolentino, Neil Emerton, Sehoon Lim, Tim Large, Lijiang Fu, Zhihong Pan, Baopu Li, Qirui Yang, Yihao Liu, Jigang Tang, Tao Ku, Shibin Ma, Bingnan Hu, Jiarong Wang, Densen Puthussery, Hrishikesh P S, Melvin Kuriakose, Jiji C V, Varun Sundar, Sumanth Hegde, Divya Kothandaraman, Kaushik Mitra, Akashdeep Jassal , et al. (20 additional authors not shown)

Abstract: This paper is the report of the first Under-Display Camera (UDC) image restoration challenge in conjunction with the RLQ workshop at ECCV 2020. The challenge is based on a newly-collected database of Under-Display Camera. The challenge tracks correspond to two types of display: a 4k Transparent OLED (T-OLED) and a phone Pentile OLED (P-OLED). Along with about 150 teams registered the challenge, eight and nine teams submitted the results during the testing phase for each track. The results in the paper are state-of-the-art restoration performance of Under-Display Camera Restoration. Datasets and paper are available at https://yzhouas.github.io/projects/UDC/udc.html.

Submitted 18 August, 2020; originally announced August 2020.

Comments: 15 pages

arXiv:2003.10270 [pdf, other]

Mobility-aware Beam Steering in Metasurface-based Programmable Wireless Environments

Authors: Christos Liaskos, Shuai Nie, Ageliki Tsioliaridou, Andreas Pitsillides, Sotiris Ioannidis, Ian Akyildiz

Abstract: Programmable wireless environments (PWEs) utilize electromagnetic metasurfaces to transform wireless propagation into a software-controlled resource. In this work we study the effects of user device mobility on the efficiency of PWEs. An analytical model is proposed, which describes the potential misalignment between user-emitted waves and the active PWE configuration, and can constitute the basis for studying queuing problems in PWEs. Subsequently, a novel, beam steering approach is proposed which can effectively mitigate the misalignment effects. Ray-tracing-based simulations evaluate the proposed scheme.

Submitted 23 March, 2020; originally announced March 2020.

Comments: In proceedings of IEEE ICASSP 2020. This work was funded by the European Union via the Horizon 2020: Future Emerging Topics call (FETOPEN-RIA), grant EU736876, project VISORSURF (http://visorsurf.eu)

arXiv:1907.00037 [pdf, other]

3D Channel Modeling and Characterization for Hypersurface Empowered Indoor Environment at 60 GHz Millimeter-Wave Band

Authors: Rashi Mehrotra, Rafay Iqbal Ansari, Alexandros Pitilakis, Shuai Nie, Christos Liaskos, Nikolaos V. Kantartzis, Andreas Pitsillides

Abstract: This paper proposes a three-dimensional (3D) communication channel model for an indoor environment considering the effect of the Hypersurface. The Hypersurface is a software controlled intelligent metasurface, which can be used to manipulate electromagnetic waves, as for example for non-specular reflection and full absorption. Thus it can control the impinging rays from a transmitter towards a receiver location in both LOS and NLOS paths, e.g. to combat distance and improve wireless connectivity. We focus on the 60 GHz mmWave frequency band due to its increasing significance in 5G/6G networks and evaluate the effect of Hypersurface in an indoor environment in terms of attenuation coefficients related to the Hypersurface reflection and absorption functionalities, using CST simulation, a 3D electromagnetic simulator of high frequency components. To highlight the benefits of Hypersurface coated walls versus plain walls, we use the derived Hypersurface 3D channel model and a custom 3D ray-tracing simulator for plain walls considering a typical indoor scenario for different Tx-Rx location and separation distances.

Submitted 28 June, 2019; originally announced July 2019.

Comments: Accepted

arXiv:1905.02495 [pdf, other]

An Interpretable Neural Network for Configuring Programmable Wireless Environments

Authors: Christos Liaskos, Ageliki Tsioliaridou, Shuai Nie, Andreas Pitsillides, Sotiris Ioannidis, Ian Akyildiz

Abstract: Software-defined metasurfaces (SDMs) comprise a dense topology of basic elements called meta-atoms, exerting the highest degree of control over surface currents among intelligent panel technologies. As such, they can transform impinging electromagnetic (EM) waves in complex ways, modifying their direction, power, frequency spectrum, polarity and phase. A well-defined software interface allows for applying such functionalities to waves and inter-networking SDMs, while abstracting the underlying physics. A network of SDMs deployed over objects within an area, such as a floorplan walls, creates programmable wireless environments (PWEs) with fully customizable propagation of waves within them. This work studies the use of machine learning for configuring such environments to the benefit of users within. The methodology consists of modeling wireless propagation as a custom, interpretable, back-propagating neural network, with SDM elements as nodes and their cross-interactions as links. Following a training period the network learns the propagation basics of SDMs and configures them to facilitate the communication of users within their vicinity.

Submitted 7 May, 2019; originally announced May 2019.

Comments: In proceedings of IEEE SPAWC 2019 - Special Session on Signal Processing Advances for Emerging Transceiver Hardware. This work was funded by the European Union via the Horizon 2020: Future Emerging Topics call (FETOPEN), grant EU736876, project VISORSURF (http://www.visorsurf.eu)

arXiv:1812.11429 [pdf, other]

doi 10.1109/TNET.2019.2925658

Modeling, Simulating and Configuring Programmable Wireless Environments for Multi-User Multi-Objective Networking

Authors: Christos Liaskos, Ageliki Tsioliaridou, Shuai Nie, Andreas Pitsillides, Sotiris Ioannidis, Ian Akyildiz

Abstract: Programmable wireless environments enable the software-defined propagation of waves within them, yielding exceptional performance potential. Several building-block technologies have been implemented and evaluated at the physical layer. The present work contributes a network-layer scheme to configure such environments for multiple users and objectives, and for any physical-layer technology. Supported objectives include any combination of Quality of Service and power transfer optimization, eavesdropping and Doppler effect mitigation, in multi-cast or uni-cast settings. Additionally, a graph-based model of programmable environments is proposed, which incorporates core physical observations and efficiently separates physical and networking concerns. Evaluation takes place in a specially developed, free simulation tool, and in a variety of environments. Performance gains over regular propagation are highlighted, reaching important insights on the user capacity of programmable environments.

Submitted 29 December, 2018; originally announced December 2018.

Comments: This work is part of project VISORSURF: A HyperVisor for Metasurface Functionalities (www.visorsurf.eu). Funded by the European Union Horizon 2020, under the Future Emerging Technologies - Research and Innovation Actions call (Grant Agreement EU 736876)

Report number: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=90

Journal ref: IEEE/ACM Transactions on Networking, volume: 27 , issue: 4 , pp. 1696-1713, Aug. 2019

arXiv:1812.07096 [pdf, other]

doi 10.1016/j.adhoc.2018.11.001

A Novel Communication Paradigm for High Capacity and Security via Programmable Indoor Wireless Environments in Next Generation Wireless Systems

Authors: Christos Liaskos, Shuai Nie, Ageliki Tsioliaridou, Andreas Pitsillides, Sotiris Ioannidis, Ian Akyildiz

Abstract: Wireless communication environments comprise passive objects that cause performance degradation and eavesdropping concerns due to anomalous scattering. This paper proposes a new paradigm, where scattering becomes software-defined and, subsequently, optimizable across wide frequency ranges. Through the proposed programmable wireless environments, the path loss, multi-path fading and interference effects can be controlled and mitigated. Moreover, the eavesdropping can be prevented via novel physical layer security capabilities. The core technology of this new paradigm is the concept of metasurfaces, which are planar intelligent structures whose effects on impinging electromagnetic waves are fully defined by their micro-structure. Their control over impinging waves has been demonstrated to span from 1 GHz to 10 THz. This paper contributes the software-programmable wireless environment, consisting of several HyperSurface tiles (programmable metasurfaces) controlled by a central server. HyperSurfaces are a novel class of metasurfaces whose structure and, hence, electromagnetic behavior can be altered and controlled via a software interface. Multiple networked tiles coat indoor objects, allowing fine-grained, customizable reflection, absorption or polarization overall. A central server calculates and deploys the optimal electromagnetic interaction per tile, to the benefit of communicating devices. Realistic simulations using full 3D ray-tracing demonstrate the groundbreaking performance and security potential of the proposed approach in 2.4 GHz and 60 GHz frequencies.

Submitted 8 November, 2018; originally announced December 2018.

Comments: This work was partially funded by the European Union via the Horizon 2020: Future Emerging Topics call (FETOPEN), grant EU736876, project VISORSURF. admin note: significant overlap with arXiv:1805.06677

arXiv:1811.00883 [pdf, other]

Deep Segment Attentive Embedding for Duration Robust Speaker Verification

Authors: Bin Liu, Shuai Nie, Yaping Zhang, Shan Liang, Wenju Liu

Abstract: LSTM-based speaker verification usually uses a fixed-length local segment randomly truncated from an utterance to learn the utterance-level speaker embedding, while using the average embedding of all segments of a test utterance to verify the speaker, which results in a critical mismatch between testing and training. This mismatch degrades the performance of speaker verification, especially when the durations of training and testing utterances are very different. To alleviate this issue, we propose the deep segment attentive embedding method to learn the unified speaker embeddings for utterances of variable duration. Each utterance is segmented by a sliding window and LSTM is used to extract the embedding of each segment. Instead of only using one local segment, we use the whole utterance to learn the utterance-level embedding by applying an attentive pooling to the embeddings of all segments. Moreover, the similarity loss of segment-level embeddings is introduced to guide the segment attention to focus on the segments with more speaker discriminations, and jointly optimized with the similarity loss of utterance-level embeddings. Systematic experiments on Tongdun and VoxCeleb show that the proposed method significantly improves robustness of duration variant and achieves the relative Equal Error Rate reduction of 50% and 11.54%, respectively.

Submitted 31 October, 2018; originally announced November 2018.

arXiv:1807.07301 [pdf]

doi 10.3390/s18103307

A Swarming Approach to Optimize the One-hop Delay in Smart Driving Inter-platoon Communications

Authors: Qiong Wu, Shuzhen Nie, Pingyi Fan, Hanxu Liu, Fan Qiang, Zhengquan Li

Abstract: In this paper, we propose a swarming approach and optimize the one-hop delay for interplatoon communications through adjusting the minimum contention window size of each backbone vehicle in two steps. In the first step, we first set a small enough average one-hop delay as the initial optimization goal and then propose a swarming approach to find a minimum average one-hop delay for inter-platoon communications through adjusting the minimum contention window of each backbone vehicle iteratively. In the second step, we first set the minimum average one-hop delay found in the first step as the initial optimization goal and then adopt the swarming approach again to get the one-hop delay of each backbone vehicle balance to the minimum average one-hop delay. The optimal minimum contention window sizes that get the one-hop delay of each backbone vehicle balance to the minimum average one-hop delay are obtained after the second step. The simulation results indicate that the one-hop delay is optimized and the other performance metrics including end-to-end delay, one-hop throughput and transmission probability are presented by using the optimal minimum contention window sizes.

Submitted 2 November, 2020; v1 submitted 19 July, 2018; originally announced July 2018.

Comments: published by sensors. Simulation codes are available online at https://codeocean.com/2018/06/28/code-for-colon-a-swarming-approach-to

Journal ref: Sensors 2018, 18(10), 3307

arXiv:1806.01792 [pdf, other]

A New Wireless Communication Paradigm through Software-controlled Metasurfaces

Authors: Christos Liaskos, Shuai Nie, Ageliki Tsioliaridou, Andreas Pitsillides, Sotiris Ioannidis, Ian Akyildiz

Abstract: Electromagnetic waves undergo multiple uncontrollable alterations as they propagate within a wireless environment. Free space path loss, signal absorption, as well as reflections, refractions and diffractions caused by physical objects within the environment highly affect the performance of wireless communications. Currently, such effects are intractable to account for and are treated as probabilistic factors. The paper proposes a radically different approach, enabling deterministic, programmable control over the behavior of the wireless environments. The key-enabler is the so-called HyperSurface tile, a novel class of planar meta-materials which can interact with impinging electromagnetic waves in a controlled manner. The HyperSurface tiles can effectively re-engineer electromagnetic waves, including steering towards any desired direction, full absorption, polarization manipulation and more. Multiple tiles are employed to coat objects such as walls, furniture, overall, any objects in the indoor and outdoor environments. An external software service calculates and deploys the optimal interaction types per tile, to best fit the needs of communicating devices. Evaluation via simulations highlights the potential of the new concept.

Submitted 4 June, 2018; originally announced June 2018.

Comments: Paper accepted for publication at the IEEE Communications Magazine. This work was funded by the European Union via the Horizon 2020: Future Emerging Topics call (FETOPEN-RIA), grant EU736876, project VISORSURF: HyperSurfaces-A Hardware Platform for Software-driven Functional Metasurfaces (http://www.visorsurf.eu/)

arXiv:1805.06677 [pdf, other]

Realizing Wireless Communication through Software-defined HyperSurface Environments

Authors: Christos Liaskos, Shuai Nie, Ageliki Tsioliaridou, Andreas Pitsillides, Sotiris Ioannidis, Ian Akyildiz

Abstract: Wireless communication environments are unaware of the ongoing data exchange efforts within them. Moreover, their effect on the communication quality is intractable in all but the simplest cases. The present work proposes a new paradigm, where indoor scattering becomes software-defined and, subsequently, optimizable across wide frequency ranges. Moreover, the controlled scattering can surpass natural behavior, exemplary overriding Snell's law, reflecting waves towards any custom angle (including negative ones). Thus, path loss and multi-path fading effects can be controlled and mitigated. The core technology of this new paradigm are metasurfaces, planar artificial structures whose effect on impinging electromagnetic waves is fully defined by their macro-structure. The present study contributes the software-programmable wireless environment model, consisting of several HyperSurface tiles controlled by a central, environment configuration server. HyperSurfaces are a novel class of metasurfaces whose structure and, hence, electromagnetic behavior can be altered and controlled via a software interface. Multiple networked tiles coat indoor objects, allowing fine-grained, customizable reflection, absorption or polarization overall. A central server calculates and deploys the optimal electromagnetic interaction per tile, to the benefit of communicating devices. Realistic simulations using full 3D ray-tracing demonstrate the groundbreaking potential of the proposed approach in 2.4 GHz and 60 GHz frequencies.

Submitted 17 May, 2018; originally announced May 2018.

Comments: This paper appears at the 19TH IEEE WOWMOM 2018, JUNE 12-15, 2018. (Technical program: http://it.murdoch.edu.au/wowmom2018/technical_program.html) This work was funded by the European Union via the Horizon 2020: Future Emerging Topics call (FETOPEN-RIA), grant EU736876, project VISORSURF (http://www.visorsurf.eu) : HyperSurfaces-A Hardware Platform for Software-driven Functional Metasurfaces

arXiv:1805.01357 [pdf, ps, other]

Boosting Noise Robustness of Acoustic Model via Deep Adversarial Training

Authors: Bin Liu, Shuai Nie, Yaping Zhang, Dengfeng Ke, Shan Liang, Wenju Liu

Abstract: In realistic environments, speech is usually interfered by various noise and reverberation, which dramatically degrades the performance of automatic speech recognition (ASR) systems. To alleviate this issue, the commonest way is to use a well-designed speech enhancement approach as the front-end of ASR. However, more complex pipelines, more computations and even higher hardware costs (microphone array) are additionally consumed for this kind of methods. In addition, speech enhancement would result in speech distortions and mismatches to training. In this paper, we propose an adversarial training method to directly boost noise robustness of acoustic model. Specifically, a jointly compositional scheme of generative adversarial net (GAN) and neural network-based acoustic model (AM) is used in the training phase. GAN is used to generate clean feature representations from noisy features by the guidance of a discriminator that tries to distinguish between the true clean signals and generated signals. The joint optimization of generator, discriminator and AM concentrates the strengths of both GAN and AM for speech recognition. Systematic experiments on CHiME-4 show that the proposed method significantly improves the noise robustness of AM and achieves the average relative error rate reduction of 23.38% and 11.54% on the development and test set, respectively.

Submitted 2 May, 2018; originally announced May 2018.

arXiv:1803.00757 [pdf, other]

Gesture-based Piloting of an Aerial Robot using Monocular Vision

Authors: Ting Sun, Shengyi Nie, Dit-Yan Yeung, Shaojie Shen

Abstract: Aerial robots are becoming popular among general public, and with the development of artificial intelligence (AI), there is a trend to equip aerial robots with a natural user interface (NUI). Hand/arm gestures are an intuitive way to communicate for humans, and various research works have focused on controlling an aerial robot with natural gestures. However, the techniques in this area are still far from mature. Many issues in this area have been poorly addressed, such as the principles of choosing gestures from the design point of view, hardware requirements from an economic point of view, considerations of data availability, and algorithm complexity from a practical perspective. Our work focuses on building an economical monocular system particularly designed for gesture-based piloting of an aerial robot. Natural arm gestures are mapped to rich target directions and convenient fine adjustment is achieved. Practical piloting scenarios, hardware cost and algorithm applicability are jointly considered in our system design. The entire system is successfully implemented in an aerial robot and various properties of the system are tested.

Submitted 2 March, 2018; originally announced March 2018.

arXiv:1801.07632 [pdf, other]

High Resolution Face Completion with Multiple Controllable Attributes via Fully End-to-End Progressive Generative Adversarial Networks

Authors: Zeyuan Chen, Shaoliang Nie, Tianfu Wu, Christopher G. Healey

Abstract: We present a deep learning approach for high resolution face completion with multiple controllable attributes (e.g., male and smiling) under arbitrary masks. Face completion entails understanding both structural meaningfulness and appearance consistency locally and globally to fill in "holes" whose content do not appear elsewhere in an input image. It is a challenging task with the difficulty level increasing significantly with respect to high resolution, the complexity of "holes" and the controllable attributes of filled-in fragments. Our system addresses the challenges by learning a fully end-to-end framework that trains generative adversarial networks (GANs) progressively from low resolution to high resolution with conditional vectors encoding controllable attributes. We design novel network architectures to exploit information across multiple scales effectively and efficiently. We introduce new loss functions encouraging sharp completion. We show that our system can complete faces with large structural and appearance variations using a single feed-forward pass of computation with mean inference time of 0.007 seconds for images at 1024 x 1024 resolution. We also perform a pilot human study that shows our approach outperforms state-of-the-art face completion methods in terms of rank analysis. The code will be released upon publication.

Submitted 23 January, 2018; originally announced January 2018.

arXiv:1710.07831 [pdf, other]

doi 10.1016/j.cviu.2014.12.005

A Generative Restricted Boltzmann Machine Based Method for High-Dimensional Motion Data Modeling

Authors: Siqi Nie, Ziheng Wang, Qiang Ji

Abstract: Many computer vision applications involve modeling complex spatio-temporal patterns in high-dimensional motion data. Recently, restricted Boltzmann machines (RBMs) have been widely used to capture and represent spatial patterns in a single image or temporal patterns in several time slices. To model global dynamics and local spatial interactions, we propose to theoretically extend the conventional RBMs by introducing another term in the energy function to explicitly model the local spatial interactions in the input data. A learning method is then proposed to perform efficient learning for the proposed model. We further introduce a new method for multi-class classification that can effectively estimate the infeasible partition functions of different RBMs such that RBM is treated as a generative model for classification purpose. The improved RBM model is evaluated on two computer vision applications: facial expression recognition and human action recognition. Experimental results on benchmark databases demonstrate the effectiveness of the proposed algorithm.

Submitted 21 October, 2017; originally announced October 2017.

Journal ref: Computer Vision and Image Understanding 136 (2015): 14-22

arXiv:1710.04809 [pdf, other]

Deep Regression Bayesian Network and Its Applications

Authors: Siqi Nie, Meng Zheng, Qiang Ji

Abstract: Deep directed generative models have attracted much attention recently due to their generative modeling nature and powerful data representation ability. In this paper, we review different structures of deep directed generative models and the learning and inference algorithms associated with the structures. We focus on a specific structure that consists of layers of Bayesian Networks due to the property of capturing inherent and rich dependencies among latent variables. The major difficulty of learning and inference with deep directed models with many latent variables is the intractable inference due to the dependencies among the latent variables and the exponential number of latent variable configurations. Current solutions use variational methods often through an auxiliary network to approximate the posterior probability inference. In contrast, inference can also be performed directly without using any auxiliary network to maximally preserve the dependencies among the latent variables. Specifically, by exploiting the sparse representation with the latent space, max-max instead of max-sum operation can be used to overcome the exponential number of latent configurations. Furthermore, the max-max operation and augmented coordinate ascent are applied to both supervised and unsupervised learning as well as to various inference. Quantitative evaluations on benchmark datasets of different models are given for both data representation and feature learning tasks.

Submitted 13 October, 2017; originally announced October 2017.

Comments: Accepted to IEEE Signal Processing Magazine

arXiv:1610.07090 [pdf, other]

STEPS: Predicting place attributes via spatio-temporal analysis

Authors: Shuxin Nie, Abhimanyu Das, Evgeniy Gabrilovich, Wei-Lwun Lu, Boris Mazniker, Chris Schilling

Abstract: In recent years, a vast amount of research has been conducted on learning people's interests from their actions. Yet their collective actions also allow us to learn something about the world, in particular, infer attributes of places people visit or interact with. Imagine classifying whether a hotel has a gym or a swimming pool, or whether a restaurant has a romantic atmosphere without ever asking its patrons. Algorithms we present can do just that. Many web applications rely on knowing attributes of places, for instance, whether a particular restaurant has WiFi or offers outdoor seating. Such data can be used to support a range of user experiences, from explicit query-driven search to personalized place recommendations. However, obtaining these attributes is generally difficult, with existing approaches relying on crowdsourcing or parsing online reviews, both of which are noisy, biased, and have limited coverage. Here we present a novel approach to classifying place attributes, which learns from patrons' visit patterns based on anonymous observational data. Our method, STEPS, learns from aggregated sequences of place visits. For example, if many people visit the restaurant on a Saturday evening, coming from a luxury hotel or theater, and stay for a long time, then this restaurant is more likely to have a romantic atmosphere. On the other hand, if most people visit the restaurant on weekdays, coming from work or a grocery store, then the restaurant is less likely to be romantic. We show that such transition features are highly predictive of place attributes. In an extensive empirical evaluation, STEPS nearly doubled the coverage of a state of the art approach thanks to learning from observational location data, which allowed our method to reason about many more places.

Submitted 22 October, 2016; originally announced October 2016.

arXiv:1506.04720 [pdf, other]

Latent Regression Bayesian Network for Data Representation

Authors: Siqi Nie, Qiang Ji

Abstract: Deep directed generative models have attracted much attention recently due to their expressive representation power and the ability of ancestral sampling. One major difficulty of learning directed models with many latent variables is the intractable inference. To address this problem, most existing algorithms make assumptions to render the latent variables independent of each other, either by designing specific priors, or by approximating the true posterior using a factorized distribution. We believe the correlations among latent variables are crucial for faithful data representation. Driven by this idea, we propose an inference method based on the conditional pseudo-likelihood that preserves the dependencies among the latent variables. For learning, we propose to employ the hard Expectation Maximization (EM) algorithm, which avoids the intractability of the traditional EM by max-out instead of sum-out to compute the data likelihood. Qualitative and quantitative evaluations of our model against state of the art deep models on benchmark datasets demonstrate the effectiveness of the proposed algorithm in data representation and reconstruction.

Submitted 15 June, 2015; originally announced June 2015.

arXiv:1406.1411 [pdf, other]

Advances in Learning Bayesian Networks of Bounded Treewidth

Authors: Siqi Nie, Denis Deratani Maua, Cassio Polpo de Campos, Qiang Ji

Abstract: This work presents novel algorithms for learning Bayesian network structures with bounded treewidth. Both exact and approximate methods are developed. The exact method combines mixed-integer linear programming formulations for structure learning and treewidth computation. The approximate method consists in uniformly sampling $k$-trees (maximal graphs of treewidth $k$), and subsequently selecting, exactly or approximately, the best structure whose moral graph is a subgraph of that $k$-tree. Some properties of these methods are discussed and proven. The approaches are empirically compared to each other and to a state-of-the-art method for learning bounded treewidth structures on a collection of public data sets with up to 100 variables. The experiments show that our exact algorithm outperforms the state of the art, and that the approximate approach is fairly accurate.

Submitted 6 June, 2014; v1 submitted 5 June, 2014; originally announced June 2014.

Comments: 23 pages, 2 figures, 3 tables

MSC Class: 68T37

Showing 1–49 of 49 results for author: Nie, S