Search | arXiv e-print repository

MultiZoo & MultiBench: A Standardized Toolkit for Multimodal Deep Learning

Authors: Paul Pu Liang, Yiwei Lyu, Xiang Fan, Arav Agarwal, Yun Cheng, Louis-Philippe Morency, Ruslan Salakhutdinov

Abstract: Learning multimodal representations involves integrating information from multiple heterogeneous sources of data. In order to accelerate progress towards understudied modalities and tasks while ensuring real-world robustness, we release MultiZoo, a public toolkit consisting of standardized implementations of > 20 core multimodal algorithms and MultiBench, a large-scale benchmark spanning 15 datase… ▽ More Learning multimodal representations involves integrating information from multiple heterogeneous sources of data. In order to accelerate progress towards understudied modalities and tasks while ensuring real-world robustness, we release MultiZoo, a public toolkit consisting of standardized implementations of > 20 core multimodal algorithms and MultiBench, a large-scale benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas. Together, these provide an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation. To enable holistic evaluation, we offer a comprehensive methodology to assess (1) generalization, (2) time and space complexity, and (3) modality robustness. MultiBench paves the way towards a better understanding of the capabilities and limitations of multimodal models, while ensuring ease of use, accessibility, and reproducibility. Our toolkits are publicly available, will be regularly updated, and welcome inputs from the community. △ Less

Submitted 28 June, 2023; originally announced June 2023.

Comments: JMLR Open Source Software 2023, Code available at https://github.com/pliang279/MultiBench

arXiv:2306.15700 [pdf, other]

Imitation with Spatial-Temporal Heatmap: 2nd Place Solution for NuPlan Challenge

Authors: Yihan Hu, Kun Li, Pingyuan Liang, Jingyu Qian, Zhening Yang, Haichao Zhang, Wenxin Shao, Zhuangzhuang Ding, Wei Xu, Qiang Liu

Abstract: This paper presents our 2nd place solution for the NuPlan Challenge 2023. Autonomous driving in real-world scenarios is highly complex and uncertain. Achieving safe planning in the complex multimodal scenarios is a highly challenging task. Our approach, Imitation with Spatial-Temporal Heatmap, adopts the learning form of behavior cloning, innovatively predicts the future multimodal states with a h… ▽ More This paper presents our 2nd place solution for the NuPlan Challenge 2023. Autonomous driving in real-world scenarios is highly complex and uncertain. Achieving safe planning in the complex multimodal scenarios is a highly challenging task. Our approach, Imitation with Spatial-Temporal Heatmap, adopts the learning form of behavior cloning, innovatively predicts the future multimodal states with a heatmap representation, and uses trajectory refinement techniques to ensure final safety. The experiment shows that our method effectively balances the vehicle's progress and safety, generating safe and comfortable trajectories. In the NuPlan competition, we achieved the second highest overall score, while obtained the best scores in the ego progress and comfort metrics. △ Less

Submitted 26 June, 2023; originally announced June 2023.

arXiv:2306.10599 [pdf, ps, other]

An Empirical Study of Untangling Patterns of Two-Class Dependency Cycles

Authors: Qiong Feng, Shuwen Liu, Huan Ji, Xiaotian Ma, Peng Liang

Abstract: Dependency cycles pose a significant challenge to software quality and maintainability. However, there is limited understanding of how practitioners resolve dependency cycles in real-world scenarios. This paper presents an empirical study investigating the recurring patterns employed by software developers to resolve dependency cycles between two classes in practice. We analyzed the data from 38 o… ▽ More Dependency cycles pose a significant challenge to software quality and maintainability. However, there is limited understanding of how practitioners resolve dependency cycles in real-world scenarios. This paper presents an empirical study investigating the recurring patterns employed by software developers to resolve dependency cycles between two classes in practice. We analyzed the data from 38 open-source projects across different domains and manually inspected hundreds of cycle untangling cases. Our findings reveal that developers tend to employ five recurring patterns to address dependency cycles. The chosen patterns are not only determined by dependency relations between cyclic classes, but also highly related to their design context, i.e., how cyclic classes depend on or are depended by their neighbor classes. Through this empirical study, we also discovered three common counterintuitive solutions developers usually adopted during cycles' handling. These recurring patterns and common counterintuitive solutions observed in dependency cycles' practice can serve as a taxonomy to improve developers' awareness and also be used as learning materials for students in software engineering and inexperienced developers. Our results also suggest that, in addition to considering the internal structure of dependency cycles, automatic tools need to consider the design context of cycles to provide better support for refactoring dependency cycles. △ Less

Submitted 17 December, 2023; v1 submitted 18 June, 2023; originally announced June 2023.

Comments: Preprint accepted for publication in Empirical Software Engineering, 2023

arXiv:2306.10015 [pdf, other]

Just One Byte (per gradient): A Note on Low-Bandwidth Decentralized Language Model Finetuning Using Shared Randomness

Authors: Eric Zelikman, Qian Huang, Percy Liang, Nick Haber, Noah D. Goodman

Abstract: Language model training in distributed settings is limited by the communication cost of gradient exchanges. In this short note, we extend recent work from Malladi et al. (2023), using shared randomness to perform distributed fine-tuning with low bandwidth. The method is a natural decentralized extension of memory-efficient Simultaneous Perturbation Stochastic Approximation (SPSA). Each iteration,… ▽ More Language model training in distributed settings is limited by the communication cost of gradient exchanges. In this short note, we extend recent work from Malladi et al. (2023), using shared randomness to perform distributed fine-tuning with low bandwidth. The method is a natural decentralized extension of memory-efficient Simultaneous Perturbation Stochastic Approximation (SPSA). Each iteration, each machine seeds a Random Number Generator (RNG) to perform local reproducible perturbations on model weights and calculate and exchange scalar projected gradients, which are then used to update each model. By using a (machine, sample) identifier as the random seed, each model can regenerate one another's perturbations. As machines only exchange single-byte projected gradients, this is highly communication efficient. There are also potential privacy benefits, as projected gradients may be calculated on different training data, and models never access the other's data. Our approach not only drastically reduces communication bandwidth requirements but also accommodates dynamic addition or removal of machines during the training process and retains the memory-efficient and inference-only advantages of recent work. We perform proof-of-concept experiments to demonstrate the potential usefulness of this method, building off of rich literature on distributed optimization and memory-efficient training. △ Less

Submitted 16 June, 2023; originally announced June 2023.

arXiv:2306.08620 [pdf, other]

Anticipatory Music Transformer

Authors: John Thickstun, David Hall, Chris Donahue, Percy Liang

Abstract: We introduce anticipation: a method for constructing a controllable generative model of a temporal point process (the event process) conditioned asynchronously on realizations of a second, correlated process (the control process). We achieve this by interleaving sequences of events and controls, such that controls appear following stopping times in the event sequence. This work is motivated by pro… ▽ More We introduce anticipation: a method for constructing a controllable generative model of a temporal point process (the event process) conditioned asynchronously on realizations of a second, correlated process (the control process). We achieve this by interleaving sequences of events and controls, such that controls appear following stopping times in the event sequence. This work is motivated by problems arising in the control of symbolic music generation. We focus on infilling control tasks, whereby the controls are a subset of the events themselves, and conditional generation completes a sequence of events given the fixed control events. We train anticipatory infilling models using the large and diverse Lakh MIDI music dataset. These models match the performance of autoregressive models for prompted music generation, with the additional capability to perform infilling control tasks, including accompaniment. Human evaluators report that an anticipatory model produces accompaniments with similar musicality to even music composed by humans over a 20-second clip. △ Less

Submitted 25 July, 2024; v1 submitted 14 June, 2023; originally announced June 2023.

Comments: TMLR accepted version

arXiv:2306.08616 [pdf, other]

Towards Automated Identification of Violation Symptoms of Architecture Erosion

Authors: Ruiyin Li, Peng Liang, Paris Avgeriou

Abstract: Architecture erosion has a detrimental effect on maintenance and evolution, as the implementation drifts away from the intended architecture. To prevent this, development teams need to understand early enough the symptoms of erosion, and particularly violations of the intended architecture. One way to achieve this, is through the automated identification of architecture violations from textual art… ▽ More Architecture erosion has a detrimental effect on maintenance and evolution, as the implementation drifts away from the intended architecture. To prevent this, development teams need to understand early enough the symptoms of erosion, and particularly violations of the intended architecture. One way to achieve this, is through the automated identification of architecture violations from textual artifacts, and particularly code reviews. In this paper, we developed 15 machine learning-based and 4 deep learning-based classifiers with three pre-trained word embeddings to identify violation symptoms of architecture erosion from developer discussions in code reviews. Specifically, we looked at code review comments from four large open-source projects from the OpenStack (Nova and Neutron) and Qt (Qt Base and Qt Creator) communities. We then conducted a survey and semi-structured interviews to acquire feedback from the involved participants who discussed architecture violations in code reviews, to validate the usefulness of our trained classifiers. The results show that the SVM classifier based on word2vec pre-trained word embedding performs the best with an F1-score of 0.779. In most cases, classifiers with the fastText pre-trained word embedding model can achieve relatively good performance. Furthermore, 200-dimensional pre-trained word embedding models outperform classifiers that use 100 and 300-dimensional models. In addition, an ensemble classifier based on the majority voting strategy can further enhance the classifier and outperforms the individual classifiers. Finally, the findings derived from the online survey and interviews conducted with the involved developers reveal that the violation symptoms identified by our approaches have practical value and can provide early warnings for impending architecture erosion. △ Less

Submitted 29 April, 2024; v1 submitted 14 June, 2023; originally announced June 2023.

Comments: 21 pages, 4 images, 7 tables, Revision submitted to TSE (2024)

arXiv:2306.07557 [pdf, other]

Ethical Aspects of ChatGPT in Software Engineering Research

Authors: Muhammad Azeem Akbar, Arif Ali Khan, Peng Liang

Abstract: ChatGPT can improve Software Engineering (SE) research practices by offering efficient, accessible information analysis and synthesis based on natural language interactions. However, ChatGPT could bring ethical challenges, encompassing plagiarism, privacy, data security, and the risk of generating biased or potentially detrimental data. This research aims to fill the given gap by elaborating on th… ▽ More ChatGPT can improve Software Engineering (SE) research practices by offering efficient, accessible information analysis and synthesis based on natural language interactions. However, ChatGPT could bring ethical challenges, encompassing plagiarism, privacy, data security, and the risk of generating biased or potentially detrimental data. This research aims to fill the given gap by elaborating on the key elements: motivators, demotivators, and ethical principles of using ChatGPT in SE research. To achieve this objective, we conducted a literature survey, identified the mentioned elements, and presented their relationships by developing a taxonomy. Further, the identified literature-based elements (motivators, demotivators, and ethical principles) were empirically evaluated by conducting a comprehensive questionnaire-based survey involving SE researchers. Additionally, we employed Interpretive Structure Modeling (ISM) approach to analyze the relationships between the ethical principles of using ChatGPT in SE research and develop a level based decision model. We further conducted a Cross-Impact Matrix Multiplication Applied to Classification (MICMAC) analysis to create a cluster-based decision model. These models aim to help SE researchers devise effective strategies for ethically integrating ChatGPT into SE research by following the identified principles through adopting the motivators and addressing the demotivators. The findings of this study will establish a benchmark for incorporating ChatGPT services in SE research with an emphasis on ethical considerations. △ Less

Submitted 13 August, 2023; v1 submitted 13 June, 2023; originally announced June 2023.

Comments: Preprint accepted for publication in IEEE Transactions on Artificial Intelligence, 2023

arXiv:2306.05268 [pdf, other]

Factorized Contrastive Learning: Going Beyond Multi-view Redundancy

Authors: Paul Pu Liang, Zihao Deng, Martin Ma, James Zou, Louis-Philippe Morency, Ruslan Salakhutdinov

Abstract: In a wide range of multimodal tasks, contrastive learning has become a particularly appealing approach since it can successfully learn representations from abundant unlabeled data with only pairing information (e.g., image-caption or video-audio pairs). Underpinning these approaches is the assumption of multi-view redundancy - that shared information between modalities is necessary and sufficient… ▽ More In a wide range of multimodal tasks, contrastive learning has become a particularly appealing approach since it can successfully learn representations from abundant unlabeled data with only pairing information (e.g., image-caption or video-audio pairs). Underpinning these approaches is the assumption of multi-view redundancy - that shared information between modalities is necessary and sufficient for downstream tasks. However, in many real-world settings, task-relevant information is also contained in modality-unique regions: information that is only present in one modality but still relevant to the task. How can we learn self-supervised multimodal representations to capture both shared and unique information relevant to downstream tasks? This paper proposes FactorCL, a new multimodal representation learning method to go beyond multi-view redundancy. FactorCL is built from three new contributions: (1) factorizing task-relevant information into shared and unique representations, (2) capturing task-relevant information via maximizing MI lower bounds and removing task-irrelevant information via minimizing MI upper bounds, and (3) multimodal data augmentations to approximate task relevance without labels. On large-scale real-world datasets, FactorCL captures both shared and unique information and achieves state-of-the-art results on six benchmarks △ Less

Submitted 30 October, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

Comments: NeurIPS 2023. Code available at: https://github.com/pliang279/FactorCL

arXiv:2306.04597 [pdf, other]

Language Models Get a Gender Makeover: Mitigating Gender Bias with Few-Shot Data Interventions

Authors: Himanshu Thakur, Atishay Jain, Praneetha Vaddamanu, Paul Pu Liang, Louis-Philippe Morency

Abstract: Societal biases present in pre-trained large language models are a critical issue as these models have been shown to propagate biases in countless downstream applications, rendering them unfair towards specific groups of people. Since large-scale retraining of these models from scratch is both time and compute-expensive, a variety of approaches have been previously proposed that de-bias a pre-trai… ▽ More Societal biases present in pre-trained large language models are a critical issue as these models have been shown to propagate biases in countless downstream applications, rendering them unfair towards specific groups of people. Since large-scale retraining of these models from scratch is both time and compute-expensive, a variety of approaches have been previously proposed that de-bias a pre-trained model. While the majority of current state-of-the-art debiasing methods focus on changes to the training regime, in this paper, we propose data intervention strategies as a powerful yet simple technique to reduce gender bias in pre-trained models. Specifically, we empirically show that by fine-tuning a pre-trained model on only 10 de-biased (intervened) training examples, the tendency to favor any gender is significantly reduced. Since our proposed method only needs a few training examples, our few-shot debiasing approach is highly feasible and practical. Through extensive experimentation, we show that our debiasing technique performs better than competitive state-of-the-art baselines with minimal loss in language modeling ability. △ Less

Submitted 7 June, 2023; originally announced June 2023.

Comments: Accepted to ACL 2023 Main Conference

arXiv:2306.04539 [pdf, other]

Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications

Authors: Paul Pu Liang, Chun Kai Ling, Yun Cheng, Alex Obolenskiy, Yudong Liu, Rohan Pandey, Alex Wilf, Louis-Philippe Morency, Ruslan Salakhutdinov

Abstract: In many machine learning systems that jointly learn from multiple modalities, a core research question is to understand the nature of multimodal interactions: how modalities combine to provide new task-relevant information that was not present in either alone. We study this challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data and naturally co-occurri… ▽ More In many machine learning systems that jointly learn from multiple modalities, a core research question is to understand the nature of multimodal interactions: how modalities combine to provide new task-relevant information that was not present in either alone. We study this challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data and naturally co-occurring multimodal data (e.g., unlabeled images and captions, video and corresponding audio) but when labeling them is time-consuming. Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds to quantify the amount of multimodal interactions in this semi-supervised setting. We propose two lower bounds: one based on the shared information between modalities and the other based on disagreement between separately trained unimodal classifiers, and derive an upper bound through connections to approximate algorithms for min-entropy couplings. We validate these estimated bounds and show how they accurately track true interactions. Finally, we show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks. △ Less

Submitted 13 June, 2024; v1 submitted 7 June, 2023; originally announced June 2023.

Comments: ICLR 2024, Code available at: https://github.com/pliang279/PID

arXiv:2306.04125 [pdf, other]

Multimodal Fusion Interactions: A Study of Human and Automatic Quantification

Authors: Paul Pu Liang, Yun Cheng, Ruslan Salakhutdinov, Louis-Philippe Morency

Abstract: In order to perform multimodal fusion of heterogeneous signals, we need to understand their interactions: how each modality individually provides information useful for a task and how this information changes in the presence of other modalities. In this paper, we perform a comparative study of how humans annotate two categorizations of multimodal interactions: (1) partial labels, where different a… ▽ More In order to perform multimodal fusion of heterogeneous signals, we need to understand their interactions: how each modality individually provides information useful for a task and how this information changes in the presence of other modalities. In this paper, we perform a comparative study of how humans annotate two categorizations of multimodal interactions: (1) partial labels, where different annotators annotate the label given the first, second, and both modalities, and (2) counterfactual labels, where the same annotator annotates the label given the first modality before asking them to explicitly reason about how their answer changes when given the second. We further propose an alternative taxonomy based on (3) information decomposition, where annotators annotate the degrees of redundancy: the extent to which modalities individually and together give the same predictions, uniqueness: the extent to which one modality enables a prediction that the other does not, and synergy: the extent to which both modalities enable one to make a prediction that one would not otherwise make using individual modalities. Through experiments and annotations, we highlight several opportunities and limitations of each approach and propose a method to automatically convert annotations of partial and counterfactual labels to information decomposition, yielding an accurate and efficient method for quantifying multimodal interactions. △ Less

Submitted 30 October, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

Comments: International Conference on Multimodal Interaction (ICMI '23), Code available at: https://github.com/pliang279/PID. arXiv admin note: text overlap with arXiv:2302.12247

arXiv:2306.04049 [pdf, other]

One-sided Matrix Completion from Two Observations Per Row

Authors: Steven Cao, Percy Liang, Gregory Valiant

Abstract: Given only a few observed entries from a low-rank matrix $X$, matrix completion is the problem of imputing the missing entries, and it formalizes a wide range of real-world settings that involve estimating missing data. However, when there are too few observed entries to complete the matrix, what other aspects of the underlying matrix can be reliably recovered? We study one such problem setting, t… ▽ More Given only a few observed entries from a low-rank matrix $X$, matrix completion is the problem of imputing the missing entries, and it formalizes a wide range of real-world settings that involve estimating missing data. However, when there are too few observed entries to complete the matrix, what other aspects of the underlying matrix can be reliably recovered? We study one such problem setting, that of "one-sided" matrix completion, where our goal is to recover the right singular vectors of $X$, even in the regime where recovering the left singular vectors is impossible, which arises when there are more rows than columns and very few observations. We propose a natural algorithm that involves imputing the missing values of the matrix $X^TX$ and show that even with only two observations per row in $X$, we can provably recover $X^TX$ as long as we have at least $Ω(r^2 d \log d)$ rows, where $r$ is the rank and $d$ is the number of columns. We evaluate our algorithm on one-sided recovery of synthetic data and low-coverage genome sequencing. In these settings, our algorithm substantially outperforms standard matrix completion and a variety of direct factorization methods. △ Less

Submitted 6 June, 2023; originally announced June 2023.

Comments: ICML 2023

arXiv:2306.03262 [pdf, other]

Has the Machine Learning Review Process Become More Arbitrary as the Field Has Grown? The NeurIPS 2021 Consistency Experiment

Authors: Alina Beygelzimer, Yann N. Dauphin, Percy Liang, Jennifer Wortman Vaughan

Abstract: We present the NeurIPS 2021 consistency experiment, a larger-scale variant of the 2014 NeurIPS experiment in which 10% of conference submissions were reviewed by two independent committees to quantify the randomness in the review process. We observe that the two committees disagree on their accept/reject recommendations for 23% of the papers and that, consistent with the results from 2014, approxi… ▽ More We present the NeurIPS 2021 consistency experiment, a larger-scale variant of the 2014 NeurIPS experiment in which 10% of conference submissions were reviewed by two independent committees to quantify the randomness in the review process. We observe that the two committees disagree on their accept/reject recommendations for 23% of the papers and that, consistent with the results from 2014, approximately half of the list of accepted papers would change if the review process were randomly rerun. Our analysis suggests that making the conference more selective would increase the arbitrariness of the process. Taken together with previous research, our results highlight the inherent difficulty of objectively measuring the quality of research, and suggest that authors should not be excessively discouraged by rejected work. △ Less

Submitted 5 June, 2023; originally announced June 2023.

arXiv:2305.17311 [pdf, other]

Beyond Positive Scaling: How Negation Impacts Scaling Trends of Language Models

Authors: Yuhui Zhang, Michihiro Yasunaga, Zhengping Zhou, Jeff Z. HaoChen, James Zou, Percy Liang, Serena Yeung

Abstract: Language models have been shown to exhibit positive scaling, where performance improves as models are scaled up in terms of size, compute, or data. In this work, we introduce NeQA, a dataset consisting of questions with negation in which language models do not exhibit straightforward positive scaling. We show that this task can exhibit inverse scaling, U-shaped scaling, or positive scaling, and th… ▽ More Language models have been shown to exhibit positive scaling, where performance improves as models are scaled up in terms of size, compute, or data. In this work, we introduce NeQA, a dataset consisting of questions with negation in which language models do not exhibit straightforward positive scaling. We show that this task can exhibit inverse scaling, U-shaped scaling, or positive scaling, and the three scaling trends shift in this order as we use more powerful prompting methods or model families. We hypothesize that solving NeQA depends on two subtasks: question answering (task 1) and negation understanding (task 2). We find that task 1 has linear scaling, while task 2 has sigmoid-shaped scaling with an emergent transition point, and composing these two scaling trends yields the final scaling trend of NeQA. Our work reveals and provides a way to analyze the complex scaling trends of language models. △ Less

Submitted 26 May, 2023; originally announced May 2023.

Comments: Published at ACL 2023 Findings

arXiv:2305.16765 [pdf, other]

Backpack Language Models

Authors: John Hewitt, John Thickstun, Christopher D. Manning, Percy Liang

Abstract: We present Backpacks: a new neural architecture that marries strong modeling performance with an interface for interpretability and control. Backpacks learn multiple non-contextual sense vectors for each word in a vocabulary, and represent a word in a sequence as a context-dependent, non-negative linear combination of sense vectors in this sequence. We find that, after training, sense vectors spec… ▽ More We present Backpacks: a new neural architecture that marries strong modeling performance with an interface for interpretability and control. Backpacks learn multiple non-contextual sense vectors for each word in a vocabulary, and represent a word in a sequence as a context-dependent, non-negative linear combination of sense vectors in this sequence. We find that, after training, sense vectors specialize, each encoding a different aspect of a word. We can interpret a sense vector by inspecting its (non-contextual, linear) projection onto the output space, and intervene on these interpretable hooks to change the model's behavior in predictable ways. We train a 170M-parameter Backpack language model on OpenWebText, matching the loss of a GPT-2 small (124Mparameter) Transformer. On lexical similarity evaluations, we find that Backpack sense vectors outperform even a 6B-parameter Transformer LM's word embeddings. Finally, we present simple algorithms that intervene on sense vectors to perform controllable text generation and debiasing. For example, we can edit the sense vocabulary to tend more towards a topic, or localize a source of gender bias to a sense vector and globally suppress that sense. △ Less

Submitted 26 May, 2023; originally announced May 2023.

Comments: ACL 2023 Camera-Ready

arXiv:2305.16349 [pdf, other]

Lexinvariant Language Models

Authors: Qian Huang, Eric Zelikman, Sarah Li Chen, Yuhuai Wu, Gregory Valiant, Percy Liang

Abstract: Token embeddings, a mapping from discrete lexical symbols to continuous vectors, are at the heart of any language model (LM). However, lexical symbol meanings can also be determined and even redefined by their structural role in a long context. In this paper, we ask: is it possible for a language model to be performant without \emph{any} fixed token embeddings? Such a language model would have to… ▽ More Token embeddings, a mapping from discrete lexical symbols to continuous vectors, are at the heart of any language model (LM). However, lexical symbol meanings can also be determined and even redefined by their structural role in a long context. In this paper, we ask: is it possible for a language model to be performant without \emph{any} fixed token embeddings? Such a language model would have to rely entirely on the co-occurence and repetition of tokens in the context rather than the \textit{a priori} identity of any token. To answer this, we study \textit{lexinvariant}language models that are invariant to lexical symbols and therefore do not need fixed token embeddings in practice. First, we prove that we can construct a lexinvariant LM to converge to the true language model at a uniform rate that is polynomial in terms of the context length, with a constant factor that is sublinear in the vocabulary size. Second, to build a lexinvariant LM, we simply encode tokens using random Gaussian vectors, such that each token maps to the same representation within each sequence but different representations across sequences. Empirically, we demonstrate that it can indeed attain perplexity comparable to that of a standard language model, given a sufficiently long context. We further explore two properties of the lexinvariant language models: First, given text generated from a substitution cipher of English, it implicitly implements Bayesian in-context deciphering and infers the mapping to the underlying real tokens with high accuracy. Second, it has on average 4X better accuracy over synthetic in-context reasoning tasks. Finally, we discuss regularizing standard language models towards lexinvariance and potential practical applications. △ Less

Submitted 24 May, 2023; originally announced May 2023.

arXiv:2305.14577 [pdf, other]

Difference-Masking: Choosing What to Mask in Continued Pretraining

Authors: Alex Wilf, Syeda Nahida Akter, Leena Mathur, Paul Pu Liang, Sheryl Mathew, Mengrou Shou, Eric Nyberg, Louis-Philippe Morency

Abstract: The self-supervised objective of masking-and-predicting has led to promising performance gains on a variety of downstream tasks. However, while most approaches randomly mask tokens, there is strong intuition that deciding what to mask can substantially improve learning outcomes. We investigate this in continued pretraining setting in which pretrained models continue to pretrain on domain-specific… ▽ More The self-supervised objective of masking-and-predicting has led to promising performance gains on a variety of downstream tasks. However, while most approaches randomly mask tokens, there is strong intuition that deciding what to mask can substantially improve learning outcomes. We investigate this in continued pretraining setting in which pretrained models continue to pretrain on domain-specific data before performing some downstream task. We introduce Difference-Masking, a masking strategy that automatically chooses what to mask during continued pretraining by considering what makes a task domain different from the pretraining domain. Empirically, we find that Difference-Masking outperforms baselines on continued pretraining settings across four diverse language-only and multimodal video tasks. △ Less

Submitted 17 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

arXiv:2305.14387 [pdf, other]

AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback

Authors: Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, Tatsunori B. Hashimoto

Abstract: Large language models (LLMs) such as ChatGPT have seen widespread adoption due to their strong instruction-following abilities. Developing these LLMs involves a complex yet poorly understood workflow requiring training with human feedback. Replicating and understanding this instruction-following requires tackling three major challenges: the high cost of data collection, the lack of trustworthy eva… ▽ More Large language models (LLMs) such as ChatGPT have seen widespread adoption due to their strong instruction-following abilities. Developing these LLMs involves a complex yet poorly understood workflow requiring training with human feedback. Replicating and understanding this instruction-following requires tackling three major challenges: the high cost of data collection, the lack of trustworthy evaluation, and the absence of reference method implementations. We address these challenges with AlpacaFarm, a simulator that enables research and development for learning from feedback at a low cost. First, we design LLM prompts to simulate human feedback that are 50x cheaper than crowdworkers and display high agreement with humans. Second, we propose an automatic evaluation and validate it against human instructions obtained on real-world interactions. Third, we contribute reference implementations for several methods (PPO, DPO, best-of-n, expert iteration, and more) that learn from pairwise feedback. Finally, as an end-to-end validation of AlpacaFarm, we train and evaluate eleven models on 10k pairs of real human feedback and show that rankings of models trained in AlpacaFarm match rankings of models trained on human data. As a demonstration of the research possible in AlpacaFarm, we find that methods that use a reward model can substantially improve over supervised fine-tuning and that our reference PPO implementation leads to a +10% improvement in win-rate against Davinci003. We release all components of AlpacaFarm at https://github.com/tatsu-lab/alpaca_farm. △ Less

Submitted 7 January, 2024; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: Spotlight at NeurIPS 2023

arXiv:2305.14342 [pdf, other]

Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training

Authors: Hong Liu, Zhiyuan Li, David Hall, Percy Liang, Tengyu Ma

Abstract: Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction on the time and cost of training. Adam and its variants have been state-of-the-art for years, and more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, Second-order Clipped St… ▽ More Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction on the time and cost of training. Adam and its variants have been state-of-the-art for years, and more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, Second-order Clipped Stochastic Optimization, a simple scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. The clipping controls the worst-case update size and tames the negative impact of non-convexity and rapid change of Hessian along the trajectory. Sophia only estimates the diagonal Hessian every handful of iterations, which has negligible average per-step time and memory overhead. On language modeling with GPT models of sizes ranging from 125M to 1.5B, Sophia achieves a 2x speed-up compared to Adam in the number of steps, total compute, and wall-clock time, achieving the same perplexity with 50% fewer steps, less total compute, and reduced wall-clock time. Theoretically, we show that Sophia, in a much simplified setting, adapts to the heterogeneous curvatures in different parameter dimensions, and thus has a run-time bound that does not depend on the condition number of the loss. △ Less

Submitted 5 March, 2024; v1 submitted 23 May, 2023; originally announced May 2023.

arXiv:2305.13583 [pdf, other]

Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition

Authors: Yaoting Wang, Yuanchao Li, Paul Pu Liang, Louis-Philippe Morency, Peter Bell, Catherine Lai

Abstract: Fusing multiple modalities has proven effective for multimodal information processing. However, the incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition. In this study, we first analyze how the salient affective information in one modality can be affected by the other, and demonstrate that inter-modal incongruity exists latently in crossmodal att… ▽ More Fusing multiple modalities has proven effective for multimodal information processing. However, the incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition. In this study, we first analyze how the salient affective information in one modality can be affected by the other, and demonstrate that inter-modal incongruity exists latently in crossmodal attention. Based on this finding, we propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model, which dynamically chooses the primary modality in each training batch and reduces fusion times by leveraging the learned hierarchy in the latent space to alleviate incongruity. The experimental evaluation on five benchmark datasets: CMU-MOSI, CMU-MOSEI, and IEMOCAP (sentiment and emotion), where incongruity implicitly lies in hard samples, as well as UR-FUNNY (humour) and MUStaRD (sarcasm), where incongruity is common, verifies the efficacy of our approach, showing that HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention. △ Less

Submitted 12 November, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: *First two authors contributed equally

arXiv:2305.12600 [pdf, other]

PRODIGY: Enabling In-context Learning Over Graphs

Authors: Qian Huang, Hongyu Ren, Peng Chen, Gregor Kržmanc, Daniel Zeng, Percy Liang, Jure Leskovec

Abstract: In-context learning is the ability of a pretrained model to adapt to novel and diverse downstream tasks by conditioning on prompt examples, without optimizing any parameters. While large language models have demonstrated this ability, how in-context learning could be performed over graphs is unexplored. In this paper, we develop \textbf{Pr}etraining \textbf{O}ver \textbf{D}iverse \textbf{I}n-Conte… ▽ More In-context learning is the ability of a pretrained model to adapt to novel and diverse downstream tasks by conditioning on prompt examples, without optimizing any parameters. While large language models have demonstrated this ability, how in-context learning could be performed over graphs is unexplored. In this paper, we develop \textbf{Pr}etraining \textbf{O}ver \textbf{D}iverse \textbf{I}n-Context \textbf{G}raph S\textbf{y}stems (PRODIGY), the first pretraining framework that enables in-context learning over graphs. The key idea of our framework is to formulate in-context learning over graphs with a novel \emph{prompt graph} representation, which connects prompt examples and queries. We then propose a graph neural network architecture over the prompt graph and a corresponding family of in-context pretraining objectives. With PRODIGY, the pretrained model can directly perform novel downstream classification tasks on unseen graphs via in-context learning. We provide empirical evidence of the effectiveness of our framework by showcasing its strong in-context learning performance on tasks involving citation networks and knowledge graphs. Our approach outperforms the in-context learning accuracy of contrastive pretraining baselines with hard-coded adaptation by 18\% on average across all setups. Moreover, it also outperforms standard finetuning with limited data by 33\% on average with in-context learning. △ Less

Submitted 21 May, 2023; originally announced May 2023.

arXiv:2305.12369 [pdf, other]

HIINT: Historical, Intra- and Inter- personal Dynamics Modeling with Cross-person Memory Transformer

Authors: Yubin Kim, Dong Won Lee, Paul Pu Liang, Sharifa Algohwinem, Cynthia Breazeal, Hae Won Park

Abstract: Accurately modeling affect dynamics, which refers to the changes and fluctuations in emotions and affective displays during human conversations, is crucial for understanding human interactions. By analyzing affect dynamics, we can gain insights into how people communicate, respond to different situations, and form relationships. However, modeling affect dynamics is challenging due to contextual fa… ▽ More Accurately modeling affect dynamics, which refers to the changes and fluctuations in emotions and affective displays during human conversations, is crucial for understanding human interactions. By analyzing affect dynamics, we can gain insights into how people communicate, respond to different situations, and form relationships. However, modeling affect dynamics is challenging due to contextual factors, such as the complex and nuanced nature of interpersonal relationships, the situation, and other factors that influence affective displays. To address this challenge, we propose a Cross-person Memory Transformer (CPM-T) framework which is able to explicitly model affective dynamics (intrapersonal and interpersonal influences) by identifying verbal and non-verbal cues, and with a large language model to utilize the pre-trained knowledge and perform verbal reasoning. The CPM-T framework maintains memory modules to store and update the contexts within the conversation window, enabling the model to capture dependencies between earlier and later parts of a conversation. Additionally, our framework employs cross-modal attention to effectively align information from multi-modalities and leverage cross-person attention to align behaviors in multi-party interactions. We evaluate the effectiveness and generalizability of our approach on three publicly available datasets for joint engagement, rapport, and human beliefs prediction tasks. Remarkably, the CPM-T framework outperforms baseline models in average F1-scores by up to 7.3%, 9.3%, and 2.0% respectively. Finally, we demonstrate the importance of each component in the framework via ablation studies with respect to multimodal temporal behavior. △ Less

Submitted 21 May, 2023; originally announced May 2023.

arXiv:2305.10429 [pdf, other]

DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

Authors: Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V. Le, Tengyu Ma, Adams Wei Yu

Abstract: The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of do… ▽ More The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of downstream tasks. We then resample a dataset with these domain weights and train a larger, full-sized model. In our experiments, we use DoReMi on a 280M-parameter proxy model to set the domain weights for training an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even when it downweights a domain. DoReMi improves average few-shot downstream accuracy by 6.5% points over a baseline model trained using The Pile's default domain weights and reaches the baseline accuracy with 2.6x fewer training steps. On the GLaM dataset, DoReMi, which has no knowledge of downstream tasks, even matches the performance of using domain weights tuned on downstream tasks. △ Less

Submitted 20 November, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

Comments: NeurIPS 2023

arXiv:2305.06884 [pdf, ps, other]

Risk-limiting Financial Audits via Weighted Sampling without Replacement

Authors: Shubhanshu Shekhar, Ziyu Xu, Zachary C. Lipton, Pierre J. Liang, Aaditya Ramdas

Abstract: We introduce the notion of a risk-limiting financial auditing (RLFA): given $N$ transactions, the goal is to estimate the total misstated monetary fraction~($m^*$) to a given accuracy $ε$, with confidence $1-δ$. We do this by constructing new confidence sequences (CSs) for the weighted average of $N$ unknown values, based on samples drawn without replacement according to a (randomized) weighted sa… ▽ More We introduce the notion of a risk-limiting financial auditing (RLFA): given $N$ transactions, the goal is to estimate the total misstated monetary fraction~($m^*$) to a given accuracy $ε$, with confidence $1-δ$. We do this by constructing new confidence sequences (CSs) for the weighted average of $N$ unknown values, based on samples drawn without replacement according to a (randomized) weighted sampling scheme. Using the idea of importance weighting to construct test martingales, we first develop a framework to construct CSs for arbitrary sampling strategies. Next, we develop methods to improve the quality of CSs by incorporating side information about the unknown values associated with each item. We show that when the side information is sufficiently predictive, it can directly drive the sampling. Addressing the case where the accuracy is unknown a priori, we introduce a method that incorporates side information via control variates. Crucially, our construction is adaptive: if the side information is highly predictive of the unknown misstated amounts, then the benefits of incorporating it are significant; but if the side information is uncorrelated, our methods learn to ignore it. Our methods recover state-of-the-art bounds for the special case when the weights are equal, which has already found applications in election auditing. The harder weighted case solves our more challenging problem of AI-assisted financial auditing. △ Less

Submitted 8 May, 2023; originally announced May 2023.

Comments: 23 pages, 8 figures, to appear in the Proceedings of Uncertainty in Artificial Intelligence (UAI) 2023

arXiv:2305.02440 [pdf, other]

Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs

Authors: Deepak Narayanan, Keshav Santhanam, Peter Henderson, Rishi Bommasani, Tony Lee, Percy Liang

Abstract: Large language models (LLMs) power many state-of-the-art systems in natural language processing. However, these models are extremely computationally expensive, even at inference time, raising the natural question: when is the extra cost of deploying a larger model worth the anticipated boost in capabilities? Better understanding this tradeoff fundamentally could benefit from an inference efficienc… ▽ More Large language models (LLMs) power many state-of-the-art systems in natural language processing. However, these models are extremely computationally expensive, even at inference time, raising the natural question: when is the extra cost of deploying a larger model worth the anticipated boost in capabilities? Better understanding this tradeoff fundamentally could benefit from an inference efficiency metric that is both (i) easily comparable across models from different providers, and (ii) representative of the true cost of running queries in an isolated performance environment. Unfortunately, access to LLMs today is largely restricted to black-box text generation APIs and raw runtimes measured through this interface do not satisfy these desiderata: model providers can apply various software and hardware optimizations orthogonal to the model, and models served on shared infrastructure are susceptible to performance contention. To circumvent these problems, we propose a new metric for comparing inference efficiency across models. This metric puts models on equal footing as though they were served (i) on uniform hardware and software, and (ii) without performance contention. We call this metric the \emph{idealized runtime}, and we propose a methodology to efficiently estimate this metric for autoregressive Transformer models. We also propose cost-aware variants that incorporate the number of accelerators needed to serve the model. Using these metrics, we compare ten state-of-the-art LLMs to provide the first analysis of inference efficiency-capability tradeoffs; we make several observations from this analysis, including the fact that the superior inference runtime performance of certain APIs is often a byproduct of optimizations within the API rather than the underlying model. Our methodology also facilitates the efficient comparison of different software and hardware stacks. △ Less

Submitted 3 May, 2023; originally announced May 2023.

arXiv:2305.01216 [pdf, other]

doi 10.1088/0256-307X/40/7/070301

Stark tuning of telecom single-photon emitters based on a single Er$^{3+}$

Authors: Jian-Yin Huang, Peng-Jun Liang, Liang Zheng, Pei-Yun Li, You-Zhi Ma, Duan-Chen Liu, Zong-Quan Zhou, Chuan-Feng Li, Guang-Can Guo

Abstract: The implementation of scalable quantum networks requires photons at the telecom band and long-lived spin coherence. The single Er$^{3+}$ in solid-state hosts is an important candidate that fulfills these critical requirements simultaneously. However, to entangle distant Er$^{3+}$ ions through photonic connections, the emission frequency of individual Er$^{3+}$ in solid-state matrix must be the sam… ▽ More The implementation of scalable quantum networks requires photons at the telecom band and long-lived spin coherence. The single Er$^{3+}$ in solid-state hosts is an important candidate that fulfills these critical requirements simultaneously. However, to entangle distant Er$^{3+}$ ions through photonic connections, the emission frequency of individual Er$^{3+}$ in solid-state matrix must be the same, which is challenging because the emission frequency of Er$^{3+}$ depends on its local environment. Herein, we propose and experimentally demonstrate the Stark tuning of the emission frequency of a single Er$^{3+}$ in a Y$_2$SiO$_5$ crystal by employing electrodes interfaced with a silicon photonic crystal cavity. We obtain a Stark shift of 182.9 $\pm$ 0.8 MHz which is approximately 27 times of the optical emission linewidth, demonstrating the promising applications in tuning the emission frequency of independent Er$^{3+}$ into the same spectral channels. Our results provide a useful solution for construction of scalable quantum networks based on single Er$^{3+}$ and a universal tool for tuning emission of individual rare-earth ions. △ Less

Submitted 27 June, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

Journal ref: Chinese Phys. Lett. 40 070301 (2023)

arXiv:2304.11332 [pdf, other]

Input Augmentation with SAM: Boosting Medical Image Segmentation with Segmentation Foundation Model

Authors: Yizhe Zhang, Tao Zhou, Shuo Wang, Peixian Liang, Danny Z. Chen

Abstract: The Segment Anything Model (SAM) is a recently developed large model for general-purpose segmentation for computer vision tasks. SAM was trained using 11 million images with over 1 billion masks and can produce segmentation results for a wide range of objects in natural scene images. SAM can be viewed as a general perception model for segmentation (partitioning images into semantically meaningful… ▽ More The Segment Anything Model (SAM) is a recently developed large model for general-purpose segmentation for computer vision tasks. SAM was trained using 11 million images with over 1 billion masks and can produce segmentation results for a wide range of objects in natural scene images. SAM can be viewed as a general perception model for segmentation (partitioning images into semantically meaningful regions). Thus, how to utilize such a large foundation model for medical image segmentation is an emerging research target. This paper shows that although SAM does not immediately give high-quality segmentation for medical image data, its generated masks, features, and stability scores are useful for building and training better medical image segmentation models. In particular, we demonstrate how to use SAM to augment image input for commonly-used medical image segmentation models (e.g., U-Net). Experiments on three segmentation tasks show the effectiveness of our proposed SAMAug method. The code is available at \url{https://github.com/yizhezhang2000/SAMAug}. △ Less

Submitted 21 June, 2023; v1 submitted 22 April, 2023; originally announced April 2023.

Comments: GitHub: https://github.com/yizhezhang2000/SAMAug. Comments and questions are welcome

arXiv:2304.09848 [pdf, other]

Evaluating Verifiability in Generative Search Engines

Authors: Nelson F. Liu, Tianyi Zhang, Percy Liang

Abstract: Generative search engines directly generate responses to user queries, along with in-line citations. A prerequisite trait of a trustworthy generative search engine is verifiability, i.e., systems should cite comprehensively (high citation recall; all statements are fully supported by citations) and accurately (high citation precision; every cite supports its associated statement). We conduct human… ▽ More Generative search engines directly generate responses to user queries, along with in-line citations. A prerequisite trait of a trustworthy generative search engine is verifiability, i.e., systems should cite comprehensively (high citation recall; all statements are fully supported by citations) and accurately (high citation precision; every cite supports its associated statement). We conduct human evaluation to audit four popular generative search engines -- Bing Chat, NeevaAI, perplexity.ai, and YouChat -- across a diverse set of queries from a variety of sources (e.g., historical Google user queries, dynamically-collected open-ended questions on Reddit, etc.). We find that responses from existing generative search engines are fluent and appear informative, but frequently contain unsupported statements and inaccurate citations: on average, a mere 51.5% of generated sentences are fully supported by citations and only 74.5% of citations support their associated sentence. We believe that these results are concerningly low for systems that may serve as a primary tool for information-seeking users, especially given their facade of trustworthiness. We hope that our results further motivate the development of trustworthy generative search engines and help researchers and users better understand the shortcomings of existing commercial systems. △ Less

Submitted 23 October, 2023; v1 submitted 19 April, 2023; originally announced April 2023.

Comments: 25 pages, 12 figures; to appear in Findings of EMNLP 2023

arXiv:2304.06819 [pdf, other]

Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction

Authors: Guillaume Jaume, Anurag Vaidya, Richard Chen, Drew Williamson, Paul Liang, Faisal Mahmood

Abstract: Integrating whole-slide images (WSIs) and bulk transcriptomics for predicting patient survival can improve our understanding of patient prognosis. However, this multimodal task is particularly challenging due to the different nature of these data: WSIs represent a very high-dimensional spatial description of a tumor, while bulk transcriptomics represent a global description of gene expression leve… ▽ More Integrating whole-slide images (WSIs) and bulk transcriptomics for predicting patient survival can improve our understanding of patient prognosis. However, this multimodal task is particularly challenging due to the different nature of these data: WSIs represent a very high-dimensional spatial description of a tumor, while bulk transcriptomics represent a global description of gene expression levels within that tumor. In this context, our work aims to address two key challenges: (1) how can we tokenize transcriptomics in a semantically meaningful and interpretable way?, and (2) how can we capture dense multimodal interactions between these two modalities? Specifically, we propose to learn biological pathway tokens from transcriptomics that can encode specific cellular functions. Together with histology patch tokens that encode the different morphological patterns in the WSI, we argue that they form appropriate reasoning units for downstream interpretability analyses. We propose fusing both modalities using a memory-efficient multimodal Transformer that can model interactions between pathway and histology patch tokens. Our proposed model, SURVPATH, achieves state-of-the-art performance when evaluated against both unimodal and multimodal baselines on five datasets from The Cancer Genome Atlas. Our interpretability framework identifies key multimodal prognostic factors, and, as such, can provide valuable insights into the interaction between genotype and phenotype, enabling a deeper understanding of the underlying biological mechanisms at play. We make our code public at: https://github.com/ajv012/SurvPath. △ Less

Submitted 15 April, 2024; v1 submitted 13 April, 2023; originally announced April 2023.

Comments: Accepted to CVPR 2024

arXiv:2304.03442 [pdf, other]

Generative Agents: Interactive Simulacra of Human Behavior

Authors: Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, Michael S. Bernstein

Abstract: Believable proxies of human behavior can empower interactive applications ranging from immersive environments to rehearsal spaces for interpersonal communication to prototyping tools. In this paper, we introduce generative agents--computational software agents that simulate believable human behavior. Generative agents wake up, cook breakfast, and head to work; artists paint, while authors write; t… ▽ More Believable proxies of human behavior can empower interactive applications ranging from immersive environments to rehearsal spaces for interpersonal communication to prototyping tools. In this paper, we introduce generative agents--computational software agents that simulate believable human behavior. Generative agents wake up, cook breakfast, and head to work; artists paint, while authors write; they form opinions, notice each other, and initiate conversations; they remember and reflect on days past as they plan the next day. To enable generative agents, we describe an architecture that extends a large language model to store a complete record of the agent's experiences using natural language, synthesize those memories over time into higher-level reflections, and retrieve them dynamically to plan behavior. We instantiate generative agents to populate an interactive sandbox environment inspired by The Sims, where end users can interact with a small town of twenty five agents using natural language. In an evaluation, these generative agents produce believable individual and emergent social behaviors: for example, starting with only a single user-specified notion that one agent wants to throw a Valentine's Day party, the agents autonomously spread invitations to the party over the next two days, make new acquaintances, ask each other out on dates to the party, and coordinate to show up for the party together at the right time. We demonstrate through ablation that the components of our agent architecture--observation, planning, and reflection--each contribute critically to the believability of agent behavior. By fusing large language models with computational, interactive agents, this work introduces architectural and interaction patterns for enabling believable simulations of human behavior. △ Less

Submitted 5 August, 2023; v1 submitted 6 April, 2023; originally announced April 2023.

arXiv:2304.01620 [pdf, other]

Image Blind Denoising Using Dual Convolutional Neural Network with Skip Connection

Authors: Wencong Wu, Shicheng Liao, Guannan Lv, Peng Liang, Yungang Zhang

Abstract: In recent years, deep convolutional neural networks have shown fascinating performance in the field of image denoising. However, deeper network architectures are often accompanied with large numbers of model parameters, leading to high training cost and long inference time, which limits their application in practical denoising tasks. In this paper, we propose a novel dual convolutional blind denoi… ▽ More In recent years, deep convolutional neural networks have shown fascinating performance in the field of image denoising. However, deeper network architectures are often accompanied with large numbers of model parameters, leading to high training cost and long inference time, which limits their application in practical denoising tasks. In this paper, we propose a novel dual convolutional blind denoising network with skip connection (DCBDNet), which is able to achieve a desirable balance between the denoising effect and network complexity. The proposed DCBDNet consists of a noise estimation network and a dual convolutional neural network (CNN). The noise estimation network is used to estimate the noise level map, which improves the flexibility of the proposed model. The dual CNN contains two branches: a u-shaped sub-network is designed for the upper branch, and the lower branch is composed of the dilated convolution layers. Skip connections between layers are utilized in both the upper and lower branches. The proposed DCBDNet was evaluated on several synthetic and real-world image denoising benchmark datasets. Experimental results have demonstrated that the proposed DCBDNet can effectively remove gaussian noise in a wide range of levels, spatially variant noise and real noise. With a simple model structure, our proposed DCBDNet still can obtain competitive denoising performance compared to the state-of-the-art image denoising models containing complex architectures. Namely, a favorable trade-off between denoising performance and model complexity is achieved. Codes are available at https://github.com/WenCongWu/DCBDNet. △ Less

Submitted 4 April, 2023; originally announced April 2023.

arXiv:2304.01498 [pdf, other]

DCANet: Dual Convolutional Neural Network with Attention for Image Blind Denoising

Authors: Wencong Wu, Guannan Lv, Yingying Duan, Peng Liang, Yungang Zhang, Yuelong Xia

Abstract: Noise removal of images is an essential preprocessing procedure for many computer vision tasks. Currently, many denoising models based on deep neural networks can perform well in removing the noise with known distributions (i.e. the additive Gaussian white noise). However eliminating real noise is still a very challenging task, since real-world noise often does not simply follow one single type of… ▽ More Noise removal of images is an essential preprocessing procedure for many computer vision tasks. Currently, many denoising models based on deep neural networks can perform well in removing the noise with known distributions (i.e. the additive Gaussian white noise). However eliminating real noise is still a very challenging task, since real-world noise often does not simply follow one single type of distribution, and the noise may spatially vary. In this paper, we present a new dual convolutional neural network (CNN) with attention for image blind denoising, named as the DCANet. To the best of our knowledge, the proposed DCANet is the first work that integrates both the dual CNN and attention mechanism for image denoising. The DCANet is composed of a noise estimation network, a spatial and channel attention module (SCAM), and a CNN with a dual structure. The noise estimation network is utilized to estimate the spatial distribution and the noise level in an image. The noisy image and its estimated noise are combined as the input of the SCAM, and a dual CNN contains two different branches is designed to learn the complementary features to obtain the denoised image. The experimental results have verified that the proposed DCANet can suppress both synthetic and real noise effectively. The code of DCANet is available at https://github.com/WenCongWu/DCANet. △ Less

Submitted 16 June, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

arXiv:2304.01378 [pdf, other]

doi 10.3847/1538-4357/acc9c3

A Diverse Population of z ~ 2 ULIRGs Revealed by JWST Imaging

Authors: J. -S. Huang, Zi-Jian Li, Cheng Cheng, Meicun Hou, Haojing Yan, S. P. Willner, Y. -S. Dai, X. Z. Zheng, J. Pan, D. Rigopoulou, T. Wang, Zhiyuan Li, Piaoran Liang, A. Esamdin, G. G. Fazio

Abstract: Four ultra-luminous infrared galaxies (ULIRGs) observed with JWST/NIRcam in the Cosmos Evolution Early Release Science program offer an unbiased preview of the $z\approx2$ ULIRG population. The objects were originally selected at 24 $μ$m and have strong polycyclic aromatic hydrocarbon emission features observed with Spitzer/IRS. The four objects have similar stellar masses of ${\sim}10^{11}$ M… ▽ More Four ultra-luminous infrared galaxies (ULIRGs) observed with JWST/NIRcam in the Cosmos Evolution Early Release Science program offer an unbiased preview of the $z\approx2$ ULIRG population. The objects were originally selected at 24 $μ$m and have strong polycyclic aromatic hydrocarbon emission features observed with Spitzer/IRS. The four objects have similar stellar masses of ${\sim}10^{11}$ M$_\odot$ but otherwise are quite diverse. One is an isolated disk galaxy, but it has an active nucleus as shown by X-ray observations and by a bright point-source nucleus. Two others are merging pairs with mass ratios of 6-7:1. One has active nuclei in both components, while the other has only one active nucleus: the one in the less-massive neighbor, not the ULIRG. The fourth object is clumpy and irregular and is probably a merger, but there is no sign of an active nucleus. The intrinsic spectral energy distributions for the four AGNs in these systems are typical of type-2 QSOs. This study is consistent with the idea that even if internal processes can produce large luminosities at $z\sim2$, galaxy merging may still be necessary for the most luminous objects. The diversity of these four initial examples suggests that large samples will be needed to understand the $z\approx2$ ULIRG population. △ Less

Submitted 6 April, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

Comments: 9 pages, 3 figures, 1 table, accepted for publication in ApJ. V2 updates author affiliations and acknowledgments, not scientific content

arXiv:2304.00179 [pdf]

Shedding Light on Rechargeable Na/Cl$_2$ Battery

Authors: Guanzhou Zhu, Peng Liang, Cheng-Liang Huang, Shu-Chi Wu, Cheng-Chia Huang, Yuan-Yao Li, Shi-Kai Jiang, Wei-Hsiang Huang, Jiachen Li, Feifei Wang, Bing-Joe Hwang, Hongjie Dai

Abstract: Advancing new ideas of rechargeable batteries represents an important path to meeting the ever increasing energy storage needs. Recently we showed rechargeable sodium/chlorine (Na/Cl$_2$) (or lithium/chlorine Li/Cl$_2$) batteries that used a Na (or Li) metal negative electrode, a microporous amorphous carbon nanosphere (aCNS) positive electrode and an electrolyte containing dissolved AlCl$_3$ and… ▽ More Advancing new ideas of rechargeable batteries represents an important path to meeting the ever increasing energy storage needs. Recently we showed rechargeable sodium/chlorine (Na/Cl$_2$) (or lithium/chlorine Li/Cl$_2$) batteries that used a Na (or Li) metal negative electrode, a microporous amorphous carbon nanosphere (aCNS) positive electrode and an electrolyte containing dissolved AlCl$_3$ and fluoride additives in thionyl chloride (SOCl$_2$). The main battery redox reaction involved conversion between NaCl and Cl$_2$ trapped in the carbon positive electrode, delivering a cyclable capacity of up to 1200 mAh g$^{-1}$ (based on positive electrode mass) at a ~ 3.5 V discharge voltage. Here, we discovered by X-ray photoelectron spectroscopy (XPS) that upon charging a Na/Cl$_2$ battery, chlorination of carbon in the positive electrode occurred to form C-Cl accompanied by molecular Cl$_2$ infiltrating the porous aCNS, consistent with Cl$_2$ probed by mass spectrometry. Synchrotron X-ray diffraction observed the development of graphitic ordering in the initially amorphous aCNS under battery charging when the carbon matrix was oxidized/chlorinated and infiltrated with Cl$_2$. The C-Cl, Cl$_2$ species and graphitic ordering were reversible upon discharge, accompanied by NaCl formation. The results revealed redox conversion between NaCl and Cl$_2$, reversible graphitic ordering/amorphourization of carbon through battery charge/discharge, and for the first time probed trapped Cl$_2$ in porous carbon by XPS. △ Less

Submitted 31 March, 2023; originally announced April 2023.

Comments: 30 pages, 9 figures

arXiv:2303.18058 [pdf, other]

doi 10.1145/3593434.3593450

Code Reviewer Recommendation for Architecture Violations: An Exploratory Study

Authors: Ruiyin Li, Peng Liang, Paris Avgeriou

Abstract: Code review is a common practice in software development and often conducted before code changes are merged into the code repository. A number of approaches for automatically recommending appropriate reviewers have been proposed to match such code changes to pertinent reviewers. However, such approaches are generic, i.e., they do not focus on specific types of issues during code reviews. In this p… ▽ More Code review is a common practice in software development and often conducted before code changes are merged into the code repository. A number of approaches for automatically recommending appropriate reviewers have been proposed to match such code changes to pertinent reviewers. However, such approaches are generic, i.e., they do not focus on specific types of issues during code reviews. In this paper, we propose an approach that focuses on architecture violations, one of the most critical type of issues identified during code review. Specifically, we aim at automating the recommendation of code reviewers, who are potentially qualified to review architecture violations, based on reviews of code changes. To this end, we selected three common similarity detection methods to measure the file path similarity of code commits and the semantic similarity of review comments. We conducted a series of experiments on finding the appropriate reviewers through evaluating and comparing these similarity detection methods in separate and combined ways with the baseline reviewer recommendation approach, RevFinder. The results show that the common similarity detection methods can produce acceptable performance scores and achieve a better performance than RevFinder. The sampling techniques used in recommending code reviewers can impact the performance of reviewer recommendation approaches. We also discuss the potential implications of our findings for both researchers and practitioners. △ Less

Submitted 24 April, 2023; v1 submitted 31 March, 2023; originally announced March 2023.

Comments: The 27th International Conference on Evaluation and Assessment in Software Engineering (EASE)

arXiv:2303.17548 [pdf, other]

Whose Opinions Do Language Models Reflect?

Authors: Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, Tatsunori Hashimoto

Abstract: Language models (LMs) are increasingly being used in open-ended contexts, where the opinions reflected by LMs in response to subjective queries can have a profound impact, both on user satisfaction, as well as shaping the views of society at large. In this work, we put forth a quantitative framework to investigate the opinions reflected by LMs -- by leveraging high-quality public opinion polls and… ▽ More Language models (LMs) are increasingly being used in open-ended contexts, where the opinions reflected by LMs in response to subjective queries can have a profound impact, both on user satisfaction, as well as shaping the views of society at large. In this work, we put forth a quantitative framework to investigate the opinions reflected by LMs -- by leveraging high-quality public opinion polls and their associated human responses. Using this framework, we create OpinionsQA, a new dataset for evaluating the alignment of LM opinions with those of 60 US demographic groups over topics ranging from abortion to automation. Across topics, we find substantial misalignment between the views reflected by current LMs and those of US demographic groups: on par with the Democrat-Republican divide on climate change. Notably, this misalignment persists even after explicitly steering the LMs towards particular demographic groups. Our analysis not only confirms prior observations about the left-leaning tendencies of some human feedback-tuned LMs, but also surfaces groups whose opinions are poorly reflected by current LMs (e.g., 65+ and widowed individuals). Our code and data are available at https://github.com/tatsu-lab/opinions_qa. △ Less

Submitted 30 March, 2023; originally announced March 2023.

arXiv:2303.15772 [pdf, other]

Ecosystem Graphs: The Social Footprint of Foundation Models

Authors: Rishi Bommasani, Dilara Soylu, Thomas I. Liao, Kathleen A. Creel, Percy Liang

Abstract: Foundation models (e.g. ChatGPT, StableDiffusion) pervasively influence society, warranting immediate social attention. While the models themselves garner much attention, to accurately characterize their impact, we must consider the broader sociotechnical ecosystem. We propose Ecosystem Graphs as a documentation framework to transparently centralize knowledge of this ecosystem. Ecosystem Graphs is… ▽ More Foundation models (e.g. ChatGPT, StableDiffusion) pervasively influence society, warranting immediate social attention. While the models themselves garner much attention, to accurately characterize their impact, we must consider the broader sociotechnical ecosystem. We propose Ecosystem Graphs as a documentation framework to transparently centralize knowledge of this ecosystem. Ecosystem Graphs is composed of assets (datasets, models, applications) linked together by dependencies that indicate technical (e.g. how Bing relies on GPT-4) and social (e.g. how Microsoft relies on OpenAI) relationships. To supplement the graph structure, each asset is further enriched with fine-grained metadata (e.g. the license or training emissions). We document the ecosystem extensively at https://crfm.stanford.edu/ecosystem-graphs/. As of March 16, 2023, we annotate 262 assets (64 datasets, 128 models, 70 applications) from 63 organizations linked by 356 dependencies. We show Ecosystem Graphs functions as a powerful abstraction and interface for achieving the minimum transparency required to address myriad use cases. Therefore, we envision Ecosystem Graphs will be a community-maintained resource that provides value to stakeholders spanning AI researchers, industry professionals, social scientists, auditors and policymakers. △ Less

Submitted 28 March, 2023; originally announced March 2023.

Comments: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). Ecosystem Graphs available at https://crfm.stanford.edu/ecosystem-graphs/

Journal ref: Published in AIES 2024

arXiv:2303.15715 [pdf, other]

Foundation Models and Fair Use

Authors: Peter Henderson, Xuechen Li, Dan Jurafsky, Tatsunori Hashimoto, Mark A. Lemley, Percy Liang

Abstract: Existing foundation models are trained on copyrighted material. Deploying these models can pose both legal and ethical risks when data creators fail to receive appropriate attribution or compensation. In the United States and several other countries, copyrighted content may be used to build foundation models without incurring liability due to the fair use doctrine. However, there is a caveat: If t… ▽ More Existing foundation models are trained on copyrighted material. Deploying these models can pose both legal and ethical risks when data creators fail to receive appropriate attribution or compensation. In the United States and several other countries, copyrighted content may be used to build foundation models without incurring liability due to the fair use doctrine. However, there is a caveat: If the model produces output that is similar to copyrighted data, particularly in scenarios that affect the market of that data, fair use may no longer apply to the output of the model. In this work, we emphasize that fair use is not guaranteed, and additional work may be necessary to keep model development and deployment squarely in the realm of fair use. First, we survey the potential risks of developing and deploying foundation models based on copyrighted content. We review relevant U.S. case law, drawing parallels to existing and potential applications for generating text, source code, and visual art. Experiments confirm that popular foundation models can generate content considerably similar to copyrighted material. Second, we discuss technical mitigations that can help foundation models stay in line with fair use. We argue that more research is needed to align mitigation strategies with the current state of the law. Lastly, we suggest that the law and technical mitigations should co-evolve. For example, coupled with other policy mechanisms, the law could more explicitly consider safe harbors when strong technical tools are used to mitigate infringement harms. This co-evolution may help strike a balance between intellectual property and innovation, which speaks to the original goal of fair use. But we emphasize that the strategies we describe here are not a panacea and more work is needed to develop policies that address the potential harms of foundation models. △ Less

Submitted 27 March, 2023; originally announced March 2023.

arXiv:2303.15198 [pdf, other]

Image Deblurring by Exploring In-depth Properties of Transformer

Authors: Pengwei Liang, Junjun Jiang, Xianming Liu, Jiayi Ma

Abstract: Image deblurring continues to achieve impressive performance with the development of generative models. Nonetheless, there still remains a displeasing problem if one wants to improve perceptual quality and quantitative scores of recovered image at the same time. In this study, drawing inspiration from the research of transformer properties, we introduce the pretrained transformers to address this… ▽ More Image deblurring continues to achieve impressive performance with the development of generative models. Nonetheless, there still remains a displeasing problem if one wants to improve perceptual quality and quantitative scores of recovered image at the same time. In this study, drawing inspiration from the research of transformer properties, we introduce the pretrained transformers to address this problem. In particular, we leverage deep features extracted from a pretrained vision transformer (ViT) to encourage recovered images to be sharp without sacrificing the performance measured by the quantitative metrics. The pretrained transformer can capture the global topological relations (i.e., self-similarity) of image, and we observe that the captured topological relations about the sharp image will change when blur occurs. By comparing the transformer features between recovered image and target one, the pretrained transformer provides high-resolution blur-sensitive semantic information, which is critical in measuring the sharpness of the deblurred image. On the basis of the advantages, we present two types of novel perceptual losses to guide image deblurring. One regards the features as vectors and computes the discrepancy between representations extracted from recovered image and target one in Euclidean space. The other type considers the features extracted from an image as a distribution and compares the distribution discrepancy between recovered image and target one. We demonstrate the effectiveness of transformer properties in improving the perceptual quality while not sacrificing the quantitative scores (PSNR) over the most competitive models, such as Uformer, Restormer, and NAFNet, on defocus deblurring and motion deblurring tasks. △ Less

Submitted 27 January, 2024; v1 submitted 24 March, 2023; originally announced March 2023.

Comments: accept by IEEE Transactions on Neural Networks and Learning Systems

Journal ref: IEEE Transactions on Neural Networks and Learning Systems 2024

arXiv:2303.14713 [pdf, other]

Engineering Software Systems for Quantum Computing as a Service: A Mapping Study

Authors: Aakash Ahmad, Muhammad Waseem, Peng Liang, Mahdi Fehmideh, Arif Ali Khan, David Georg Reichelt, Tommi Mikkonen

Abstract: Quantum systems have started to emerge as a disruptive technology and enabling platforms - exploiting the principles of quantum mechanics - to achieve quantum supremacy in computing. Academic research, industrial projects (e.g., Amazon Braket), and consortiums like 'Quantum Flagship' are striving to develop practically capable and commercially viable quantum computing (QC) systems and technologies… ▽ More Quantum systems have started to emerge as a disruptive technology and enabling platforms - exploiting the principles of quantum mechanics - to achieve quantum supremacy in computing. Academic research, industrial projects (e.g., Amazon Braket), and consortiums like 'Quantum Flagship' are striving to develop practically capable and commercially viable quantum computing (QC) systems and technologies. Quantum Computing as a Service (QCaaS) is viewed as a solution attuned to the philosophy of service-orientation that can offer QC resources and platforms, as utility computing, to individuals and organisations who do not own quantum computers. To understand the quantum service development life cycle and pinpoint emerging trends, we used evidence-based software engineering approach to conduct a systematic mapping study (SMS) of research that enables or enhances QCaaS. The SMS process retrieved a total of 55 studies, and based on their qualitative assessment we selected 9 of them to investigate (i) the functional aspects, design models, patterns, programming languages, deployment platforms, and (ii) trends of emerging research on QCaaS. The results indicate three modelling notations and a catalogue of five design patterns to architect QCaaS, whereas Python (native code or frameworks) and Amazon Braket are the predominant solutions to implement and deploy QCaaS solutions. From the quantum software engineering (QSE) perspective, this SMS provides empirically grounded findings that could help derive processes, patterns, and reference architectures to engineer software services for QC. △ Less

Submitted 26 March, 2023; originally announced March 2023.

arXiv:2303.11758 [pdf, other]

The Closed and Open Unbalanced Dicke Trimer Model: Critical Properties and Nonlinear Semiclassical Dynamics

Authors: Cheng Zhang, Pengfei Liang, Neill Lambert, Mauro Cirio

Abstract: We study a generalization of a recently introduced Dicke trimer model [Phys. Rev. Lett. 128, 163601, Phys. Rev. Research 5, L042016], which allows for cavity losses and unbalanced light-matter interactions (in which rotating and counter-rotating terms can be tuned independently). We find that in the extreme unbalanced limit, the $U(1)$ symmetry of the Tavis-Cummings model is restored, qualitativel… ▽ More We study a generalization of a recently introduced Dicke trimer model [Phys. Rev. Lett. 128, 163601, Phys. Rev. Research 5, L042016], which allows for cavity losses and unbalanced light-matter interactions (in which rotating and counter-rotating terms can be tuned independently). We find that in the extreme unbalanced limit, the $U(1)$ symmetry of the Tavis-Cummings model is restored, qualitatively altering the critical phenomena in the superradiant phase due to the presence of a zero-energy mode. To analyze this general regime, we develop a semiclassical theory based on a re-quantization technique. This theory also provides further physical insight on a recently reported anomalous finite critical fluctuations in the time-reversal broken regime. Moving to the open-Dicke case, by introducing local dissipation to the cavities, we observe the emergence of a rich range of nonequilibrium phases characterized by trivial and non-trivial dynamical signatures. In the former case, we identify, when time-reversal symmetry is present, a new stationary phase that features superradiant states in two of the three cavities and a normal state in the other cavity. In the latter case, we observe the emergence of dynamical phases in which the system exhibits superradiant oscillations, characterized by periodic or chaotic phase space patterns. The landscape of transitions associated with these dynamical phases features a wide range of qualitatively different behaviours such as Hopf bifurcations, anomalous Hopf bifurcations, collisions between basins of attraction, and exterior crises. We highlight how the two-critical-scalings feature of the closed model is robust under dissipation while the phenomenon of anomalous finite critical fluctuations becomes a mean-field scaling in the open model. △ Less

Submitted 2 November, 2023; v1 submitted 21 March, 2023; originally announced March 2023.

Comments: 22 pages, 13 figures

arXiv:2303.08733 [pdf, other]

doi 10.18293/SEKE2023-077

Practices and Challenges of Using GitHub Copilot: An Empirical Study

Authors: Beiqi Zhang, Peng Liang, Xiyu Zhou, Aakash Ahmad, Muhammad Waseem

Abstract: With the advances in machine learning, there is a growing interest in AI-enabled tools for autocompleting source code. GitHub Copilot, also referred to as the "AI Pair Programmer", has been trained on billions of lines of open source GitHub code, and is one of such tools that has been increasingly used since its launch on June 2021. However, little effort has been devoted to understanding the prac… ▽ More With the advances in machine learning, there is a growing interest in AI-enabled tools for autocompleting source code. GitHub Copilot, also referred to as the "AI Pair Programmer", has been trained on billions of lines of open source GitHub code, and is one of such tools that has been increasingly used since its launch on June 2021. However, little effort has been devoted to understanding the practices and challenges of using Copilot in programming with auto-completed source code. To this end, we conducted an empirical study by collecting and analyzing the data from Stack Overflow (SO) and GitHub Discussions. More specifically, we searched and manually collected 169 SO posts and 655 GitHub discussions related to the usage of Copilot. We identified the programming languages, IDEs, technologies used with Copilot, functions implemented, benefits, limitations, and challenges when using Copilot. The results show that when practitioners use Copilot: (1) The major programming languages used with Copilot are JavaScript and Python, (2) the main IDE used with Copilot is Visual Studio Code, (3) the most common used technology with Copilot is Node.js, (4) the leading function implemented by Copilot is data processing, (5) the significant benefit of using Copilot is useful code generation, and (6) the main limitation encountered by practitioners when using Copilot is difficulty of integration. Our results suggest that using Copilot is like a double-edged sword, which requires developers to carefully consider various aspects when deciding whether or not to use it. Our study provides empirically grounded foundations and basis for future research on the role of Copilot as an AI pair programmer in software development. △ Less

Submitted 27 April, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

Comments: The 35th International Conference on Software Engineering and Knowledge Engineering (SEKE)

arXiv:2303.06865 [pdf, other]

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

Authors: Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang

Abstract: The high computational and memory requirements of large language model (LLM) inference make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generat… ▽ More The high computational and memory requirements of large language model (LLM) inference make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. By solving a linear programming problem, it searches for efficient patterns to store and access tensors. FlexGen further compresses the weights and the attention cache to 4 bits with negligible accuracy loss. These techniques enable FlexGen to have a larger space of batch size choices and thus significantly increase maximum throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours. The code is available at https://github.com/FMInference/FlexGen △ Less

Submitted 12 June, 2023; v1 submitted 13 March, 2023; originally announced March 2023.

arXiv:2303.06822 [pdf, other]

doi 10.18293/DMSVIVA2023-071

Automatic Identification and Extraction of Assumptions on GitHub

Authors: Chen Yang, Zinan Ma, Peng Liang, Xiaohua Liu

Abstract: In software development, due to the lack of knowledge or information, time pressure, complex context, and many other factors, various uncertainties emerge during the development process, leading to assumptions scattered in projects. Being unaware of certain assumptions can result in critical problems (e.g., system vulnerability and failures). The prerequisite of analyzing and understanding assumpt… ▽ More In software development, due to the lack of knowledge or information, time pressure, complex context, and many other factors, various uncertainties emerge during the development process, leading to assumptions scattered in projects. Being unaware of certain assumptions can result in critical problems (e.g., system vulnerability and failures). The prerequisite of analyzing and understanding assumptions in software development is to identify and extract those assumptions with acceptable effort. In this paper, we proposed a tool (i.e., Assumption Miner) to automatically identify and extract assumptions on GitHub projects. To evaluate the applicability of Assumption Miner, we first presented an example of using the tool to mine assumptions from one large and popular deep learning framework project: the TensorFlow project on GitHub. We then conducted an evaluation of the tool. The results show that Assumption Miner can effectively identify and extract assumptions from the repositories on GitHub. △ Less

Submitted 25 April, 2023; v1 submitted 12 March, 2023; originally announced March 2023.

Comments: The 29th International DMS Conference on Visualization and Visual Languages (DMSVIVA)

arXiv:2303.02695 [pdf, other]

Understanding Bugs in Multi-Language Deep Learning Frameworks

Authors: Zengyang Li, Sicheng Wang, Wenshuo Wang, Peng Liang, Ran Mo, Bing Li

Abstract: Deep learning frameworks (DLFs) have been playing an increasingly important role in this intelligence age since they act as a basic infrastructure for an increasingly wide range of AIbased applications. Meanwhile, as multi-programming-language (MPL) software systems, DLFs are inevitably suffering from bugs caused by the use of multiple programming languages (PLs). Hence, it is of paramount signifi… ▽ More Deep learning frameworks (DLFs) have been playing an increasingly important role in this intelligence age since they act as a basic infrastructure for an increasingly wide range of AIbased applications. Meanwhile, as multi-programming-language (MPL) software systems, DLFs are inevitably suffering from bugs caused by the use of multiple programming languages (PLs). Hence, it is of paramount significance to understand the bugs (especially the bugs involving multiple PLs, i.e., MPL bugs) of DLFs, which can provide a foundation for preventing, detecting, and resolving bugs in the development of DLFs. To this end, we manually analyzed 1497 bugs in three MPL DLFs, namely MXNet, PyTorch, and TensorFlow. First, we classified bugs in these DLFs into 12 types (e.g., algorithm design bugs and memory bugs) according to their bug labels and characteristics. Second, we further explored the impacts of different bug types on the development of DLFs, and found that deployment bugs and memory bugs negatively impact the development of DLFs in different aspects the most. Third, we found that 28.6%, 31.4%, and 16.0% of bugs in MXNet, PyTorch, and TensorFlow are MPL bugs, respectively; the PL combination of Python and C/C++ is most used in fixing more than 92% MPL bugs in all DLFs. Finally, the code change complexity of MPL bug fixes is significantly greater than that of single-programming-language (SPL) bug fixes in all the three DLFs, while in PyTorch MPL bug fixes have longer open time and greater communication complexity than SPL bug fixes. These results provide insights for bug management in DLFs. △ Less

Submitted 5 March, 2023; originally announced March 2023.

Comments: The 31st IEEE/ACM International Conference on Program Comprehension (ICPC)

arXiv:2302.14600 [pdf, other]

Towards Human-Bot Collaborative Software Architecting with ChatGPT

Authors: Aakash Ahmad, Muhammad Waseem, Peng Liang, Mahdi Fehmideh, Mst Shamima Aktar, Tommi Mikkonen

Abstract: Architecting software-intensive systems can be a complex process. It deals with the daunting tasks of unifying stakeholders' perspectives, designers' intellect, tool-based automation, pattern-driven reuse, and so on, to sketch a blueprint that guides software implementation and evaluation. Despite its benefits, architecture-centric software engineering (ACSE) inherits a multitude of challenges. AC… ▽ More Architecting software-intensive systems can be a complex process. It deals with the daunting tasks of unifying stakeholders' perspectives, designers' intellect, tool-based automation, pattern-driven reuse, and so on, to sketch a blueprint that guides software implementation and evaluation. Despite its benefits, architecture-centric software engineering (ACSE) inherits a multitude of challenges. ACSE challenges could stem from a lack of standardized processes, socio-technical limitations, and scarcity of human expertise etc. that can impede the development of existing and emergent classes of software (e.g., IoTs, blockchain, quantum systems). Software Development Bots (DevBots) trained on large language models can help synergise architects' knowledge with artificially intelligent decision support to enable rapid architecting in a human-bot collaborative ACSE. An emerging solution to enable this collaboration is ChatGPT, a disruptive technology not primarily introduced for software engineering, but is capable of articulating and refining architectural artifacts based on natural language processing. We detail a case study that involves collaboration between a novice software architect and ChatGPT for architectural analysis, synthesis, and evaluation of a services-driven software application. Preliminary results indicate that ChatGPT can mimic an architect's role to support and often lead ACSE, however; it requires human oversight and decision support for collaborative architecting. Future research focuses on harnessing empirical evidence about architects' productivity and exploring socio-technical aspects of architecting with ChatGPT to tackle emerging and futuristic challenges of ACSE. △ Less

Submitted 26 February, 2023; originally announced February 2023.

Comments: 7 pages, 6 images, Manuscript submitted to a Conference (2023)

arXiv:2302.13289 [pdf, other]

Improving Representational Continuity via Continued Pretraining

Authors: Michael Sun, Ananya Kumar, Divyam Madaan, Percy Liang

Abstract: We consider the continual representation learning setting: sequentially pretrain a model $M'$ on tasks $T_1, \ldots, T_T$, and then adapt $M'$ on a small amount of data from each task $T_i$ to check if it has forgotten information from old tasks. Under a kNN adaptation protocol, prior work shows that continual learning methods improve forgetting over naive training (SGD). In reality, practitioners… ▽ More We consider the continual representation learning setting: sequentially pretrain a model $M'$ on tasks $T_1, \ldots, T_T$, and then adapt $M'$ on a small amount of data from each task $T_i$ to check if it has forgotten information from old tasks. Under a kNN adaptation protocol, prior work shows that continual learning methods improve forgetting over naive training (SGD). In reality, practitioners do not use kNN classifiers -- they use the adaptation method that works best (e.g., fine-tuning) -- here, we find that strong continual learning baselines do worse than naive training. Interestingly, we find that a method from the transfer learning community (LP-FT) outperforms naive training and the other continual learning methods. Even with standard kNN evaluation protocols, LP-FT performs comparably with strong continual learning methods (while being simpler and requiring less memory) on three standard benchmarks: sequential CIFAR-10, CIFAR-100, and TinyImageNet. LP-FT also reduces forgetting in a real world satellite remote sensing dataset (FMoW), and a variant of LP-FT gets state-of-the-art accuracies on an NLP continual learning benchmark. △ Less

Submitted 26 February, 2023; originally announced February 2023.

arXiv:2302.12766 [pdf, other]

Language-Driven Representation Learning for Robotics

Authors: Siddharth Karamcheti, Suraj Nair, Annie S. Chen, Thomas Kollar, Chelsea Finn, Dorsa Sadigh, Percy Liang

Abstract: Recent work in visual representation learning for robotics demonstrates the viability of learning from large video datasets of humans performing everyday tasks. Leveraging methods such as masked autoencoding and contrastive learning, these representations exhibit strong transfer to policy learning for visuomotor control. But, robot learning encompasses a diverse set of problems beyond control incl… ▽ More Recent work in visual representation learning for robotics demonstrates the viability of learning from large video datasets of humans performing everyday tasks. Leveraging methods such as masked autoencoding and contrastive learning, these representations exhibit strong transfer to policy learning for visuomotor control. But, robot learning encompasses a diverse set of problems beyond control including grasp affordance prediction, language-conditioned imitation learning, and intent scoring for human-robot collaboration, amongst others. First, we demonstrate that existing representations yield inconsistent results across these tasks: masked autoencoding approaches pick up on low-level spatial features at the cost of high-level semantics, while contrastive learning approaches capture the opposite. We then introduce Voltron, a framework for language-driven representation learning from human videos and associated captions. Voltron trades off language-conditioned visual reconstruction to learn low-level visual patterns, and visually-grounded language generation to encode high-level semantics. We also construct a new evaluation suite spanning five distinct robot learning problems $\unicode{x2013}$ a unified platform for holistically evaluating visual representations for robotics. Through comprehensive, controlled experiments across all five problems, we find that Voltron's language-driven representations outperform the prior state-of-the-art, especially on targeted problems requiring higher-level features. △ Less

Submitted 24 February, 2023; originally announced February 2023.

Comments: 30 Pages, 15 Figures

arXiv:2302.12247 [pdf, other]

Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework

Authors: Paul Pu Liang, Yun Cheng, Xiang Fan, Chun Kai Ling, Suzanne Nie, Richard Chen, Zihao Deng, Nicholas Allen, Randy Auerbach, Faisal Mahmood, Ruslan Salakhutdinov, Louis-Philippe Morency

Abstract: The recent explosion of interest in multimodal applications has resulted in a wide selection of datasets and methods for representing and integrating information from different modalities. Despite these empirical advances, there remain fundamental research questions: How can we quantify the interactions that are necessary to solve a multimodal task? Subsequently, what are the most suitable multimo… ▽ More The recent explosion of interest in multimodal applications has resulted in a wide selection of datasets and methods for representing and integrating information from different modalities. Despite these empirical advances, there remain fundamental research questions: How can we quantify the interactions that are necessary to solve a multimodal task? Subsequently, what are the most suitable multimodal models to capture these interactions? To answer these questions, we propose an information-theoretic approach to quantify the degree of redundancy, uniqueness, and synergy relating input modalities with an output task. We term these three measures as the PID statistics of a multimodal distribution (or PID for short), and introduce two new estimators for these PID statistics that scale to high-dimensional distributions. To validate PID estimation, we conduct extensive experiments on both synthetic datasets where the PID is known and on large-scale multimodal benchmarks where PID estimations are compared with human annotations. Finally, we demonstrate their usefulness in (1) quantifying interactions within multimodal datasets, (2) quantifying interactions captured by multimodal models, (3) principled approaches for model selection, and (4) three real-world case studies engaging with domain experts in pathology, mood prediction, and robotic perception where our framework helps to recommend strong multimodal models for each application. △ Less

Submitted 10 December, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

Comments: NeurIPS 2023. Code available at: https://github.com/pliang279/PID

arXiv:2302.11861 [pdf, other]

Out-of-Domain Robustness via Targeted Augmentations

Authors: Irena Gao, Shiori Sagawa, Pang Wei Koh, Tatsunori Hashimoto, Percy Liang

Abstract: Models trained on one set of domains often suffer performance drops on unseen domains, e.g., when wildlife monitoring models are deployed in new camera locations. In this work, we study principles for designing data augmentations for out-of-domain (OOD) generalization. In particular, we focus on real-world scenarios in which some domain-dependent features are robust, i.e., some features that vary… ▽ More Models trained on one set of domains often suffer performance drops on unseen domains, e.g., when wildlife monitoring models are deployed in new camera locations. In this work, we study principles for designing data augmentations for out-of-domain (OOD) generalization. In particular, we focus on real-world scenarios in which some domain-dependent features are robust, i.e., some features that vary across domains are predictive OOD. For example, in the wildlife monitoring application above, image backgrounds vary across camera locations but indicate habitat type, which helps predict the species of photographed animals. Motivated by theoretical analysis on a linear setting, we propose targeted augmentations, which selectively randomize spurious domain-dependent features while preserving robust ones. We prove that targeted augmentations improve OOD performance, allowing models to generalize better with fewer domains. In contrast, existing approaches such as generic augmentations, which fail to randomize domain-dependent features, and domain-invariant augmentations, which randomize all domain-dependent features, both perform poorly OOD. In experiments on three real-world datasets, we show that targeted augmentations set new states-of-the-art for OOD performance by 3.2-15.2 percentage points. △ Less

Submitted 6 February, 2024; v1 submitted 23 February, 2023; originally announced February 2023.

Showing 101–150 of 481 results for author: Liang, P