Empowering Biomedical Discovery with AI Agents
Perspective Summary
We envision “AI scientists” as systems capable of skeptical learning and reasoning that empower biomedical research through collaborative agents that integrate AI models and biomedical tools with experimental platforms. Rather than taking humans out of the discovery process, biomedical AI agents combine human creativity and expertise with AI’s ability to analyze large datasets, navigate hypothesis spaces, and execute repetitive tasks. AI agents are poised to be proficient in various tasks, planning discovery workflows and performing self-assessment to identify and mitigate gaps in their knowledge. These agents use large language models and generative models to feature structured memory for continual learning and use machine learning tools to incorporate scientific knowledge, biological principles, and theories. AI agents can impact areas ranging from virtual cell simulation, programmable control of phenotypes, and the design of cellular circuits to developing new therapies.
Introduction
A long-standing ambition for artificial intelligence (AI) is the development of AI systems that can eventually make major scientific discoveries, learn on their own, and acquire knowledge autonomously. While this concept of an “AI scientist” is aspirational, advances in agent-based AI pave the way to the development of AI agents as conversable systems capable of reflective learning and reasoning that coordinate large language models (LLMs), machine learning (ML) tools, experimental platforms, or even combinations of them [1, 2, 3, 4] (Figure 1). The complexity of biological problems requires an approach where a complex problem is decomposed into simpler tasks. AI agents can break down a problem into manageable subtasks, which can then be addressed by agents with specialized functions for targeted problem-solving and integration of scientific knowledge [1, 5]. In the near future, AI agents can accelerate discovery workflows by making them faster and more resource-efficient. AI agents improve the efficiency of routine tasks, automate repetitive processes, and analyze large datasets to navigate hypothesis spaces at a scale and precision that surpasses current human-driven efforts. This automation allows for continuous, high-throughput research that would be impossible for human researchers to perform alone at the same scale or speed. Looking further ahead, AI agents can enable insights that might not have been possible using ML alone by making predictions across temporal and spatial scales prior to experimental measurements at those scales and can eventually identify new modes of behavior within biological systems [5].
This vision is possible thanks to advances in LLMs [6, 7, 8], multimodal learning, and generative models. Chat-optimized LLMs, such as GPT-4 [9], can incorporate feedback, enabling AI agents to cooperate through conversations with each other and with humans [10]. These conversations can involve agents seeking human feedback and critique, and identifying gaps in their knowledge [11, 12]. Moreover, since a single LLM can exhibit a broad range of capabilities, especially when configured with appropriate prompts and inference settings, conversations between differently configured agents can combine these capabilities in a modular manner [13]. LLMs have also demonstrated the ability to solve complex tasks by breaking them into subtasks [14, 15]. However, if we follow conventional approaches to foundation models such as LLMs and other large pre-trained models, we may not develop AI agents that can generate novel hypotheses, because such novelty would not have been in the data used to train the model. This suggests that current foundation models alone are not sufficient for “AI scientists”. Using LLMs as a comparison, generating novel hypotheses requires creativity and grounding in scientific knowledge, whereas generating novel text requires adherence to semantic and syntactic rules [16]. The latter aligns well with next-token prediction in LLMs, while the former does not.
Here, we offer a perspective that “AI scientists” can be realized as AI agents backed by humans, LLMs, ML models, and other tools like experimental platforms that form a compound AI system. An AI agent should be able to formulate biomedical hypotheses, critically evaluate them, characterize their uncertainty, and use that as a driver to acquire and refine its scientific knowledge bases in a way that human scientists can trust [17]. AI agents should be designed to adapt to new biological insights, incorporate the latest scientific findings, and refine hypotheses based on experimental results. This adaptability ensures agents remain relevant in the face of rapidly evolving biological data [16], balancing between encoding new findings and retaining old knowledge [18].
Realizing this perspective shift, biomedical AI agents can impact areas ranging from virtual cell simulation, programmable control of phenotypes, and the design of cellular circuits to developing new therapies. Virtual cell simulation involves creating detailed models of cellular processes, where AI can predict the effects of genetic modifications or drug treatments on cell behavior. This can allow for an understanding of cellular mechanisms and generation of testable hypotheses, reducing the time and cost of traditional methods. Programmable control of phenotypes leverages AI agents to design precise genetic modifications to study gene functions. For example, CRISPR-based gene editing guided by an AI agent can activate or inhibit specific genes across large cell populations in a multi-round editing campaign. Each round involves identifying the next edit based on the user-specified target phenotype and experimental readout from the previous round. Designing cellular circuits involves using AI agents to predict the behavior of genetic components and optimize their arrangement to create circuits that perform tasks such as sensing environmental changes or producing therapeutic proteins.
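The multi-round editing campaign described above can be sketched as a simple closed loop. This is a toy illustration under stated assumptions, not a description of any existing system: the gene names are hypothetical, each edit is assumed to shift a scalar readout additively, and the "experiment" is simulated by adding that shift. In each round, the loop picks the remaining edit predicted to bring the readout closest to the user-specified target, then uses the new readout as the starting point for the next round.

```python
def run_campaign(effects, target, rounds, baseline=0.0):
    """Greedy multi-round campaign toward a target readout.

    effects:  dict mapping edit name -> assumed additive change in the readout
              (a toy stand-in for both the agent's prediction and the experiment)
    target:   user-specified target phenotype, expressed as a scalar readout
    rounds:   maximum number of editing rounds
    Returns a list of (edit, readout_after_edit) tuples, one per round performed.
    """
    applied, history = [], []
    readout = baseline
    for _ in range(rounds):
        remaining = [e for e in effects if e not in applied]
        if not remaining:
            break
        # Choose the edit whose predicted readout is closest to the target.
        best = min(remaining, key=lambda e: abs(readout + effects[e] - target))
        # Stop if no remaining edit is predicted to improve on the current readout.
        if abs(readout + effects[best] - target) >= abs(readout - target):
            break
        applied.append(best)
        readout = readout + effects[best]  # simulated experimental readout
        history.append((best, readout))
    return history


# Hypothetical edits and effect sizes, for illustration only.
toy_effects = {"activate_G1": 3.0, "inhibit_G2": 1.5, "activate_G3": -1.0}
print(run_campaign(toy_effects, target=4.0, rounds=3))
# [('activate_G1', 3.0), ('inhibit_G2', 4.5)]
```

In a real campaign, the prediction and the measurement would come from separate sources (a model and an experimental platform), and the mismatch between them would drive model updates between rounds.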
Ethical considerations arise from biomedical AI agents [19, 20]. Allowing them to make changes in environments through ML tools or calls to experimental platforms can be dangerous, so safeguards need to be in place to prevent harm [21]. Conversely, discovery workflows might be restricted to conversations between AI agents, with no interaction with environments allowed. In that case, we need to consider the impact of such interactions on scientists and their reliance on AI agents. Further, challenges uniquely relevant to biomedical AI agents include the lack of large experimental datasets that cover diverse use cases beyond the current focus on a handful of biomedical domains like structural biology and single-cell science. AI agents need to represent biomedical knowledge in a data-efficient manner and achieve strong generalization to new tasks with little or no additional training. Biomedical AI agents can assist with research and operations under human oversight, but their impact and challenges highlight the need for responsible implementation.
Evolving use of data-driven models in biomedical research
Over the past several decades, data-driven models have reshaped biomedical research by developing databases, search engines, machine learning, and interactive and foundation learning models (Figure 2). These models have advanced modeling of proteins [22, 23, 24, 25, 26], genes [27], phenotypes [28], clinical outcomes [29, 30, 31], and chemical compounds [32, 33] through mining of biomedical data.
Databases and search engines. In biological research, databases (DBs) [34, 35, 36] aggregate knowledge from experiments and studies, offering searchable repositories containing standardized biological data vocabularies. An example of such a database is the AlphaFold Protein Structure DB [37], which includes more than 200 million protein structures predicted by AlphaFold [38]. Molecular search engines retrieve information from these databases [39, 40, 41]. FoldSeek [42] retrieves protein structures from the AlphaFold DB by translating query structures into 3D interaction alphabet sequences and using pretrained substitution matrices. Search engines are designed to retrieve information based on specific queries, lacking the ability to refine these queries through reasoning. They cannot iteratively process obtained information to refine results or customize subsequent actions. Additionally, while databases reduce the risk of misinformation through curated data, they lack mechanisms to identify and remove irrelevant information.
Distinct from search engines, AI agents are capable of reasoning to formulate search queries and subsequently acquire information. Curated databases offer structured and factual information, which helps reduce the risk of misinformation generated by agent hallucinations [43, 44]. For example, AI agents can be equipped with retrieval-augmented generation [44] to answer questions based on scientific literature. A notable feature of these agents is their ability to retrieve information when needed and to create and iteratively process the obtained passages. This reflection process makes the agent controllable during inference, allowing for customization of its actions to meet task requirements beyond what is possible using search engines and database queries.
Machine learning models. Beyond information retrieval, ML models excel in identifying patterns and assimilating latent knowledge to generalize predictions about novel data [45, 46]. Existing ML approaches typically require a specialized model for each task and do not possess the reasoning and interactive capabilities that distinguish AI agents. An example is AlphaFold [38], which predicts 3D protein structures with high accuracy using multiple sequence alignments with a deep learning model but is tailored for protein folding. AI agents represent an evolution in ML models, building on the foundations of successes such as the transformer architecture [47] and generative pretraining [8]. Unlike traditional ML models, agents assess the evolving environment, which is valuable for modeling dynamic biological systems.
Interactive learning models. Interactive learning, which includes active learning [48] and reinforcement learning [49], represents a further advancement in ML models by incorporating exploration mechanisms and human feedback. Active learning strategies can help build models for datasets with small sample sizes when conventional ML models might be insufficient due to limited statistical power. Active learning selectively queries the most informative data points for labeling, which improves how efficiently models learn from data. Reinforcement learning involves an agent learning how to act by observing the results of past actions in an environment, mirroring a trial-and-error approach. In biological research, interactive learning has been used for small molecule design [50], protein design [51, 52], drug discovery [53, 54], perturbation experiment design [55], and cancer screening [56]. For instance, GENTRL [50] uses reinforcement learning to navigate the chemical space and identify chemical compounds that can act against biological targets. However, interactive models are predominantly designed for narrow use cases and struggle to generalize to new goals without retraining the models from scratch. Leveraging interactive learning, AI agents achieve greater autonomy in information retrieval tasks. Active learning improves training efficiency by selecting which data to label so as to maximize model performance. However, AI agents extend beyond this data-centric approach; for example, reinforcement learning with human feedback (RLHF) [49] uses a “reward model” to train and fine-tune an LLM-based agent with direct human feedback to understand human instruction naturally.
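The querying step of active learning can be illustrated with uncertainty sampling, one common selection strategy. The sketch below is a minimal example under assumptions not taken from the works cited above: a binary classifier has already produced a predicted probability for each unlabeled sample, and the sample identifiers are hypothetical. It selects the samples whose predictions are closest to 0.5, which are the ones the model is least sure about.

```python
def select_most_uncertain(probabilities, k):
    """Pick the k unlabeled samples with predictions closest to 0.5.

    probabilities: dict mapping sample id -> predicted P(positive class)
    k:             number of samples to send for labeling
    Returns a list of sample ids, most uncertain first.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    ranked = sorted(probabilities, key=lambda sid: abs(probabilities[sid] - 0.5))
    return ranked[:k]


# Hypothetical pool of unlabeled compounds with model-predicted activity.
pool = {"cmpd_1": 0.97, "cmpd_2": 0.52, "cmpd_3": 0.10, "cmpd_4": 0.45}
print(select_most_uncertain(pool, 2))  # ['cmpd_2', 'cmpd_4']
```

After the selected samples are labeled experimentally, the model is retrained and the selection is repeated, so labeling effort concentrates where it is expected to change the model most.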
AI agents. Biomedical AI agents have advanced capabilities, including proactive information acquisition through perception modules, interaction with tools, reasoning, and engaging with and learning from their environments. Agents use external tools, such as lab equipment, and have perception modules, such as integrated visual ML tools, to receive information from the environment. Agents can incorporate search engines and ML tools and process information across data modalities via perception modules to generate hypotheses and refine them based on scientific evidence [1, 2].
Types of biomedical AI agents
The prevailing approach to building agents is to use LLMs, where a single LLM is programmed to perform various roles. However, beyond LLM agents, we envision multi-agent systems for discovery workflows that combine heterogeneous agents (Figure 1) consisting of ML tools, domain-specific specialized tools, and human experts. Given that much of biomedical research is not text-based, such agents have broader applicability to biomedicine than LLM-based agents alone.
Large language model based AI agents
Programming a single LLM with diverse roles equips LLM-based agents with conversational interfaces that emulate human expertise and can access tools [57, 58] (Figure 3a). The rationale behind this approach stems from pretraining an LLM to encode general knowledge, followed by in-domain fine-tuning of the LLM to encode domain-specific specialist knowledge and aligning the LLM with human users through role-playing and conversation. Instruction tuning [59] can be used for fine-tuning by training the LLM to follow human instructions through prompt examples, including dialogues that incorporate biological reasoning [60]. Additionally, RLHF optimizes LLM performance by selecting the most human-preferred outputs from a range of responses to specific prompts, further aligning LLMs with human roles. Consequently, a single LLM, programmed to fulfill multiple roles, can provide a more practical and effective solution than developing specialized models. By assigning specific roles, the agents can replicate the specialized knowledge of experts across various fields, such as structural biology, genetics, and chemistry, surpassing the capabilities of querying a non-specialized LLM [61] and performing tasks previously not possible [62]. Early results in clinical medicine question-answering suggest that assigning specific roles, such as clinicians, to GPT-4 [61] can achieve higher accuracy on multiple-choice benchmarks than domain-specialized LLMs like BioGPT [63], NYUTron [64], and Med-PaLM [65, 66].
We envision three approaches for assigning roles to biological AI agents: domain-specific fine-tuning, in-context learning, and automatic generation of optimized roles. The first approach involves instruction-tuning an LLM across many biological tasks to ground the LLM in the biological domain, followed by RLHF to ensure that the tuned LLM performs tasks aligned with scientists’ goals, wants, and needs. The second approach uses in-context learning of LLMs [67] to process longer contextual information provided in inputs, such as biologist-generated instructions, enabling agents to grasp the domain context for each task. This approach is supported by using textual prompts to define agent roles [62, 68]. Both strategies require biologists to carefully gather task-specific data or craft prompts. However, as roles defined by humans may not always direct agents as intended, there has been a movement towards allowing LLM-based agents more autonomy in role specification. This paradigm shift in role definition enables agents to autonomously generate and refine role prompts, engaging in self-directed learning and role identification. For instance, an agent’s ability to evolve and tailor its prompts in reaction to user inputs has been demonstrated in [69]. Additionally, the use of an LLM as an optimizer to refine prompts for improved performance in assigned roles has been investigated in [70]. Through this self-referential learning framework, agents transition from task executors to entities capable of more autonomous learning.
The agent system, comprising a single LLM prompted to adopt various roles, has been shown to be a valuable support tool in scientific research. Studies suggest that agents allocated specific roles exhibit enhanced capabilities compared to either sequentially querying a single LLM or employing a single tool repetitively. A case in point is Coscientist [1], which shows the potential of GPT-4-based agents for chemical research tasks, including optimizing reactions for palladium-catalyzed cross-couplings. Within Coscientist, GPT-4 undertakes the role of a planner, serving as a research assistant. The agent uses in-context prompts to invoke tools such as web and documentation search, code execution via Python API, and even symbolic lab language (SLL) [1]. To complete tasks that require access to a physical device, the planning agent starts with a prompt provided by the scientist and uses search tools to compile the documentation for the experiment. Following this, the agent generates SLL code and executes it, which entails transferring it onto the device and controlling the device.
Multi-agent AI systems
LLM-based agents implemented through autoregressive LLM approaches acquire skills such as planning and reasoning by emulating observed behaviors in training datasets. However, this mimicry-based learning results in limited agent capabilities, as they do not achieve a deep understanding of these behaviors [71]. Consequently, a single agent often lacks the comprehensive skill set needed to complete complex tasks. A practical alternative is deploying a multi-agent AI system, wherein the task is segmented into more manageable subtasks. This approach allows individual agents to address specific subtasks efficiently, even with incomplete capabilities. Distinct from single-LLM-based agents, a multi-agent system incorporates several agents endowed with specialized capabilities, tools, and domain-specific knowledge. For successful task execution, these agents must conform to working protocols. Such cooperative efforts equip LLMs with unique roles, specialized knowledge bases, and varied toolsets, simulating an interdisciplinary team of biology specialists. This approach is akin to the diverse expertise found across departments within a university or an institute.
In the following, we introduce five collaborative schemes for multi-agent systems.
Brainstorming agents (Figure 3b). Brainstorming research ideas with multiple agents constitutes a collaborative session to generate a broad spectrum of research concepts through the joint expertise of scientists and agents. In such sessions, agents are prompted to contribute ideas, prioritizing the volume of contributions over their initial quality to foster creativity and innovation. This method encourages the proposal of unconventional and novel ideas, allowing participants to build upon the suggestions of others to uncover new avenues of inquiry while withholding judgment or critique. The process enables agents to apply their domain knowledge and resources to form a collective idea pool. Each agent would provide insights and generate hypotheses based on their specialized knowledge, which the group can then integrate and refine. For example, in a multi-agent system designed for Alzheimer’s research, agents could specialize in microglia biology, neuronal degeneration, and neuroinflammation. To explore new therapeutic targets for Alzheimer’s disease, an agent specialized in microglia biology might propose investigating the role of microglial cells in synaptic pruning, while another agent focused on neuronal degeneration could suggest examining the protective effects of certain neurotrophic factors. These diverse ideas are pooled together, allowing researchers to explore a comprehensive range of potential research directions.
Expert consultation agents (Figure 3c). Expert consultation entails soliciting expertise from individuals or entities with specialized knowledge. This process involves expert agents gathering information from various sources and providing insights, solutions, decisions, or evaluations in response. Other agents or humans then refine their approaches based on this feedback. LLMs have the potential to assist in offering scientific critiques on research manuscripts, as demonstrated in recent studies [72]. However, LLMs lack the nuanced understanding of human reviewers and should be seen as complementary to, not a replacement for, human expertise. Similarly, an AI agent might consult another agent specialized in a specific area to refine ideas within AI systems, mirroring the mentor-mentee dynamics found in academic environments. In another example, in addressing Alzheimer’s and related dementias, diagnosing Alzheimer’s based on cognitive criteria might present borderline cases. Consulting an AI agent could offer additional perspectives, determining if such cases align with Alzheimer’s based on brain pathology or alternative biomarkers.
Research debate agents (Figure 3d). In a research debate, two teams of agents present contrasting perspectives on a research topic, aiming to persuade the agents of the opposing team. Agents are split into two groups, each adopting distinct roles for the debate. One group gathers evidence to fortify its position using various knowledge sources and tools, while the opposing group critiques this evidence, striving to expose or neutralize its weaknesses with superior evidence. The objective for each faction is to articulate their arguments more effectively than their rivals, engaging in a systematic discourse to defend their viewpoint and challenge the veracity of their adversaries’ assertions. This methodology promotes critical thinking and bolsters effective communication as each team endeavors to construct the most compelling argument supporting their stance.
Round table discussion agents (Figure 3e). Round table discussions involve multiple agents engaging in a process that fosters the expression of diverse viewpoints to make collaborative decisions on the topics under discussion. In such sessions, agents articulate their ideas and insights, pose questions, and provide feedback on others’ contributions. They then respond to these queries, refine their initial propositions based on feedback, or attempt to persuade their peers. This method promotes equal participation among all agents, urging them to contribute their expertise and perspectives, offer constructive criticism, question underlying assumptions, and suggest amendments to improve the proposed solutions. For instance, Reconcile [73] implements a collaborative reasoning scheme among LLM agents through successive rounds of dialogue. Agents attempt to convince each other to adjust their responses and use a confidence-weighted voting mechanism to achieve a more accurate consensus than if a single LLM-based agent is used. During each discussion round, Reconcile orchestrates the interaction between agents using a ‘discussion prompt,’ which includes grouped answers and explanations produced by each agent in the preceding round, their confidence levels, and examples of human explanations for correcting answers.
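The confidence-weighted voting step mentioned above can be sketched as follows. This is a simplified illustration and not Reconcile's actual implementation: it assumes each agent reports one answer with a confidence in [0, 1], and it uses the raw confidences as vote weights without any recalibration.

```python
from collections import defaultdict


def confidence_weighted_vote(responses):
    """Return the answer with the largest summed confidence.

    responses: list of (answer, confidence) pairs, one per agent,
               with each confidence in [0, 1]
    Ties are broken in favor of the answer that appeared first.
    """
    if not responses:
        raise ValueError("at least one response is required")
    totals = defaultdict(float)
    for answer, confidence in responses:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        totals[answer] += confidence
    return max(totals, key=totals.get)


# Two low-confidence agents are outvoted by one high-confidence agent,
# which differs from the result of a plain majority vote.
round_answers = [("A", 0.95), ("B", 0.4), ("B", 0.4)]
print(confidence_weighted_vote(round_answers))  # A
```

The example shows why weighting matters: a plain majority vote would return "B", whereas summed confidence favors the single agent that is far more certain of "A".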
Self-driving lab agents (Figure 3f). The self-driving laboratory is a multi-agent system where the end-to-end discovery workflow is iteratively optimized under the broad direction of scientists but without requiring step-by-step human oversight [74]. Once the agent system is trained, it can describe experiments necessary to test the generated hypotheses, analyze the results of said experiments, and use them to improve its internal scientific knowledge models. Agents in the self-driving system need to address the following four elements: determine inductive biases to reduce the search space of hypotheses, implement methods to rank hypotheses by weighing their potential biomedical value against experimental cost, characterize skepticism via uncertainty quantification and analysis of experiments in reference to the original hypothesis, and refine hypotheses using data and counterexamples from experiments [75]. Ideally, hypothesis agents are creative and reflective when developing biological hypotheses that extrapolate indirectly from the existing body of knowledge [16]. There is emerging evidence that generative models have the potential to generate novel hypotheses. One study [76] demonstrated that latent knowledge from published materials science literature can be used to recommend novel materials. Another [77] leveraged LLMs trained with an autoregressive pretraining objective to predict molecules. Experimental agents steer operational agents that use a combination of in silico approaches and physical platforms to execute experiments. Reasoning agents integrate the latest results to guide future experimental design. The utility of experimental results, such as the yield of high-throughput screening of a chemical library against a biological target, can be compared for different versions of the agent system given a time budget for hypothesis and experiment generation.
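One way to picture the ranking of hypotheses by biomedical value and experimental cost is an expected-value-per-cost score. The scoring rule below is an assumption made for illustration, not a method proposed in the cited work; the hypothesis names, value estimates, probabilities, and costs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    name: str
    value: float   # hypothetical estimate of biomedical value if true
    p_true: float  # agent's estimated probability that the hypothesis holds
    cost: float    # estimated experimental cost, in arbitrary units


def rank_hypotheses(hypotheses):
    """Sort hypotheses by expected value per unit cost, best first."""
    for h in hypotheses:
        if h.cost <= 0:
            raise ValueError(f"cost must be positive for {h.name}")
        if not 0.0 <= h.p_true <= 1.0:
            raise ValueError(f"p_true must lie in [0, 1] for {h.name}")
    return sorted(hypotheses, key=lambda h: h.value * h.p_true / h.cost, reverse=True)


candidates = [
    Hypothesis("H_A", value=10.0, p_true=0.2, cost=4.0),   # score 0.5
    Hypothesis("H_B", value=6.0, p_true=0.5, cost=2.0),    # score 1.5
    Hypothesis("H_C", value=8.0, p_true=0.9, cost=12.0),   # score about 0.6
]
print([h.name for h in rank_hypotheses(candidates)])  # ['H_B', 'H_C', 'H_A']
```

A purely greedy score like this ignores the information gained from uncertain hypotheses, so a practical system would likely combine it with an exploration term tied to the uncertainty estimates.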
Levels of autonomy in AI agents
When integrated with experimental platforms, AI agents can operate at varying levels of autonomy tailored to the diverse requirements across biomedical fields. We classify these AI agents into four levels according to their proficiency in three areas of discovery: Hypothesis, Experiment, and Reasoning (Table 1). Specific capabilities within each area define these levels, necessitating that agents exhibit the capabilities for a given level across all areas (an agent with Level 3 capabilities in the Experiment area but only Level 2 capabilities in Reasoning and Hypothesis areas would be classified as Level 2).
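The classification rule above amounts to taking the minimum level across the three areas. A minimal sketch, using illustrative class and function names that are not part of the original taxonomy:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AreaLevels:
    """Per-area capability levels (0 to 3) for one agent."""
    hypothesis: int
    experiment: int
    reasoning: int


def overall_level(levels: AreaLevels) -> int:
    """An agent's overall level is the lowest level it reaches in any area."""
    values = (levels.hypothesis, levels.experiment, levels.reasoning)
    for value in values:
        if value not in (0, 1, 2, 3):
            raise ValueError("each area level must be 0, 1, 2, or 3")
    return min(values)


# The example from the text: Level 3 in Experiment, Level 2 elsewhere.
agent = AreaLevels(hypothesis=2, experiment=3, reasoning=2)
print(overall_level(agent))  # 2
```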
Level 0, denoted as no AI agent, uses ML models as tools coordinated by interactive and foundation learning models. At this level, ML models do not independently formulate testable and falsifiable statements [78] as hypotheses. Instead, model outputs help scientists to form precise hypotheses. For example, a study employed AlphaFold-Multimer to predict interactions of DONSON, a poorly understood protein, leading to a hypothesis about its functions [79]. Level 1, termed AI agent as a research assistant, features scientists setting hypotheses, specifying necessary tasks to achieve objectives, and assigning specific functions to agents. These agents work with a restricted range of tools and multimodal data to execute these tasks. For instance, ChemCrow [2] combines chain-of-thought reasoning [80] with ML tools to support tasks in organic chemistry, identifying and summarizing literature to inform experiments. In another example, AutoBa [81] automates multi-omic analyses. These two agents are designed for narrow scientific domains; ChemCrow and AutoBa optimize and execute actions to complete tasks that are designed and predefined by scientists. Level 1 agents [82, 83, 81, 2] formulate simple hypotheses inferred from existing knowledge and utilize a limited set of tools, lacking the capacity necessary to achieve Level 2 autonomy.
At Level 2, AI agent as a collaborator, the role of AI expands as scientists and agents collaboratively refine hypotheses. Agents undertake tasks critical for hypothesis testing, using a wider array of ML and experimental tools for scientific discovery [76]. However, their capability to understand scientific phenomena and generate innovative hypotheses remains constrained, highlighting a linear progression from existing studies. The transition to Level 3, or AI agent as a scientist, marks a major evolution, with agents capable of developing and extrapolating hypotheses beyond the scope of prior research, synthesizing concepts beyond summarizing findings and establishing concise, informative, and clear conceptual links between findings that cannot be inferred from literature alone, eventually yielding a new scientific understanding. While multiple Level 1 agents exist across various scientific fields, Levels 2 and 3 agents have yet to be realized.
The levels of autonomy described for artificial general intelligence (AGI) agents in scientific contexts, particularly in biology, deviate from existing taxonomies that focus on general human-AI interaction separate from the collaborative dynamics between scientists and AI. Existing taxonomies of autonomy consider solely the balance of responsibilities between AI agents and humans—with no consideration of biomedical discovery—and focus on developing AGI to surpass human performance across varying skill levels [84].
As the level of autonomy increases, so does the potential for misuse and the risk of scientists developing an overreliance on agents. While agents have the potential to enhance scientific integrity, there are concerns regarding their use in identifying hazardous substances or controlled substances [85]. Responsible development of agents requires developing preventive measures [86, 87]. The responsible deployment of agents must account for the risk of overreliance, particularly in light of evidence that LLMs can produce convincing but misleading claims and spread misinformation. The risks will likely increase as agents undertake more autonomous research activities. Agents must be scrutinized as scientists, including reproducibility and rigorous peer review of agentic research. We illustrate these definitions of levels by giving examples of progression between the levels in genetics, cell biology, and chemical biology. We selected these areas because of the availability of large datasets that have recently driven the development and application of ML models. We describe the challenges and limitations that biomedical AI agents may present for an enhanced understanding and progression through levels of autonomy.
Illustration of AI agents in genetics
Research in human genetics seeks to understand the impact of DNA sequence variation on human traits. LLM-based agents operating at Level 1 would perform specific tasks relevant to genetic studies. For instance, in a genome-wide association study (GWAS), a Level 1 agent can write bioinformatics code to process genotype data to (1) execute quality control measures, such as the removal of single-nucleotide polymorphisms (SNPs) missing in many individuals or control for population stratification [88], (2) estimate ungenotyped SNPs through imputation, and (3) conduct the appropriate statistical analyses to identify relevant SNPs, taking into account the false discovery rate [89]. Following the analysis, the Level 1 agent reviews and reports findings, including any filtered SNPs and rationales for their exclusion.
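Two of the steps above, the missingness filter in quality control and false discovery rate control, can be sketched in a few lines. This is a simplified illustration rather than a production GWAS pipeline: the SNP identifiers, genotype calls, p-values, and the missingness threshold are hypothetical, and the FDR step uses the standard Benjamini-Hochberg procedure as one common choice. Population stratification and imputation are not covered.

```python
def filter_by_missingness(genotypes, max_missing=0.05):
    """Split SNPs into kept and removed based on the fraction of missing calls.

    genotypes: dict mapping SNP id -> list of calls (0, 1, 2, or None if missing)
    Returns (kept, removed), where removed maps SNP id -> missing fraction,
    so the exclusion rationale can be reported.
    """
    kept, removed = {}, {}
    for snp, calls in genotypes.items():
        if not calls:
            raise ValueError(f"no genotype calls for {snp}")
        missing = sum(call is None for call in calls) / len(calls)
        if missing > max_missing:
            removed[snp] = missing
        else:
            kept[snp] = calls
    return kept, removed


def benjamini_hochberg(p_values, alpha=0.05):
    """Return the set of SNP ids significant at false discovery rate alpha."""
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ranked)
    cutoff = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        # Keep the largest rank whose p-value is under its step-up threshold.
        if p <= alpha * rank / m:
            cutoff = rank
    return {snp for snp, _ in ranked[:cutoff]}


toy_genotypes = {"rs_a": [0, 1, 2, 1], "rs_b": [0, None, None, 1]}
kept, removed = filter_by_missingness(toy_genotypes, max_missing=0.25)
print(sorted(kept), removed)  # ['rs_a'] {'rs_b': 0.5}

toy_p = {"rs_1": 0.001, "rs_2": 0.012, "rs_3": 0.04, "rs_4": 0.6}
print(sorted(benjamini_hochberg(toy_p)))  # ['rs_1', 'rs_2']
```

Returning the removed SNPs together with their missing fractions mirrors the reporting step in the text, where the agent lists filtered SNPs and the reason for each exclusion.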
Instead of executing narrow tasks following human instruction, a Level 2 agent identifies and executes tasks independently to refine a hypothesis initially given by the scientist. For example, it may explore the effectiveness of drugs for a patient subgroup within complex diseases, where genetic underpinnings can influence drug response [90]. Given a hypothesis that a particular drug is effective in a subset of patients with idiopathic or genetic generalized epilepsy (GGE)—a condition with a robust genetic causality [91]—a Level 2 agent would synthesize genetic information from GWAS meta-analyses [92], such as the UK Biobank [93], targeted sequencing studies [94], and knowledge bases like Genes4Epilepsy [95]. The agent identifies GGE subtypes and causal genes by analyzing patient genetic data, predicting which subgroups might benefit from the drug based on genetic markers. It would then conduct in vitro functional studies to confirm these predictions, ultimately presenting evidence on how the drug could benefit GGE patient subpopulations by synthesizing concepts beyond summarizing findings.
Level 3 agents coordinate a system of agents (Figure 3) to discover and evaluate gene markers for specific phenotypes. These agents help initiate new study groups and optimize non-invasive methods of DNA collection for cost-effectiveness and recruitment processes [96]. Once data are collected, the agents innovate statistical methods to identify causal variants from genotypic data amidst confounders such as linkage disequilibrium [97] and develop in vitro techniques for validating candidate gene markers in disease models. Level 3 agents collaborate with scientists to generate and test hypotheses for comprehensive genetic insights.
Illustration of AI agents in cell biology
Cells are fundamental units of study in cell biology. Advances in single-cell omics, super-resolution microscopy, and gene editing have generated datasets on normal and perturbed cells, covering areas such as multi-omics [98, 99, 100], cell viability [101], morphology [102], cryo-electron microscopy and tomography [103, 104], and multiplexed spatial proteomics [105, 106]. This proliferation of data has spurred interest in in silico cell modeling [107].
ML tools have been instrumental in analyzing data across these cellular modalities, but as Level 0 agents, they lack autonomous research capabilities. At Level 1, agents integrate specialized Level 0 models to assist in hypothesis testing. These agents actively assist scientists in developing hypotheses by synthesizing literature and predicting cellular responses using integrated models. For example, to help investigate the resistance mechanism of a compound, Level 1 agents predict its effects in various cellular contexts [108]. These predictions also inform experimental design, such as spatial transcriptomic [109] and proteomic [110, 111] screening. Agents then retrieve and refine experimental protocols for execution on platforms [112] and apply predefined bioinformatics pipelines, as instructed by scientists.
Level 2 agents execute predefined tasks and generate hypotheses on cellular functions and responses. They autonomously define and refine tasks to support scientific reasoning, enabling practical exploration of complex phenotypes like drug resistance. By managing the experimental cycle and continuously updating their in silico tools, Level 2 agents actively optimize experiments to focus on key variables of resistance based on a synthesis of predictive content, uncertainty, and newly acquired data, with iterative feedback from scientists [55]. Level 2 agents thus form a prototype for a virtual cell model capable of hypothesis generation, encompassing closed-loop integration of digital and experimental platforms.
Level 3 agents respond to existing challenges and anticipate future directions in cell biology research. They form hybrid virtual cell models, an organic combination of AI tools (digital agents) with high-throughput platforms (experimental agents). Digital agents, such as LLM-based agents, autonomously identify critical knowledge gaps through literature synthesis based on criteria such as data volume, biological relevance, and clinical needs, and simulate any perturbagen (extrinsic events such as gene knockouts and overexpression, compounds, and cell-cell interactions; intrinsic events such as the cell cycle) in any context. Experimental agents not only optimize experimental protocols [113, 102, 114] to enable high-throughput multimodal measurements but also develop transformative technologies that enable probing at unprecedented spatial and temporal resolution in in vitro, ex vivo, and in vivo models, uncovering pioneering insights. The ability of Level 3 agents to drive the discovery of novel biological mechanisms and therapeutic strategies shifts the role of scientists from performing operational tasks to ideation and managing hybrid cell models.
Illustration of AI agents in chemical biology
A major focus for chemical biology is understanding molecular interactions within cells to manipulate biological systems at molecular and cellular levels. An AI agent could analyze any molecular interaction, help design new drugs, and provide more valuable chemical probes for biological systems.
Despite considerable advances in applying ML to chemical biology, current approaches fall in Level 0. Scientists oversee all activities by integrating ML tools for structure prediction, docking, chemical synthesis, and molecular generation. At Level 1, the agent has an elementary ability to reason about chemical biology and can autonomously execute simple tasks, such as running ML tools or designing experiments for a given objective. However, due to limited reasoning capabilities, the agent may fail to explain more complex concepts, such as how the dynamics of molecules may influence the effects of drugs on binders, or to explore novel molecular scaffolds. For a Level 2 agent, the long-term objective is to function as a collaborator for scientists by excelling at tasks that are explicit continuations of existing scientific research, such as improving the efficiency of chemical probes, autonomously designing and testing de novo enzymes, or designing new binders by leveraging trends in related targets. Level 2 AI agents have deeper expertise in more domains, such as retro-synthesis, crystallography, bioassays, and directing robotic arms to conduct research.
The goal of a Level 3 agent in chemical biology is the ability to study all types of molecular interactions in a cell. This agent would work alongside human scientists to explore research questions that are challenging for the field, such as binder design for undruggable targets [115], significantly improving specificity and efficiency of in vivo bioorthogonal reactions, or developing new chemical probes that can access new spatial and temporal scales. Unlike the Level 2 agent’s use of well-established protocols, a Level 3 agent aims to unlock experimental capabilities that are not currently accessible. For example, AI agents could be tasked to probe molecular dynamics at longer timescales than what is currently accessible. At this level, agents have a thorough understanding of existing literature and work alongside scientists to unlock new fields of chemical biology.
Roadmap for building AI agents
An AI agent is built as a compound system that consists of modules [3, 57, 58], each implementing a distinct functionality. Here, we describe these modules (Figure 4), focusing on the perception, interaction, memory, and reasoning modules necessary for AI agents to interact with humans and engage with experimental environments. Interactions between the agent and its environment are characterized by two elements: the agent’s perception of its surroundings and its subsequent engagement with them. Perception modules enable the agent to interpret and assimilate information from various data modalities. Then, learning and memory modules allow agents to interact with an environment and complete tasks by acquiring new knowledge and retrieving previously learned knowledge. Finally, the reasoning module processes information and executes action plans. Using a published study as an example [116], Figure 5e illustrates a hypothetical AI agent that examines the molecular mechanisms of selective removal of mitochondrial DNA mutants in the Drosophila female germline through perception, interaction, memory, and reasoning modules.
The division of research into smaller tasks handled by AI agents presents an intriguing approach, building on the success of modular and sequential bioinformatics workflows built with tools like Snakemake and Docker. Unlike these workflows, which are often static and require manual updates and reconfiguration to handle new tasks or integrate new tools, AI agents are dynamic and operate in a personalized, user-specific, and context-appropriate manner. They can learn to use new tools and adjust their workflows based on the specific instructions and needs of the scientist. Further, the adaptive allocation of tasks by AI agents can be helpful in automatically incorporating new tools and restructuring existing pipelines, much like a human researcher would. For example, AI agents could experiment with and create new protocols beyond the currently established methods for integrating multimodal omics data. While established protocols exist for integrating multi-modal data, such as scRNA-seq with scATAC-seq or spatial data, AI agents could, based on their initial attempts, develop new pipelines for multi-modal integration beyond these three modalities, or for multi-scale integration, such as atlas-scale single-cell and bulk RNA-seq data, or normal and disease state data from cell lines, organoids, and patient samples.
Perception modules
Perception modules equip LLM-based agents with the capability to understand and interact with elements in the environment in which they operate, such as biological workflows and human users. For perception, agents need to integrate abilities to receive feedback from multiple sources: scientists [49], the environment [62], and other AI agents [117, 13]. This requires accommodating a diverse array of modalities. These include text descriptions [6]; images from light and (cryo-)electron microscopy to assess cellular processes across many conditions simultaneously [103, 104, 118]; videos from live imaging to assess developmental processes or animal behaviors across time [119]; longitudinal biosensor readouts and genomics profiles of cells [120]; mass spectrometry-based proteomics to decipher protein homeostasis [121, 24]; and miniaturized platforms for conducting biochemical assays and 3D culture systems that mimic the physiological context of organ systems [112].
AI agents can take different approaches to interacting with environments. The most direct one involves using natural language, which represents a common perception modality for LLM-based agents. Other techniques involve multi-modal perception modules, where agents process multi-modal data streams from the environment or align multi-modal inputs with text-based LLMs.
Conversational modules. With the rise of ChatGPT, the ability of AI agents to interpret natural language has reached such a high level [49] that it is now possible to build interfaces to agent systems that are entirely based on natural language with limited misinterpretations. The main focus is chat interfaces that preserve conversational history in a scrolling window, where users can converse with agents in a manner that resembles the standard approach of written human-to-human interaction. This approach allows scientists to express their queries using their language, promoting initiative and enabling them to precisely describe what they want. We envision that agents will maintain a history of interaction with scientists using natural language, which, in turn, will allow us to keep track of scientific interactions with agents [62, 68]. Combining the history trace of these interactions with retrieval-augmented generation (RAG), it will be possible to develop personalized discovery workflows tailored to individual scientists.
Multi-modal perception modules. Agents align LLMs with other data types to consider data modalities beyond natural language. This approach helps agents better model the changing environment in which the agent acts and dynamically adjust its outputs to new situations, such as evolved biological states in a virtual cell model. The alignment process involves two main strategies: textual translation and representation alignment. Textual translation converts inputs into a textual format, such as transforming data from robotics into textual descriptions that log environmental states [9]. For example, when handling readouts from experimental devices, the readouts can be combined with a textual description of their meaning, allowing the LLM to understand the readouts as a new modality. Alternatively, through representation alignment, data from different modalities are analyzed by modality-specific models to generate representations, such as using the visual encoder from CLIP [122] for visual information processing. These representations are then aligned with LLM textual representations through instruction tuning [123, 118], enabling agents powered by LLMs to perceive and interpret multi-modal data. For instance, to make LLM-based agents handle the protein structure data, an additional encoder is required to encode the protein structure data into a representation aligned with the LLMs’ representation space. This encoder is pre-trained with modality-specific training schemes, and an adaptor is placed between this encoder and LLMs to align the representations of the two modalities. Then, instruction tuning is applied using data containing both modalities to train the adaptor for alignment. An alternative to alignment involves allowing the agents to receive input expressed in different modalities [7, 124]. For instance, Fuyu [124] uses a decoder-only transformer architecture to process image patches and text tokens jointly. 
Similarly, Gemini [7] is engineered to handle visual, audio, and text inputs within a single model. Once perception modules are implemented for agents to receive inputs from the environment, modules for interaction and reasoning follow to process the inputs and interact externally. Training agents with strong perception abilities on biomedical data requires extensive, high-quality data pairs that align multiple modalities. However, collecting such data remains challenging. For example, multimodal experimental platforms are non-existent or have low-throughput yields, certain tissues and cell types are not experimentally available, and a long tail of disease phenotypes has small sample sizes, making data collection infeasible.
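The textual-translation strategy described above, in which device readouts are paired with a description of their meaning, might look roughly like the following. The function name, the bracketed instrument tag, and the example field names are illustrative assumptions rather than an established format.

```python
def readout_to_text(instrument, measurements, units, note=""):
    """Render a device readout as one line of text an LLM can consume.

    measurements: mapping of quantity name -> numeric value.
    units: mapping of quantity name -> unit string (may omit some names).
    note: optional plain-language explanation of what the readout means.
    """
    parts = []
    for name, value in measurements.items():
        # Format the number compactly and append the unit when one is known.
        parts.append(f"{name}={value:g} {units.get(name, '')}".strip())
    line = f"[{instrument}] " + "; ".join(parts)
    return f"{line}. {note}" if note else line


# Hypothetical plate-reader readout.
example = readout_to_text(
    "plate_reader",
    {"OD600": 0.42, "temperature": 37.0},
    {"temperature": "C"},
    note="OD600 tracks culture density",
)
```

Here `example` is the string `[plate_reader] OD600=0.42; temperature=37 C. OD600 tracks culture density`, which can be appended to the agent's context like any other text.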
Interaction modules
Beyond conversational modules, scientists use ML-based and other tools in biological research, explore datasets through graphical user interfaces (GUIs) to analyze and visualize data, and engage with physical equipment and wet lab experimental platforms. Chat-optimized LLM-based agents thus need interaction capabilities to communicate and collaborate with scientists, other AI agents, and tools to function beyond a simple chatbot. Agents must incorporate essential interaction modules to interact with elements in the environment. These include agent-human interaction to support communication with scientists and following human instruction [125, 126], multi-agent interaction for collaboration among agents, and tool-use action to access ML tools and experimental platforms.
These interaction abilities of LLMs, when combined with interactive ‘function calling’ (i.e., LLM requesting for tasks to be completed), can act as an intermediary between scientists and the agent’s interface, as well as between scientists and various functional items, such as tools and other agents. This approach allows scientists to express their intentions in natural language without needing to search for how and where to accomplish tasks. At the same time, the advantages of functional items are preserved because agents can interact with tools and use them to provide feedback. However, interactive modules trained on general, non-biological domains might not be well-suited for specialized biomedical terminologies, requiring in-domain training on biomedical tools.
Agent-human interaction modules. The interaction between scientists and AI agents synchronizes scientific objectives with AI agents through cooperative communication and modeling of biological knowledge. Natural language processing and human evaluation methods are predominantly used to develop this interaction capability. InstructGPT [49] enhances the GPT model through supervised fine-tuning with examples of human dialogues to improve the model’s conversational skills. The alignment between agents and humans can be refined through RLHF, which adjusts the model based on a reward model trained using human assessments of the model’s responses. Alternatively, RLHF can be replaced by direct preference optimization [127], which is a parameterized method that provides a more consistent and efficient alignment with human preferences. Through agent-human interaction, agents become attuned to human needs and preferences [126, 10], using human insight as a directive for carrying out complex tasks [13]. For instance, Inner Monologue [126] employs human feedback to discern user preferences or interpret ambiguous requests in an embodied context. In AutoGPT [10], humans formulate tasks and score solutions returned by agents, and AutoGen [13] can use human expertise to solve tasks better than agents alone.
Multi-agent interaction. Multi-agent interactions support solving complex goals that agents could not complete if they operated independently. In such interdisciplinary systems, agents that could specialize in different biological domains, each with distinct capabilities, engage in interactions through various communication means. Language has emerged as the predominant medium for multi-agent interactions due to the ability of agents to communicate with humans linguistically [4, 128, 13, 117, 73]. An instance of this is Generative agents [62], which create interactive environments where agents mimic human behavior and interact using natural language. Different strategies are used for multi-agent interaction, including cooperation [129, 130, 131] and negotiation [73, 132, 133]. For example, MetaGPT [130] applies standardized operating procedures from human teamwork to define tasks and agent responsibilities.
Through these approaches, agent interactions make it possible to tackle tasks that are too complex for just one agent to handle [134, 82]. MedAgent [82] leverages the expertise of multiple medical AI agents for medical reasoning. Similarly, RoCo [134] employs robot agents with varied roles to accomplish complex tasks in the physical world. Multi-agent interaction can also boost the proficiency of less skilled agents by allowing them to learn from more experienced counterparts [135]. These interactions also enable the creation of simulations for a variety of environments, ranging from public health scenarios [136] to human social behaviors [62, 137], enhancing the system’s adaptability and application in diverse contexts.
Tool use. To manage tasks from diverse environments, agents require tools to boost their capabilities [138]. Commonly used tools are application APIs [139], search engines [140], ML models [141], knowledge databases [142], and robotic machinery for physical tasks [143, 144, 9]. Different Level 1 agent systems have been developed that can interact with one or more types of tools. ChemCrow [2] leverages chemical tools and search engines to address chemical challenges. WebGPT [140] can conduct searches and navigate web browsing environments. SayCan [144] controls a robot in the physical world using an LLM to complete tasks. To invoke these tools, AI agents generate commands in specific formats [141, 139, 142] or query pre-trained control models to execute actions [144, 145]. To develop these capabilities, agents can use in-context learning [141] or fine-tuning with tool-use demonstrations [139], where the latter represents a more sophisticated approach.
In the case of in-context learning, it is necessary to include system abilities in the prompt so agent systems can use ‘function calling’ to query tools. For example, HuggingGPT [141] uses ChatGPT as a controller to integrate all ML models on Hugging Face through in-context learning. The alternative approach consists of using model fine-tuning with ‘function calling’ to create an LLM-based agent with integrated abilities of a function/tool. For instance, Toolformer [139] introduces a self-supervised learning method to master the use of tools’ APIs with minimal demonstrations for each API.
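A minimal sketch of the in-context variant follows: tool descriptions are rendered into text for the prompt, and a JSON call emitted by the model is dispatched to the matching function. The registry, the `{"name": ..., "arguments": {...}}` call format, and the `gc_content` tool are assumptions for illustration and do not reproduce the interface of any system cited above.

```python
import json

TOOLS = {}


def tool(name, description):
    """Register a function so it can be advertised in the prompt and called by name."""
    def register(fn):
        TOOLS[name] = {"fn": fn, "description": description}
        return fn
    return register


@tool("gc_content", "Fraction of G/C bases in a DNA sequence. Args: sequence (str).")
def gc_content(sequence):
    seq = sequence.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def tool_prompt():
    """Text listing the available tools, to be placed in the LLM's context."""
    return "\n".join(f"- {name}: {entry['description']}" for name, entry in TOOLS.items())


def dispatch(call_json):
    """Execute a model-emitted call of the form {"name": ..., "arguments": {...}}."""
    call = json.loads(call_json)
    name = call.get("name")
    if name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        result = TOOLS[name]["fn"](**call.get("arguments", {}))
        return {"ok": True, "result": result}
    except Exception as exc:
        # Report the failure back to the agent instead of crashing the loop.
        return {"ok": False, "error": str(exc)}
```

Returning errors as data rather than raising them lets the agent read the failure and retry with corrected arguments.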
By modeling scientists’ needs by analyzing natural language textual inputs, AI agents can select the most likely available tool, identify the desired user interface component, and execute the scientist’s expected actions. Interaction modules are designed to be integrated and adapted to suit changing environments. For Level 2 and Level 3 agents, agents autonomously learn new types of interactions and how/when to start using new tools.
Memory and learning modules
When using tools and ML models for biological research, scientists keep records of experimental logs and plan their next steps based on them. In AI agents, memory modules alleviate the need for manual log recording by memorizing necessary experimental outputs. Contrary to ML models that perform one-time inference to generate predictions, memory modules in LLM-based agents store and recall information. This is necessary for executing complex tasks and adapting to new or evolving environments. Memory modules are designed to store long-term and short-term learned knowledge. As agents encounter new situations and acquire data, memory modules get updated with new information.
Long-term memory modules. Long-term memory stores essential and factual knowledge that underpins agent behavior and understanding of the world, ensuring this information persists beyond task completion. This memory can be internal, encoded within the model’s weights via learning processes [8, 146], or external, maintained in auxiliary knowledge bases [147, 148]. Internal memory is directly used for accomplishing zero-shot tasks [6, 7] while accessing external memory requires actions by the agent to fetch and integrate data into short-term memory for immediate use [149, 150]. For instance, ChatDB [142] uses an external database for memory storage, and MemoryBank [151] encodes memory segments into embeddings for later retrieval. Agents can query knowledge banks, such as a GWAS database to find genetic evidence for a candidate protein target, a knowledge base of therapeutic mechanisms of action, and scientific literature with up-to-date information for the agent to integrate and decide whether the protein can be modulated through a therapeutic perturbation (Figure 5b). The learning process updates long-term memory by adding new knowledge or replacing outdated information. Internal memory of an agent can be updated using parameter-efficient fine-tuning [146, 152], interactive learning [49], and model editing [153]. These strategies must be effective for large models [152] and avoid the loss of previously learned information [154]. On the other hand, updating external memory is more straightforward, involving modifications to the knowledge base [142, 151]. For example, in drug discovery, updating long-term memory by adding a new compound in development to the drug bank is a convenient way to maintain an up-to-date agent.
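The external-memory pattern described above (store records, retrieve the most relevant ones, overwrite outdated entries) can be sketched as follows. The bag-of-words `embed` function is a deliberately crude stand-in for a learned text encoder, and the class name, key scheme, and record contents are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; stands in for a learned text encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ExternalMemory:
    def __init__(self):
        self.entries = {}  # key -> (text, vector)

    def write(self, key, text):
        """Add a record, or replace an outdated one stored under the same key."""
        self.entries[key] = (text, embed(text))

    def retrieve(self, query, k=1):
        """Return the k stored texts most similar to the query."""
        q = embed(query)
        ranked = sorted(self.entries.values(),
                        key=lambda entry: cosine(q, entry[1]),
                        reverse=True)
        return [text for text, _ in ranked[:k]]
```

Updating this kind of memory is a plain overwrite of a keyed record, which is why external memory is easier to keep current than knowledge encoded in model weights.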
Short-term memory modules. AI agents use short-term memory to temporarily store information during their interactions. This short-term memory is enabled through in-context learning, where relevant information is integrated as context prompts [155, 144] or via latent embeddings [123, 118] in LLMs. For chatbots, previous conversations are kept as text prompts, supporting multiple rounds of dialogue [49, 156]. The text-based approach lays the groundwork for communication in multi-agent [73, 133] and agent-human scenarios [10, 13]. In embodied AI agents, environmental feedback [144, 155] is captured in textual format, acting as a short-term memory that aids reasoning. Following perception, multi-modal inputs are converted into latent embeddings, which function as short-term memory. LLaVA [118] uses latent embeddings generated by visual encoders to retain visual information. Short-term memory allows agents to temporarily acquire skills, such as tool usage [141, 139], store information about recent states of a biological system [156, 155], and keep track of outcomes from earlier reasoning efforts [11]. This learning mechanism is crucial for agents to learn and apply new knowledge under new conditions. Moreover, short-term memory can temporarily override long-term memory, allowing agents to prioritize recent information over older knowledge within their model weights [157]. Past experiences stored in short-term memory can inform the agent's choice of which experiments to run next. In Figure 5a, we detail an example where the agent recalls experiments for a homologous protein to inform the initial inhibitor design for the given protein.
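For text-based short-term memory, one simple reading of the description above is a bounded buffer of recent turns that is rendered into the prompt. The class name, the turn limit, and the `scientist:`/`agent:` role labels below are assumptions for illustration.

```python
from collections import deque


class ShortTermMemory:
    def __init__(self, max_turns=4):
        # A bounded deque drops the oldest turn once the limit is reached.
        self.turns = deque(maxlen=max_turns)

    def add(self, role, text):
        self.turns.append((role, text))

    def as_prompt(self, question):
        """Render the retained turns plus the new question as a context prompt."""
        lines = [f"{role}: {text}" for role, text in self.turns]
        lines.append(f"scientist: {question}")
        lines.append("agent:")
        return "\n".join(lines)
```

Because only the most recent turns survive, anything that must persist beyond the buffer has to be written to long-term memory explicitly.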
Reasoning modules
Biological research involves a multidisciplinary and multistage process that integrates the expertise of scientists from various disciplines. Scientists formulate hypotheses, design experiments based on these hypotheses, interpret the results, and plan the next steps. The integration of reasoning capabilities in AI agents can assist biological research throughout this process. Reasoning improves agents’ capabilities to plan experiments, make decisions on biological hypotheses, and resolve competing candidate biological mechanisms. AI agents that use large language models can implement interactive dialogue systems to explain ML models through natural language conversations. Reasoning modules can be implemented using prompting [158] and few-shot in-context learning [80]. Additionally, agents can use planner models [159, 160] and action models [155]. We classify reasoning modules into two categories: direct reasoning and reasoning with feedback, depending on whether agents adjust their plan in response to experimental or human feedback.
Direct reasoning modules. In direct reasoning, an agent performs planning and reasoning based on the current state of the environment, which can follow different reasoning patterns, such as single-path and multi-path reasoning. Single-path reasoning involves the agent breaking down the task into multiple recursive steps [161]. For instance, chain-of-thought (CoT) reasoning allows agents to reason step-by-step either by using in-context examples [80] or by applying a zero-shot prompt like “Let’s think step-by-step” [158]. Leap-of-thought [162] encourages the model to use creative rather than logical reasoning. Although single-path reasoning matches well with certain situations [163], its ability to adjust to different conditions is limited.
Conversely, multi-path reasoning examines several paths before consolidating them into a final plan [164, 165], allowing for a more thorough planning process that accounts for different scenarios. For example, Least-to-Most prompting [166] breaks down tasks into subproblems solved sequentially. Self-consistent CoT [167] chooses the most consistent answer from a set of CoT answers. Tree-of-thoughts [164] extends reasoning paths into a tree-like structure, generating multiple paths from each thought node and using search algorithms to select the final path. Graph-of-thoughts [168] further develops reasoning paths into a graph structure for complex reasoning. To identify the optimal path, methods such as voting strategies [167], Monte Carlo tree search [169], and breadth/depth-first search algorithms [164] are used. Through direct reasoning, agents can generate multiple threads of thought that could consider the best pathways, protein targets, and experiments that can be run to test the role of a candidate protein target (Figure 5c).
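The voting step used by self-consistent CoT can be sketched as a majority vote over the final answers of independently sampled reasoning chains. Sampling the chains themselves requires an LLM and is out of scope here; the function name and the lowercase normalization are illustrative choices.

```python
from collections import Counter


def self_consistent_answer(answers):
    """Majority vote over final answers from independently sampled reasoning chains.

    Returns the winning answer (normalized) and the fraction of chains that agree.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [answer.strip().lower() for answer in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)
```

The agreement fraction is a crude signal of how contested the answer is; a low value could prompt the agent to sample more chains or to ask the scientist.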
Reasoning with feedback. Experimental and human feedback can help AI agents to improve reasoning and planning processes [11, 68, 149]. This feedback may include agent-human interaction and responses from agents, which can be complementary biological assays quantifying downstream effects of target molecules [170]. In each reasoning cycle, ReAct [11] incorporates insights from previous actions to refine its thought process and inform future actions. LLM-Planner [171] dynamically adjusts plans based on new observations in an embodied environment. Inner Monologue [126] uses both passive and active scene descriptions and feedback from recent actions to guide future actions. Voyager [68] improves planning for subsequent steps by considering environment feedback, execution errors, and self-verification.
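The common pattern in these systems, alternating between proposing an action and folding the resulting observation back into planning, can be sketched generically. This is not the algorithm of any cited system; `propose` and `execute` are placeholder callables, and the dose-doubling example is a hypothetical stand-in for an LLM planner and an experimental platform.

```python
def run_with_feedback(propose, execute, max_steps=5):
    """Alternate between proposing an action and observing its outcome.

    propose(history) -> next action, or None when the agent decides to stop.
    execute(action)  -> observation (experimental or human feedback).
    history is the list of (action, observation) pairs seen so far.
    """
    history = []
    for _ in range(max_steps):
        action = propose(history)
        if action is None:
            break
        history.append((action, execute(action)))
    return history


# Hypothetical example: double a dose until a response is observed.
def propose_dose(history):
    if not history:
        return 1.0
    last_dose, responded = history[-1]
    return None if responded else last_dose * 2


def simulated_assay(dose):
    return dose >= 4.0  # stand-in for a wet-lab readout


trace = run_with_feedback(propose_dose, simulated_assay)
```

With these stand-ins, `trace` is `[(1.0, False), (2.0, False), (4.0, True)]`. The `max_steps` cap bounds the loop if feedback never satisfies the stopping condition.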
Beyond external feedback, an agent’s own feedback mechanism enables self-assessment of the initial plan [170, 172]. Techniques like self-refine [170] revise action outputs based on the LLM’s evaluation; the self-check [170] mechanism allows the agent to review and adjust its reasoning; and reflection [12] mechanisms prompt agents to update their decision-making. These techniques can incorporate feedback from biologists, such as suggested experimental methods, and environmental constraints like lab inventory (Figure 5d). Reasoning capabilities are necessary for generating hypotheses and conducting experiments. Generating novel hypotheses requires modeling general biomedical knowledge, the specific information on the current state of a biological system, and consideration of potential next steps. LLM-based agents can generate hypotheses through in-context reasoning, but careful selection is necessary to ensure high-quality hypotheses [173].
Challenges
The perspective outlines key steps to implement AI agents in biomedical research and highlights areas that can benefit from agentic AI. Challenges remain and may, in some cases, be amplified when multi-agent systems become available (Figure 6).
Robustness and reliability
A barrier facing the deployment of agent systems, specifically those categorized within Levels 2 and 3 as discussed in Table 1, is their propensity for generating unreliable predictions, including the hallucination of non-factual information, reasoning errors, systematic biases, and failures in planning when connected with tools and experimental platforms. These issues can be exacerbated by overconfidence in such flawed predictions (agents lack awareness of their knowledge gaps) and high sensitivity to the precise formulation of queries, particularly in the context of LLM-based agents. This behavior has been traced to how these models are trained. In particular, autoregressive loss compares the predicted word sequence with the actual sequence in the training data. The performance of a model trained with this loss is determined by three factors: the probability distribution of the inputs, the sequence of generated outputs, and the frequency of different tasks encountered during training [174]. As a result, model performance degrades on task variants that deviate from the assumptions made during training [175].
Sensitivity to input and task probability also offers a potential explanation for the widely observed success of various prompting techniques [80, 176, 164] (methods to paraphrase the same query). By providing informative context, instructive reasoning steps, or representative examples, these techniques can act as an empirical means by which task and input probability (and, thus, model performance) are increased. However, crafting high-quality prompts tends to be highly empirical while requiring significant effort and domain knowledge.
Beyond the linguistic domain, even the most advanced models fail in tasks with real-world entities that require physically meaningful actions, posing an obstacle to embodied agents. While embedding continuous sensor data into a language model can lead to improvements [120], limitations to understanding physical interactions and long-horizon planning remain. The complexities of training such multi-modal systems, the need for large datasets to cover the range of embodied tasks and environments, and the computational demands of processing multi-modal inputs all remain open questions [7]. Deployment faces challenges from false negatives causing repeated attempts and eventual stalling of the embodied agent [126]. Hence, it is necessary to verify the agent action plan before execution.
Uncertainty quantification can trigger fall-back safety measures like early termination, pre-defined safe maneuvers, or human-in-the-loop interventions. However, foundation models cannot reason about the uncertainty associated with their outputs, and no well-established statistical protocol exists for increasingly ubiquitous architectures [177, 47]. Techniques such as various forms of prompting [178, 167, 179] estimate uncertainty based on the model’s predictive distribution, p(output|input), which may itself be subject to bias ([174], Section 3.3); furthermore, such estimates neither consider the distribution of model parameters consistent with the observed training data nor marginalize over their predictions [180]. While conformal prediction [181] has emerged as a framework for uncertainty estimation of model predictions, its sensitivity to the choice of underlying statistical assumptions and the calibration of confidence levels have been criticized. The lack of a default technique is partly due to the difficulty of establishing a thorough quality assessment of uncertainty estimates. This makes it difficult to make choices in agent design and to reassure users about an agent’s calibration.
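To make the link between uncertainty estimates and fall-back measures concrete, the sketch below uses split conformal prediction to build a label set and defers to a human whenever that set is not a single label. It assumes calibration and test examples are exchangeable, which is exactly the kind of assumption the paragraph above flags as a point of criticism; the nonconformity score (one minus the predicted probability), the function names, and the deferral rule are illustrative choices.

```python
import math


def conformal_quantile(calibration_scores, alpha=0.1):
    """Score threshold q from a held-out calibration set.

    Under exchangeability, a new example's nonconformity score is <= q
    with probability at least 1 - alpha.
    """
    scores = sorted(calibration_scores)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    return math.inf if rank > n else scores[rank - 1]


def prediction_set(class_probs, q):
    """Labels whose nonconformity score (1 - predicted probability) is within q."""
    return [label for label, p in class_probs.items() if 1.0 - p <= q]


def act_or_defer(class_probs, q):
    """Proceed only when exactly one label remains; otherwise hand off to a human."""
    labels = prediction_set(class_probs, q)
    if len(labels) == 1:
        return ("proceed", labels[0])
    return ("defer_to_human", labels)
```

With too few calibration examples the threshold becomes infinite, every label is retained, and the agent always defers, which is a conservative failure mode.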
One concern is that advanced capabilities come at the cost of compromised transparency and the risk of misalignment. For instance, integrating human feedback can promote desirable agent behavior, but it can also exacerbate persuasive abilities, echoing false beliefs [182]. Fine-tuning existing models with new data can compromise their original alignment, challenging the integrity of the AI agent’s intended purpose [183]. Jailbreak attacks can similarly compromise alignment after deployment, highlighting the need for rigorous evaluation [184].
Errors are inevitable in complex multi-agent systems, making their management crucial to maintaining system robustness and reliability. Due to their interactive nature, these systems are sensitive to compounding errors, where small issues can escalate into significant problems if not addressed promptly. Effective error management strategies are essential for diagnosing, localizing, and mitigating such errors.
Evaluation protocols
With more AI agents being developed, evaluation frameworks for biologists and lay users need to assess axes of agent performance beyond accuracy. Evaluating AI agents requires an analysis of their theoretical capabilities and an assessment of practical implications, including ethical considerations, regulatory compliance, and the ability to integrate into discovery workflows. The challenge lies in developing evaluations that consider these diverse factors. Agents that integrate ML tools, particularly those developed by corporations, may undergo updates without prior notice to users. This poses challenges for reproducibility, as updates may alter the model’s behavior or performance without researchers being aware. The scientific community needs transparent change logs and version control for agents, akin to standard practice in software development.
Existing evaluation frameworks either perform holistic evaluations [185, 186] or benchmark models for weak spots such as task framing [187, 188], long temporal dependencies, invalid formatting, or refusal to follow instructions [189]. A caveat of such frameworks is the risk of evaluating how well the agents have learned to use specific APIs versus general results grounded in real-world interaction. Another challenge in evaluating agents is that biological systems are inherently dynamic, characterized by non-stationary distributions that evolve due to genetic mutations, environmental changes, and evolutionary pressures. Agents trained on static datasets may struggle to accurately model or predict outcomes in these changing systems. The challenge lies in developing agents capable of adapting to or continuously learning from new data, ensuring their predictions remain accurate as the underlying biological systems change. Techniques such as online learning, transfer learning, and reinforcement learning can be used to address this issue, but they come with their own set of challenges related to data availability and model complexity. Another challenge is the lack of standardization in biomedical discovery workflows, including data generation protocols that vary based on factors like disease cell lines, dosage levels, and time points [190]. This variability complicates the evaluation of agents for experimental planning. Evaluation of agents that use computational tools and databases will benefit from the increasing availability of standardized and application APIs [191, 192].
Dataset generation
As laid out, the vision for biomedical AI agents requires the capability of seeking, aggregating, perceiving, and reasoning over data from various modalities, created using differing specifications and with inherent variation in quality and volume. To support this vision, there is a critical need for large, open datasets that are both comprehensive and accessible, enabling the development of models across biological applications. Much human effort in building systems for biomedical research is dedicated to gathering and preparing such data for use in ML models (e.g., specific to a particular modality, such as graphs, time series, or discrete sequences [193]). This requires vetting processes and clear criteria for assessing the reliability and applicability of datasets.
Noisy data, characterized by errors, inconsistencies, and outliers, poses a significant challenge for models attempting to extract meaningful patterns and insights with minimal human oversight or data preparation effort. In addition, multi-modal data requires models to process different data representations and formats and bridge semantic gaps between them. Tackling these challenges necessitates advanced feature extraction, fusion, and noise mitigation techniques while maintaining robustness. As no pretraining phase (no matter how extensive) will be able to provide adequate examples from all data sources, models will also have to generalize to previously unseen sensory inputs.
Governance of AI agents
The governance of AI agents presents challenges that intersect technological, scientific, ethical, and regulatory domains. One challenge is establishing comprehensive governance frameworks that balance innovation with accountability [194]. As AI agents gain autonomy, the necessity for robust guidelines to ensure responsible development, deployment, and commercialization grows. The discourse increasingly advocates for agent safeguarding to take precedence over further advancements in autonomy. Yet, navigating the regulatory landscape and forging an international consensus on AI governance remains complex while the advancement of agent capabilities continues. Striking a balance between innovation and safeguarding against potential risks requires collaboration among industry leaders, scientists, and policymakers [195].
Safe adoption of AI agents requires addressing concerns of safe deployment. Aligning ML tools, such as LLMs, with ethical standards remains an open challenge, and ensuring the alignment of the agent as a digital entity adds further complexity. Guidelines concerning human-agent interactions are underdeveloped despite the potential for unintended harmful consequences and malicious intent. Safeguarding frameworks are being developed that include training, licensing, and mandatory safety and ethical compliance checks for agents [86].
As AI agents become more integral to workflows in biological domains, monitoring their behavior grows increasingly complex. Currently, verifying the accuracy and trustworthiness of agent outputs is not straightforward, with only a limited number of systems capable of linking generated content to relevant references. It is essential to develop robust verification systems that can provide traceable references for generated content. Assessing the synthesized knowledge may be impractical and unattainable as agents evolve further. When agents’ capabilities become comparable to those of human experts, the risk of becoming overly reliant on AI increases, which could lead to a decrease in human expertise. In the worst-case scenario, such reliance could introduce a broad spectrum of safety hazards due to inadequate oversight. To address these challenges, human-in-the-loop approaches can help maintain accountability. Continuous training and development of human expertise alongside AI can mitigate the risks of over-reliance on AI.
Risks and safeguards
Autonomous experiments that do not include careful planning, broad consultation, competent execution, and ongoing adaptation might create long-term harms that outweigh the benefits. Although anticipating all potential complications is impossible, exploring possible problems early and frequently could reduce the expected cost of such issues. The ethical and technical considerations relevant to AI agents are vast and deeply interconnected, particularly in biomedicine. This section will highlight some key categories.
Neglect can lead to risks similar to those of malicious intent. Multi-agent systems where some agents represent LLMs might, through equipment malfunctions and insufficient maintenance, inadvertently create harmful substances, for instance, by contaminating a procedure that would otherwise be safe. This issue is not unique to multi-agent systems; instead, it is a general lab safety concern. However, the absence of close human supervision removes a critical auditing layer. The increased role of automation in agent systems raises safety issues: a powerful, unaligned system prone to misinterpreting user requests or unfamiliar with lab safety practices could, given access to a well-stocked scientific facility, do damage by, for instance, mixing volatile substances or developing and dispersing toxins or pathogens. These are among the scenarios that most concern AI safety researchers.
Agents leverage LLMs’ world knowledge and general reasoning abilities obtained during pretraining for robotics and planning. However, while efforts have been made to teach robots the “dos,” the “don’ts” have received less attention. Teaching robot agents the “don’ts” is crucial and involves conveying instructions about prohibited actions, assessing the agent’s understanding of these restrictions, and ensuring compliance [196]. For LLM agents, plug-in safety chips [196] feature safety constraint modules that translate natural language constraints into formal safety constraints for the robot to adhere to. Experiments with robots highlight the potential for integrating formal methods with LLMs for robotic control.
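As a rough illustration of what checking a plan against formal “don’ts” can look like, the sketch below validates a proposed action sequence before execution. It does not reproduce the safety chip of [196]: it assumes the natural language constraints have already been translated into two hypothetical constraint types (`Never` and `RequiresBefore`), and the action names are invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Never:
    """The named action must not appear anywhere in the plan."""
    action: str

@dataclass(frozen=True)
class RequiresBefore:
    """The named action is only allowed after the prerequisite has occurred."""
    action: str
    prerequisite: str

def violations(plan, constraints):
    """Return (step index, message) for every constraint the plan breaks."""
    found = []
    seen = set()  # actions that occurred at earlier steps
    for step, action in enumerate(plan):
        for c in constraints:
            if isinstance(c, Never) and action == c.action:
                found.append((step, f"'{action}' is prohibited"))
            elif (isinstance(c, RequiresBefore) and action == c.action
                  and c.prerequisite not in seen):
                found.append(
                    (step, f"'{action}' requires prior '{c.prerequisite}'"))
        seen.add(action)
    return found

# Hypothetical lab constraints and a plan proposed by an agent.
constraints = [
    Never("mix_bleach_with_acid"),
    RequiresBefore("open_sample_tube", "close_fume_hood_sash"),
]
unsafe_plan = ["open_sample_tube", "pipette_sample", "mix_bleach_with_acid"]
safe_plan = ["close_fume_hood_sash", "open_sample_tube", "pipette_sample"]

unsafe_report = violations(unsafe_plan, constraints)
safe_report = violations(safe_plan, constraints)
```

A plan would be released for execution only when the report is empty; richer formalisms such as temporal logic would replace the two toy constraint types in practice.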
LLMs trained on code completion can write Python programs from docstrings [197] and, more generally, generate code from natural language commands [198]. Given natural language commands, these code-writing LLMs can be re-purposed to write robot policy code. If the translation inaccurately reflects the intended safety constraints, it could lead to either overly restrictive behavior, preventing the robot from performing its tasks effectively, or insufficiently stringent constraints, leading to safety violations. Moreover, robot policy code is less reliable for enforcing safety constraints than verifiably safe operations that satisfy functional safety standards such as IEC 61508. The approach also assumes that all given instructions are feasible and lacks a mechanism to predict the correctness of a response before execution. Because they rely on patterns in the training data, LLMs might generate syntactically correct but semantically inappropriate code. Additionally, generalizing plans across robotic embodiments is brittle with current LLMs.
Addressing the ethical implications of AI agents is paramount, given the direct impact on human and animal health and life. The handling of sensitive biological and medical data necessitates robust technological and regulatory measures to ensure security and confidentiality. One promising approach involves using privacy-preserving computation to train agents while protecting the privacy of highly sensitive medical data. Homomorphic encryption can secure sensitive data by allowing computations on encrypted data, and federated learning techniques allow training agents in a distributed manner without the need to centralize data from across sites into a single repository.
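The federated setting mentioned above can be sketched with federated averaging, a widely used baseline in which each site trains locally and only model parameters are shared. The example below is a deliberately tiny illustration, assuming a one-parameter linear model y ≈ w·x and made-up per-site data; the function names are hypothetical, and real deployments would add secure aggregation and other protections, since shared parameters alone do not guarantee privacy.

```python
def local_update(w, data, lr=0.01, steps=20):
    """Run a few gradient steps on one site's private data for y ≈ w * x."""
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_average(site_data, rounds=30, w0=0.0):
    """Server loop: broadcast w, collect local updates, average by sample count."""
    w = w0
    for _ in range(rounds):
        # Each site returns only its updated parameter and its sample count;
        # the raw (x, y) pairs never leave the site.
        updates = [(local_update(w, data), len(data)) for data in site_data]
        total = sum(n for _, n in updates)
        w = sum(w_i * n for w_i, n in updates) / total
    return w

# Toy data for three sites, all generated from y = 2 * x (illustrative only).
sites = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(0.5, 1.0), (1.5, 3.0), (2.5, 5.0)],
]
w_global = federated_average(sites)
```

Weighting each site's update by its sample count keeps small sites from dominating the global model, which is one of several possible aggregation choices.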
Algorithmic fairness is equally crucial, as biased AI agents can exacerbate health disparities across patients and increase inequalities in the volume of generated datasets and quality of biomedical knowledge, especially for diseases in long-tailed distributions in biological systems. The development of techniques such as adversarial debiasing and fair representation learning offers promising avenues to mitigate these risks. In addition, the black-box nature of these compound AI systems poses another challenge, particularly in healthcare, where interpretability is vital for clinical adoption and patient trust. To provide clearer rationales for the agents’ decisions and make them more acceptable to users, it will become crucial to incorporate interactive dialogue systems that explain agentic outputs through natural language conversations. Ethical considerations surrounding biosafety emerge as AI agents advance toward Level 3 agents. These issues intersect with ongoing debates in bioethics regarding synthetic biology, artificial organisms, and AI-driven life forms, requiring regulatory guidance and engagement from bioethicists and safety experts to ensure alignment with societal values and safety standards.
Challenges uniquely relevant for biomedical AI agents
Biomedical AI agents face several unique challenges that distinguish them from other applications of AI. While strong AI agents have the potential to mitigate some of these challenges, their implementation in biomedical research requires careful consideration. One of the primary challenges is the need for robust and reliable systems capable of reasoning, planning, and executing actions in both virtual and hybrid virtual-physical environments. For instance, natural language reasoning chains can enhance the interpretability of an agent’s actions and contextual outcomes, aiding researchers in understanding AI-generated insights. However, certain challenges persist that can delay the reliable implementation of AI agents or even cause harm if these systems are deployed prematurely. A critical issue is the difficulty in distinguishing between correlation and causality. Current AI agents often struggle with generating strong hypotheses, reasoning, and conducting experimental validations, tasks that typically require advanced AI systems (Level 3 agents) or human intervention. Moreover, AI agents need improved interfaces to interact safely and effectively with high-throughput experimental platforms. These platforms themselves face limitations in producing unbiased, AI-ready datasets that accurately capture the intra- and inter-variation inherent in biological systems. Such limitations hinder the generalization capabilities of AI agents, which rely on comprehensive and high-quality data to function optimally. The absence of data from high-throughput techniques can lead to AI agents forming false hypotheses or causing harm. This risk is exacerbated when AI agents work with small, biased biological datasets, which may be affected by issues like batch effects.
Outlook
Biomedical research is undergoing a transformative era with advances in computational intelligence. Presently, AI’s role is constrained to assistive tools in low-stake and narrow tasks where scientists can review the results. We outline agent-based AI to pave the way for systems capable of reflective learning and reasoning that consist of LLM-based systems and other ML tools, experimental platforms, humans, or even combinations of them. The continual nature of human-AI interaction and building trustworthy sandboxes [199], where AI agents can fail and learn from their mistakes, is one way to achieve this. This involves developing AI agents proficient in various tasks, such as planning discovery workflows with machine learning feedback loops for experiments and performing self-assessment to identify and seek out gaps in their knowledge, fostering natural and artificial intelligence.
Ensuring context-appropriate and user-specific agent behavior
To ensure agents behave as intended, it is essential to focus on their robustness and reliability by implementing evaluation protocols that test agents in diverse scenarios to identify potential vulnerabilities. Moreover, grounding agents in ethical guidelines and documentation, such as lab protocols and safety guidelines, is vital to align their actions with human values and safety standards. By addressing these aspects, we can ensure that the behavior of biomedical agents is both reliable and ethically compliant.
Concretely, we believe that in the early stages of technological adaptation, it is desirable to limit an agent’s capabilities to a subset of their full potential by restricting action spaces, thereby eliminating the chance of catastrophic risk (e.g., decisions resulting in loss of life). Similar precedents for technological adaptation are already in place for other autonomous systems with similar risk profiles, such as autonomous driving, where a staggered technological adaptation is motivated by ethical considerations.
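One simple way to restrict an action space is an explicit allowlist with a human approval tier, sketched below. The action names, risk tiers, and the `gate` function are hypothetical and chosen only for illustration; the sketch is not a prescribed design.

```python
from dataclasses import dataclass
from enum import Enum

class Risk(Enum):
    LOW = 1
    HIGH = 2

# Hypothetical action catalogue: anything not listed is outside the action space.
ALLOWED_ACTIONS = {
    "query_database": Risk.LOW,
    "run_simulation": Risk.LOW,
    "dispense_reagent": Risk.HIGH,
}

@dataclass
class Decision:
    action: str
    status: str  # "execute", "needs_human_approval", or "rejected"

def gate(action: str, human_approved: bool = False) -> Decision:
    """Decide whether a requested action may run under the restricted action space."""
    risk = ALLOWED_ACTIONS.get(action)
    if risk is None:
        # Unlisted actions are never executed, whatever the agent proposes.
        return Decision(action, "rejected")
    if risk is Risk.HIGH and not human_approved:
        return Decision(action, "needs_human_approval")
    return Decision(action, "execute")
```

Under this scheme, widening an agent's capabilities over time amounts to adding entries to the catalogue or lowering their risk tier, which keeps each staged expansion explicit and auditable.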
Governance and responsible human-AI partnership
Managing errors requires designing strategies to diagnose, localize, and mitigate them. To diagnose errors internally, agents should use their reasoning abilities to build self-evaluation schemes, allowing them to assess their current status and actions. Externally, training independent anomaly detection and distribution shift models with domain knowledge of specific biomedical use cases can provide additional supervision to diagnose errors. Iterative agent interactions can result in cascading errors. To mitigate this, the evaluation agent can apply reverse reasoning chains to trace back to the initial error. Enhancing the adaptive reasoning abilities of agents is crucial for dynamically adjusting to changing conditions and rectifying errors as they occur.
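The idea of tracing a cascading error back to its origin can be illustrated with a small sketch. It assumes, hypothetically, that each step of an agent pipeline records its output and that an evaluation agent has a validity check for each step; the step names and checks below are invented for the example.

```python
def first_failing_step(steps, checks):
    """Walk backward from the final step to locate the earliest failure.

    steps: list of (name, output) in execution order.
    checks: dict mapping step name -> predicate returning True if output is valid.
    Returns the name of the most upstream step in the failing tail, or None.
    """
    culprit = None
    for name, output in reversed(steps):
        check = checks.get(name)
        if check is None:
            continue  # no validator for this step; keep walking upstream
        if check(output):
            break  # reached a step whose output is valid; stop tracing
        culprit = name
    return culprit

# Hypothetical pipeline trace with made-up outputs.
trace = [
    ("normalize_readouts", [0.2, 0.5, 0.9]),
    ("fit_dose_response", {"ic50": -3.0}),  # a negative IC50 is not meaningful
    ("rank_compounds", []),                 # empty ranking caused by the bad fit
]
checks = {
    "normalize_readouts": lambda values: all(0.0 <= v <= 1.0 for v in values),
    "fit_dose_response": lambda fit: fit["ic50"] > 0,
    "rank_compounds": lambda ranking: len(ranking) > 0,
}
origin = first_failing_step(trace, checks)
```

Here the empty ranking is the visible symptom, while the backward walk attributes it to the dose-response fit, the earliest step whose output fails its check.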
In addressing the challenge of governance, it is our view that the required broad consensus is best reached in multi-disciplinary, cross-partisan, non-profit, and public institutions with the objective of public good. In this regard, we welcome the recent establishment of several public AI-focused safety institutions to facilitate that discussion. Our concrete advice for biomedical AI agents would be to establish focus groups or select committees with the required expertise that can define necessary ethical and technical evaluation standards, based on which concrete regulation can be determined (e.g., a necessary degree of human oversight and accountability structures). We furthermore advocate for standards and policies to be developed in the broadest international institutions possible to reduce the chance of risks simply being outsourced to jurisdictions with no existing or unenforceable regulations.
By fostering responsible human-AI partnerships and robust governance frameworks, we can harness the transformative potential of AI agents in biomedical research. This collaborative approach can pave the way for groundbreaking advances, eventually enhancing human health and well-being.
Declaration of interests
The authors declare no competing interests.
Acknowledgments
We gratefully acknowledge the support of NIH R01-HD108794, NSF CAREER 2339524, US DoD FA8702-15-D-0001, awards from Harvard Data Science Initiative, Amazon Faculty Research, Google Research Scholar Program, AstraZeneca Research, Roche Alliance with Distinguished Scientists, Sanofi iDEA-iTECH Award, Pfizer Research, Chan Zuckerberg Initiative, John and Virginia Kaneb Fellowship award at Harvard Medical School, Aligning Science Across Parkinson’s (ASAP) Initiative, Biswas Computational Biology Initiative in partnership with the Milken Institute, and Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University. A.F. is supported by the Kempner Institute Graduate Fellowship. A.N. is supported by the Herchel Smith-Harvard Undergraduate Science Fellowship, the Yun Family Research Fellows Fund for Revolutionary Thinking, and the Summer Institute in Biomedical Informatics at Harvard Medical School. V.G. is supported by the Medical Research Council, MR/W00710X/1. Y.E. is supported by grant T32 HG002295 from the National Human Genome Research Institute and the NSDEG fellowship. The authors would like to thank Owen Queen, Alejandro Velez-Arce, and Ruth Johnson for their constructive comments on the draft manuscript. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funders.
Author contributions
All authors contributed to the design and writing of the manuscript, helped shape the research, provided critical feedback, and commented on the manuscript and its revisions. M.Z. conceived the study and was in charge of overall direction and planning.
Autonomy levels | Biomedical discovery: Hypothesis | Biomedical discovery: Experiment | Biomedical discovery: Reasoning | Scientist-AI agent roles |
---|---|---|---|---|
Level 0: No AI agent | None | ML models perform predefined tasks, with no adaptive changes to the protocols | None | Scientist defines the hypothesis and sometimes uses the output of ML models to help with its generation; scientist defines the task to test hypothesis; scientist completes tasks |
Level 1: AI agent as an assistant | AI agent formulates simple and narrow hypotheses that are a direct composition of existing knowledge, preliminary data, or observations | Narrow design of experimental protocols and utilization of in silico and experimental tools | Strong reasoning in a selected task; multi-modal summary of findings; use of experimental data and existing knowledge | Scientist defines the hypothesis; scientist defines the series of tasks to test hypothesis; AI agent completes tasks |
Level 2: AI agent as a collaborator | AI agent generates hypotheses that are an explicit continuation of data trends and known literature | Design of rigorous experimental protocols and adept utilization of a broad range of ex silico tools; once data is collected, employ statistical and computational methods to analyze the results and interpret the data to determine whether it supports or refutes the hypothesis | Interpreting findings within existing knowledge, considering alternative explanations, and assessing the reliability and validity of the findings; synthesis of concepts beyond a summary of findings; collaborating with other researchers and undergoing peer review to validate findings and ensure that conclusions are robust and credible | Scientist proposes initial hypothesis and refines hypothesis with AI agent; AI agent defines the series of tasks to test hypothesis; AI agent completes tasks |
Level 3: AI agent as a scientist | AI agent generates creative, de novo hypotheses that are indirect extrapolations from existing knowledge | Development of experimental methods unlocking new capabilities; actively gather data through experiments, observations, or simulations using various techniques and tools to measure and record biological phenomena accurately | Based on the results and interpretations, refine experimental approaches for continuous learning and adaptation to improve the accuracy and depth of understanding; concise, informative, and clear conceptual links between findings | Scientist and AI agent together form hypothesis; AI agent defines the series of tasks to test hypothesis; AI agent completes tasks |
Autonomy levels | Genetics (mutational effect modeling) | Cell biology (drug resistance) | Chemical biology (binder design) |
---|---|---|---|
Level 0 | Statistical package to analyze a pre-selected GWAS study. | Use of ML tools for modeling cellular outcomes of drug perturbations, including cell imaging, omics, and viability. | Use of ML tools for protein structure prediction, molecular docking, and generative models for binder design. |
Level 1 | To explore potential mutational associations with disease, writes bioinformatics software for quality control and statistical analysis of genotype data from pre-fetched relevant GWAS studies. | Integrates multimodal (imaging, omics, viability) and multiscale (cellular, tissue) data to create in silico models of drug resistance. Retrieves and executes existing experimental protocols to study resistance. Analyzes raw image and omics data with predefined pipelines. | Studies a specific protein target, integrates ML tools, such as AlphaFold for structure prediction and neural networks for screening chemical libraries to find candidate chemical compounds to bind to the target. |
Level 2 | Selects GWAS studies relevant to a provided hypothesis. If none exists, it designs and executes its own study or pulls other relevant genomic data to investigate the hypothesis. | Autonomously develops and adaptively refines hypotheses about resistance mechanisms based on knowledge and real-time experimental data analytics. Designs and executes scalable and cost-effective experimental protocols with experts in the loop. | Designs binders for more challenging targets. Identifies scaffolds that bind to similar pockets and adapts them for the target. Synthesizes and tests molecules using existing experimental techniques. |
Level 3 | Initiates genomic studies and optimizes non-invasive methods of DNA collection for cost-effectiveness and ease of participant recruitment. Innovates statistical methods to identify causal variants from genotypic data and develops in vitro techniques for validating candidate gene markers in disease models. | Proactively identifies critical unresolved problems in drug resistance, proposing innovative therapeutic strategies. Performs in silico simulations of cellular dynamics in tumor contexts and under complex perturbations (combinatorial genetic and chemical perturbations under different dosing schedules). Develops novel highly multiplexed in vivo single-cell spatial technologies, enabling live tracking of gene expression, molecular interactions, and cell-cell interactions during resistance evolution. | Proposes de novo binders for an undruggable target or a poorly studied target. Designs in situ experiments to study molecular interactions. Synthesizes molecules with more complex pathways and designs and executes assays to test efficacy. |
Term | Description |
---|---|
Multi-modal foundation model | Advanced algorithms trained on multimodal datasets that can process various data types, including text, images, biological sequences, and high-dimensional tabular readouts. This training allows them to perform a broad array of tasks through few-shot fine-tuning and prompting across domains with little to no additional training |
Transformer architecture | Deep learning model architecture that uses a self-attention mechanism to capture long-range dependencies in input sequence data |
Large language model | Machine learning model with parameters on the scale of billions, trained on vast amounts of text data to understand, generate, and interact with human language on a large scale |
Generative pretraining | Strategy for training a machine learning model in an autoregressive manner to predict the next token from given data tokens, facilitating a general understanding of data sequence likelihoods |
LLM-based AI agent | AI system capable of solving complex tasks within its environment by equipping a large language model with modules for perception, interaction, memory, and reasoning |
Embodied AI agent | AI agent system that interacts with the physical world through a body. The embodiment enables the agent to learn and adapt from sensory feedback and physical interactions |
Fine-tuning | A training process of making small adjustments to a pre-trained machine learning model to improve its accuracy on a specific task or dataset |
Instruction tuning | A training strategy that fine-tunes a model using a dataset of instructions and corresponding outputs to enhance its ability to follow specific instructions |
Reinforcement learning with human feedback | A reinforcement learning strategy where an action model learns to perform tasks by receiving feedback from a reward model that mimics human preferences, guiding it to align with desired human behaviors |
Prompting | Techniques that provide specific text or other modal input instructions to guide the model in responding toward a desired output direction |
Cross-modal alignment | A training scheme to align the representation embeddings of models across various modalities |
In-context learning | Ability to perform new tasks based on a handful of examples provided within the contextual prompt, without requiring explicit model training |
Retrieval-augmented generation | Techniques that enable generative models to produce contextually relevant text by retrieving pertinent information and using it to inform the generation process |
Term | Description |
---|---|
Linkage disequilibrium | A phenomenon in which two alleles occur so often in proximity in the chromosome that their association cannot be random |
Single-nucleotide polymorphisms | Genetic variation consisting of the replacement of a single nucleotide in the DNA sequence |
Genome-wide association study | Approach that identifies genetic variations across the entire genome associated with a specific disease or complex trait |
Pharmacogenetics | Field of research that aims to understand individuals’ responses to different drugs based on their genetic factors |
In vitro experiment | Procedures and investigations that occur within a laboratory environment (e.g., in a test tube) and outside of living organisms |
In silico modeling | The use of computers to build simulations or experiments that recreate complex biological phenomena in order to be able to study and predict specific behaviors. For example, modeling of molecular dynamics |
Mass spectrometry | Analytical tools to characterize and identify individual molecules based on specific properties (e.g., mass-to-charge ratio) |
Molecular docking | Computational simulation tools used to predict how ligands bind to receptors |
Retro-synthesis | Techniques to design the synthesis of complex molecules by starting from the target and moving back to the original compounds |
Crystallography | Field of science studying the arrangement of atoms and molecules in crystals, which are solid materials whose constituents are arranged in a highly regular, repeating pattern |
Cryo-electron microscopy | Imaging techniques used to identify the 3D structure of bio-molecules with near-atomic resolution without the need for extensive sample preparation and with the overall preservation of the sample |
References
- Boiko et al. 2023 Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023.
- Bran et al. 2023 Andres M Bran, Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew White, and Philippe Schwaller. Augmenting large language models with chemistry tools. In NeurIPS 2023 AI for Science Workshop, 2023. URL https://openreview.net/forum?id=wdGIL6lx3l.
- Xi et al. 2023 Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864, 2023.
- Guo et al. 2024 Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680, 2024.
- Wang et al. 2023a Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al. Scientific discovery in the age of artificial intelligence. Nature, 620(7972):47–60, 2023a.
- Touvron et al. 2023 Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Team et al. 2023 Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- Radford et al. 2018 Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
- Vemprala et al. 2023 Sai Vemprala, Rogerio Bonatti, Arthur Bucker, and Ashish Kapoor. ChatGPT for robotics: Design principles and model abilities. Microsoft Auton. Syst. Robot. Res, 2:20, 2023.
- Gravitas 2023 Significant Gravitas. AutoGPT, 2023. URL https://agpt.co.
- Yao et al. 2023a Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=WE_vluYUL-X.
- Shinn et al. 2023 Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Wu et al. 2023a Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023a.
- Singh et al. 2023 Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. ProgPrompt: Generating situated robot task plans using large language models. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11523–11530, 2023. doi: 10.1109/ICRA48891.2023.10161317.
- Huang et al. 2022a Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International Conference on Machine Learning, pages 9118–9147. PMLR, 2022a.
- Krenn et al. 2022 Mario Krenn, Robert Pollice, Si Yue Guo, Matteo Aldeghi, Alba Cervera-Lierta, Pascal Friederich, Gabriel dos Passos Gomes, Florian Häse, Adrian Jinich, AkshatKumar Nigam, et al. On scientific understanding with artificial intelligence. Nature Reviews Physics, 4(12):761–769, 2022.
- Sun et al. 2024 Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, et al. TrustLLM: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561, 2024.
- Kotha et al. 2023 Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan. Understanding catastrophic forgetting in language models via implicit inference. arXiv preprint arXiv:2309.10105, 2023.
- Li et al. 2023a Hanzhou Li, John T Moon, Saptarshi Purkayastha, Leo Anthony Celi, Hari Trivedi, and Judy W Gichoya. Ethics of large language models in medicine and medical research. The Lancet Digital Health, 5(6):e333–e335, 2023a.
- Goetz et al. 2023 Lea Goetz, Markus Trengove, Artem Trotsyuk, and Carole A Federico. Unreliable LLM bioethics assistants: Ethical and pedagogical risks. The American Journal of Bioethics, 23(10):89–91, 2023.
- Kumar et al. 2024 Ashutosh Kumar, Sagarika Singh, Shiv Vignesh Murty, and Swathy Ragupathy. The ethics of interaction: Mitigating security threats in LLMs. arXiv preprint arXiv:2401.12273, 2024.
- Sachs et al. 2005 Karen Sachs, Omar Perez, Dana Pe’er, Douglas A Lauffenburger, and Garry P Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529, 2005.
- Rao et al. 2021 Roshan M Rao, Jason Liu, Robert Verkuil, Joshua Meier, John Canny, Pieter Abbeel, Tom Sercu, and Alexander Rives. MSA Transformer. In International Conference on Machine Learning, pages 8844–8856. PMLR, 2021.
- Lin et al. 2023 Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023.
- Baek et al. 2021 Minkyung Baek, Frank DiMaio, Ivan Anishchenko, Justas Dauparas, Sergey Ovchinnikov, Gyu Rie Lee, Jue Wang, Qian Cong, Lisa N Kinch, R Dustin Schaeffer, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373(6557):871–876, 2021.
- Alipanahi et al. 2015 Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 33(8):831–838, 2015.
- Theodoris et al. 2023 Christina V Theodoris, Ling Xiao, Anant Chopra, Mark D Chaffin, Zeina R Al Sayed, Matthew C Hill, Helene Mantineo, Elizabeth M Brydon, Zexian Zeng, X Shirley Liu, et al. Transfer learning enables predictions in network biology. Nature, pages 1–9, 2023.
- Yu et al. 2016 Michael Ku Yu, Michael Kramer, Janusz Dutkowski, Rohith Srivas, Katherine Licon, Jason F Kreisberg, Cherie T Ng, Nevan Krogan, Roded Sharan, and Trey Ideker. Translation of genotype to phenotype by a hierarchy of cell subsystems. Cell systems, 2(2):77–88, 2016.
- Singh et al. 2002 Dinesh Singh, Phillip G Febbo, Kenneth Ross, Donald G Jackson, Judith Manola, Christine Ladd, Pablo Tamayo, Andrew A Renshaw, Anthony V D’Amico, Jerome P Richie, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer cell, 1(2):203–209, 2002.
- Shipp et al. 2002 Margaret A Shipp, Ken N Ross, Pablo Tamayo, Andrew P Weng, Jeffery L Kutok, Ricardo CT Aguiar, Michelle Gaasenbeek, Michael Angelo, Michael Reich, Geraldine S Pinkus, et al. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine, 8(1):68–74, 2002.
- Kuenzi et al. 2020 Brent M Kuenzi, Jisoo Park, Samson H Fong, Kyle S Sanchez, John Lee, Jason F Kreisberg, Jianzhu Ma, and Trey Ideker. Predicting drug response and synergy using a deep learning model of human cancer cells. Cancer cell, 38(5):672–684, 2020.
- Ren et al. 2024 Feng Ren, Alex Aliper, Jian Chen, Heng Zhao, Sujata Rao, Christoph Kuppe, Ivan V. Ozerov, Man Zhang, Klaus Witte, Chris Kruse, Vladimir Aladinskiy, Yan Ivanenkov, Daniil Polykovskiy, Yanyun Fu, Eugene Babin, Junwen Qiao, Xing Liang, Zhenzhen Mou, Hui Wang, Frank W. Pun, Pedro Torres Ayuso, Alexander Veviorskiy, Dandan Song, Sang Liu, Bei Zhang, Vladimir Naumov, Xiaoqiang Ding, Andrey Kukharenko, Evgeny Izumchenko, and Alex Zhavoronkov. A small-molecule TNIK inhibitor targets fibrosis in preclinical and clinical models. Nature Biotechnology, March 2024. ISSN 1087-0156, 1546-1696. doi: 10.1038/s41587-024-02143-0. URL https://www.nature.com/articles/s41587-024-02143-0.
- Stokes et al. 2020 Jonathan M Stokes, Kevin Yang, Kyle Swanson, Wengong Jin, Andres Cubillos-Ruiz, Nina M Donghia, Craig R MacNair, Shawn French, Lindsey A Carfrae, Zohar Bloom-Ackermann, et al. A deep learning approach to antibiotic discovery. Cell, 180(4):688–702, 2020.
- Berman et al. 2000 Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig, Ilya N Shindyalov, and Philip E Bourne. The protein data bank. Nucleic acids research, 28(1):235–242, 2000.
- 1000 Genomes Project Consortium et al. 2012 1000 Genomes Project Consortium et al. An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422):56, 2012.
- Wishart et al. 2006 David S Wishart, Craig Knox, An Chi Guo, Savita Shrivastava, Murtaza Hassanali, Paul Stothard, Zhan Chang, and Jennifer Woolsey. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Research, 34(suppl_1):D668–D672, 2006.
- Varadi et al. 2022 Mihaly Varadi, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, Oana Stroe, Gemma Wood, Agata Laydon, et al. AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Research, 50(D1):D439–D444, 2022.
- Jumper et al. 2021 John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
- Brin and Page 1998 Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems, 30(1-7):107–117, 1998.
- Altschul et al. 1990 Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. Journal of molecular biology, 215(3):403–410, 1990.
- Gaulton et al. 2017 Anna Gaulton, Anne Hersey, Michał Nowotka, A Patricia Bento, Jon Chambers, David Mendez, Prudence Mutowo, Francis Atkinson, Louisa J Bellis, Elena Cibrián-Uhalte, et al. The ChEMBL database in 2017. Nucleic Acids Research, 45(D1):D945–D954, 2017.
- Van Kempen et al. 2023 Michel Van Kempen, Stephanie S Kim, Charlotte Tumescheit, Milot Mirdita, Jeongjae Lee, Cameron LM Gilchrist, Johannes Söding, and Martin Steinegger. Fast and accurate protein structure search with Foldseek. Nature Biotechnology, pages 1–4, 2023.
- Zhang et al. 2023a Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren’s song in the AI ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023a.
- Lála et al. 2023 Jakub Lála, Odhran O’Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G Rodriques, and Andrew D White. PaperQA: Retrieval-augmented generative agent for scientific research. arXiv preprint arXiv:2312.07559, 2023.
- Krizhevsky et al. 2012 Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
- He et al. 2016 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Vaswani et al. 2017 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Hernández-García et al. 2024 Alex Hernández-García, Nikita Saxena, Moksh Jain, Cheng-Hao Liu, and Yoshua Bengio. Multi-fidelity active learning with GFlowNets, 2024. URL https://openreview.net/forum?id=3QR230r11w.
- Ouyang et al. 2022 Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=TG8KACxEON.
- Zhavoronkov et al. 2019 Alex Zhavoronkov, Yan A Ivanenkov, Alex Aliper, Mark S Veselov, Vladimir A Aladinskiy, Anastasiya V Aladinskaya, Victor A Terentiev, Daniil A Polykovskiy, Maksim D Kuznetsov, Arip Asadulaev, et al. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nature Biotechnology, 37(9):1038–1040, 2019.
- Hie and Yang 2022 Brian L. Hie and Kevin K. Yang. Adaptive machine learning for protein engineering. Current Opinion in Structural Biology, 72:145–152, February 2022. ISSN 0959440X. doi: 10.1016/j.sbi.2021.11.002. URL https://linkinghub.elsevier.com/retrieve/pii/S0959440X21001457.
- Lutz et al. 2023 Isaac D. Lutz, Shunzhi Wang, Christoffer Norn, Alexis Courbet, Andrew J. Borst, Yan Ting Zhao, Annie Dosey, Longxing Cao, Jinwei Xu, Elizabeth M. Leaf, Catherine Treichel, Patrisia Litvicov, Zhe Li, Alexander D. Goodson, Paula Rivera-Sánchez, Ana-Maria Bratovianu, Minkyung Baek, Neil P. King, Hannele Ruohola-Baker, and David Baker. Top-down design of protein architectures with reinforcement learning. Science, 380(6642):266–273, April 2023. ISSN 0036-8075, 1095-9203. doi: 10.1126/science.adf6591.
- Bailey et al. 2023 Michael Bailey, Saeed Moayedpour, Ruijiang Li, Alejandro Corrochano-Navarro, Alexander Kötter, Lorenzo Kogler-Anele, Saleh Riahi, Christoph Grebner, Gerhard Hessler, Hans Matter, et al. Deep batch active learning for drug discovery. bioRxiv, pages 2023–07, 2023.
- Soleimany et al. 2021 Ava P. Soleimany, Alexander Amini, Samuel Goldman, Daniela Rus, Sangeeta N. Bhatia, and Connor W. Coley. Evidential deep learning for guided molecular property prediction and discovery. ACS Central Science, 7(8):1356–1367, August 2021. ISSN 2374-7943, 2374-7951. doi: 10.1021/acscentsci.1c00546.
- Zhang et al. 2023b Jiaqi Zhang, Louis Cammarata, Chandler Squires, Themistoklis P. Sapsis, and Caroline Uhler. Active learning for optimal intervention design in causal models. Nature Machine Intelligence, 5(10):1066–1075, October 2023b. ISSN 2522-5839. doi: 10.1038/s42256-023-00719-0.
- Yala et al. 2022 Adam Yala, Peter G. Mikhael, Constance Lehman, Gigin Lin, Fredrik Strand, Yung-Liang Wan, Kevin Hughes, Siddharth Satuluru, Thomas Kim, Imon Banerjee, Judy Gichoya, Hari Trivedi, and Regina Barzilay. Optimizing risk-based breast cancer screening policies with reinforcement learning. Nature Medicine, 28(1):136–143, January 2022. ISSN 1078-8956, 1546-170X. doi: 10.1038/s41591-021-01599-w. URL https://www.nature.com/articles/s41591-021-01599-w.
- Sumers et al. 2024 Theodore Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas Griffiths. Cognitive architectures for language agents. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=1i6ZCvflQJ.
- Wang et al. 2024 Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):1–26, 2024.
- Wei et al. 2022a Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022a. URL https://openreview.net/forum?id=gEZrGCozdqR.
- Huang et al. 2023a Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J Montine, and James Zou. A visual–language foundation model for pathology image analysis using medical twitter. Nature medicine, 29(9):2307–2316, 2023a.
- Nori et al. 2023 Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, et al. Can generalist foundation models outcompete special-purpose tuning? case study in medicine. arXiv preprint arXiv:2311.16452, 2023.
- Park et al. 2023 Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.
- Luo et al. 2022 Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie-Yan Liu. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 23(6):bbac409, 2022.
- Jiang et al. 2023 Lavender Yao Jiang, Xujin Chris Liu, Nima Pour Nejatian, Mustafa Nasir-Moin, Duo Wang, Anas Abidin, Kevin Eaton, Howard Antony Riina, Ilya Laufer, Paawan Punjabi, et al. Health system-scale language models are all-purpose prediction engines. Nature, pages 1–6, 2023.
- Singhal et al. 2023a Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023a.
- Singhal et al. 2023b Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617, 2023b.
- Brown et al. 2020 Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020.
- Wang et al. 2023b Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. In Intrinsically-Motivated and Open-Ended Learning Workshop @NeurIPS2023, 2023b. URL https://openreview.net/forum?id=nfx5IutEed.
- Fernando et al. 2024 Chrisantha Fernando, Dylan Sunil Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution, 2024. URL https://openreview.net/forum?id=HKkiX32Zw1.
- Yang et al. 2023 Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023.
- LeCun 2022 Yann LeCun. A path towards autonomous machine intelligence. Open Review, 62(1), 2022.
- Liang et al. 2023a Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Smith, Yian Yin, et al. Can large language models provide useful feedback on research papers? a large-scale empirical analysis. arXiv preprint arXiv:2310.01783, 2023a.
- Chen et al. 2023 Justin Chih-Yao Chen, Swarnadeep Saha, and Mohit Bansal. ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs. arXiv preprint arXiv:2309.13007, 2023.
- Sanders et al. 2023 Lauren M Sanders, Ryan T Scott, Jason H Yang, Amina Ann Qutub, Hector Garcia Martin, Daniel C Berrios, Jaden JA Hastings, Jon Rask, Graham Mackintosh, Adrienne L Hoarfrost, et al. Biological research and self-driving labs in deep space supported by artificial intelligence. Nature Machine Intelligence, 5(3):208–219, 2023.
- Davies et al. 2021 Alex Davies, Petar Veličković, Lars Buesing, Sam Blackwell, Daniel Zheng, Nenad Tomašev, Richard Tanburn, Peter Battaglia, Charles Blundell, András Juhász, et al. Advancing mathematics by guiding human intuition with AI. Nature, 600(7887):70–74, 2021.
- Tshitoyan et al. 2019 Vahe Tshitoyan, John Dagdelen, Leigh Weston, Alexander Dunn, Ziqin Rong, Olga Kononova, Kristin A Persson, Gerbrand Ceder, and Anubhav Jain. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature, 571(7763):95–98, 2019.
- Jablonka et al. 2024 Kevin Maik Jablonka, Philippe Schwaller, Andres Ortega-Guerrero, and Berend Smit. Leveraging large language models for predictive chemistry. Nature Machine Intelligence, 6(2):161–169, 2024.
- Glass and Hall 2008 David J Glass and Ned Hall. A brief history of the hypothesis. Cell, 134(3):378–381, 2008.
- Lim et al. 2023 Yang Lim, Lukas Tamayo-Orrego, Ernst Schmid, Zygimante Tarnauskaite, Olga V Kochenova, Rhian Gruar, Sachiko Muramatsu, Luke Lynch, Aitana Verdu Schlie, Paula L Carroll, et al. In silico protein interaction screening uncovers DONSON’s role in replication initiation. Science, 381(6664):eadi3448, 2023.
- Wei et al. 2022b Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022b.
- Zhou et al. 2023a Juexiao Zhou, Bin Zhang, Xiuying Chen, Haoyang Li, Xiaopeng Xu, Siyuan Chen, and Xin Gao. Automated bioinformatics analysis via AutoBA. arXiv preprint arXiv:2309.03242, 2023a.
- Tang et al. 2023 Xiangru Tang, Anni Zou, Zhuosheng Zhang, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. MedAgents: Large language models as collaborators for zero-shot medical reasoning. arXiv preprint arXiv:2311.10537, 2023.
- Hu et al. 2023a Xiuyuan Hu, Guoqing Liu, Yang Zhao, and Hao Zhang. De novo drug design using reinforcement learning with multiple GPT agents. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a.
- Morris et al. 2023 Meredith Ringel Morris, Jascha Sohl-dickstein, Noah Fiedel, Tris Warkentin, Allan Dafoe, Aleksandra Faust, Clement Farabet, and Shane Legg. Levels of AGI: Operationalizing progress on the path to AGI. arXiv preprint arXiv:2311.02462, 2023.
- Urbina et al. 2022 Fabio Urbina, Filippa Lentzos, Cédric Invernizzi, and Sean Ekins. Dual use of artificial-intelligence-powered drug discovery. Nature Machine Intelligence, 4(3):189–191, March 2022. ISSN 2522-5839. doi: 10.1038/s42256-022-00465-9. URL https://www.nature.com/articles/s42256-022-00465-9.
- Tang et al. 2024 Xiangru Tang, Qiao Jin, Kunlun Zhu, Tongxin Yuan, Yichi Zhang, Wangchunshu Zhou, Meng Qu, Yilun Zhao, Jian Tang, Zhuosheng Zhang, Arman Cohan, Zhiyong Lu, and Mark Gerstein. Prioritizing safeguarding over autonomy: Risks of LLM agents for science. arXiv preprint arXiv:2402.04247, 2024.
- Baker and Church 2024 David Baker and George Church. Protein design meets biosecurity, 2024.
- Marees et al. 2018 Andries T Marees, Hilde de Kluiver, Sven Stringer, Florence Vorspan, Emmanuel Curis, Cynthia Marie-Claire, and Eske M Derks. A tutorial on conducting genome-wide association studies: Quality control and statistical analysis. Int. J. Methods Psychiatr. Res., 27(2):e1608, June 2018.
- Uffelmann et al. 2021 Emil Uffelmann, Qin Qin Huang, Nchangwi Syntia Munung, Jantina de Vries, Yukinori Okada, Alicia R. Martin, Hilary C. Martin, Tuuli Lappalainen, and Danielle Posthuma. Genome-wide association studies. Nature Reviews Methods Primers, 1(1):59, Aug 2021. ISSN 2662-8449. doi: 10.1038/s43586-021-00056-9. URL https://doi.org/10.1038/s43586-021-00056-9.
- Frueh 2010 Felix W Frueh. Real-world clinical effectiveness, regulatory transparency and payer coverage: three ingredients for translating pharmacogenomics into clinical practice. Pharmacogenomics, 11(5):657–660, May 2010.
- Panayiotopoulos 2005 CP Panayiotopoulos. The Epilepsies: Seizures, Syndromes and Management. Bladon Medical Publishing, Oxfordshire, UK, 2005.
- International League Against Epilepsy Consortium on Complex Epilepsies 2018 International League Against Epilepsy Consortium on Complex Epilepsies. GWAS meta-analysis of over 29,000 people with epilepsy identifies 26 risk loci and subtype-specific genetic architecture. Nature Communications, 9(1):1–12, 2018.
- Sudlow et al. 2015 Cathie Sudlow, John Gallacher, Naomi Allen, Valerie Beral, Paul Burton, John Danesh, Paul Downey, Paul Elliott, Jane Green, Martin Landray, et al. UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12(3):e1001779, 2015.
- Gamirova et al. 2024 Regina Gamirova, Elena Shagimardanova, Takehiro Sato, Takayuki Kannon, Rimma Gamirova, and Atsushi Tajima. Identification of potential disease-associated variants in idiopathic generalized epilepsy using targeted sequencing. Journal of Human Genetics, 69(2):59–67, Feb 2024. ISSN 1435-232X. doi: 10.1038/s10038-023-01208-3. URL https://doi.org/10.1038/s10038-023-01208-3.
- Oliver et al. 2023 Karen L Oliver, Ingrid E Scheffer, Mark F Bennett, Bronwyn E Grinton, Melanie Bahlo, and Samuel F Berkovic. Genes4Epilepsy: An epilepsy gene resource. Epilepsia, 64(5):1368–1375, May 2023.
- Salowe et al. 2022 Rebecca J Salowe, Roy Lee, Selam Zenebe-Gete, Marquis Vaughn, Harini V Gudiseva, Maxwell Pistilli, Ava Kikut, Emily Becker, David W Collins, Jie He, Sayaka Merriam, Kristen Mulvihill, Nora Laberee, Sara Lomax-Reese, Windell Murphy, Jeffrey Henderer, Venkata R M Chavali, Qi N Cui, Ahmara G Ross, Victoria Addis, Prithvi S Sankar, Eydie Miller-Ellis, Maureen G Maguire, and Joan M O’Brien. Recruitment strategies and lessons learned from a large genetic study of African Americans. PLOS Glob. Public Health, 2(8):e0000416, August 2022.
- Aissani 2014 Brahim Aissani. Confounding by linkage disequilibrium. Journal of Human Genetics, 59(2):110–115, 2014. ISSN 1435-232X. doi: 10.1038/jhg.2013.130. URL https://doi.org/10.1038/jhg.2013.130.
- Regev et al. 2017 Aviv Regev, Sarah A Teichmann, Eric S Lander, Ido Amit, Christophe Benoist, Ewan Birney, Bernd Bodenmiller, Peter Campbell, Piero Carninci, Menna Clatworthy, Hans Clevers, Bart Deplancke, Ian Dunham, James Eberwine, Roland Eils, Wolfgang Enard, Andrew Farmer, Lars Fugger, Berthold Göttgens, Nir Hacohen, Muzlifah Haniffa, Martin Hemberg, Seung Kim, Paul Klenerman, Arnold Kriegstein, Ed Lein, Sten Linnarsson, Emma Lundberg, Joakim Lundeberg, Partha Majumder, John C Marioni, Miriam Merad, Musa Mhlanga, Martijn Nawijn, Mihai Netea, Garry Nolan, Dana Pe’er, Anthony Phillipakis, Chris P Ponting, Stephen Quake, Wolf Reik, Orit Rozenblatt-Rosen, Joshua Sanes, Rahul Satija, Ton N Schumacher, Alex Shalek, Ehud Shapiro, Padmanee Sharma, Jay W Shin, Oliver Stegle, Michael Stratton, Michael J T Stubbington, Fabian J Theis, Matthias Uhlen, Alexander Van Oudenaarden, Allon Wagner, Fiona Watt, Jonathan Weissman, Barbara Wold, Ramnik Xavier, Nir Yosef, and Human Cell Atlas Meeting Participants. The human cell atlas. eLife, 6:e27041, December 2017. ISSN 2050-084X. doi: 10.7554/eLife.27041. URL https://elifesciences.org/articles/27041.
- Subramanian et al. 2017 Aravind Subramanian, Rajiv Narayan, Steven M. Corsello, David D. Peck, Ted E. Natoli, Xiaodong Lu, Joshua Gould, John F. Davis, Andrew A. Tubelli, Jacob K. Asiedu, David L. Lahr, Jodi E. Hirschman, Zihan Liu, Melanie Donahue, Bina Julian, Mariya Khan, David Wadden, Ian C. Smith, Daniel Lam, Arthur Liberzon, Courtney Toder, Mukta Bagul, Marek Orzechowski, Oana M. Enache, Federica Piccioni, Sarah A. Johnson, Nicholas J. Lyons, Alice H. Berger, Alykhan F. Shamji, Angela N. Brooks, Anita Vrcic, Corey Flynn, Jacqueline Rosains, David Y. Takeda, Roger Hu, Desiree Davison, Justin Lamb, Kristin Ardlie, Larson Hogstrom, Peyton Greenside, Nathanael S. Gray, Paul A. Clemons, Serena Silver, Xiaoyun Wu, Wen-Ning Zhao, Willis Read-Button, Xiaohua Wu, Stephen J. Haggarty, Lucienne V. Ronco, Jesse S. Boehm, Stuart L. Schreiber, John G. Doench, Joshua A. Bittker, David E. Root, Bang Wong, and Todd R. Golub. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell, 171(6):1437–1452.e17, November 2017. ISSN 00928674. doi: 10.1016/j.cell.2017.10.049. URL https://linkinghub.elsevier.com/retrieve/pii/S0092867417313090.
- Mitchell et al. 2023 Dylan C. Mitchell, Miljan Kuljanin, Jiaming Li, Jonathan G. Van Vranken, Nathan Bulloch, Devin K. Schweppe, Edward L. Huttlin, and Steven P. Gygi. A proteome-wide atlas of drug mechanism of action. Nature Biotechnology, 41(6):845–857, June 2023. ISSN 1087-0156, 1546-1696. doi: 10.1038/s41587-022-01539-0. URL https://www.nature.com/articles/s41587-022-01539-0.
- Ghandi et al. 2019 Mahmoud Ghandi, Franklin W. Huang, Judit Jané-Valbuena, Gregory V. Kryukov, Christopher C. Lo, E. Robert McDonald, Jordi Barretina, Ellen T. Gelfand, Craig M. Bielski, Haoxin Li, Kevin Hu, Alexander Y. Andreev-Drakhlin, Jaegil Kim, Julian M. Hess, Brian J. Haas, François Aguet, Barbara A. Weir, Michael V. Rothberg, Brenton R. Paolella, Michael S. Lawrence, Rehan Akbani, Yiling Lu, Hong L. Tiv, Prafulla C. Gokhale, Antoine De Weck, Ali Amin Mansour, Coyin Oh, Juliann Shih, Kevin Hadi, Yanay Rosen, Jonathan Bistline, Kavitha Venkatesan, Anupama Reddy, Dmitriy Sonkin, Manway Liu, Joseph Lehar, Joshua M. Korn, Dale A. Porter, Michael D. Jones, Javad Golji, Giordano Caponigro, Jordan E. Taylor, Caitlin M. Dunning, Amanda L. Creech, Allison C. Warren, James M. McFarland, Mahdi Zamanighomi, Audrey Kauffmann, Nicolas Stransky, Marcin Imielinski, Yosef E. Maruvka, Andrew D. Cherniack, Aviad Tsherniak, Francisca Vazquez, Jacob D. Jaffe, Andrew A. Lane, David M. Weinstock, Cory M. Johannessen, Michael P. Morrissey, Frank Stegmeier, Robert Schlegel, William C. Hahn, Gad Getz, Gordon B. Mills, Jesse S. Boehm, Todd R. Golub, Levi A. Garraway, and William R. Sellers. Next-generation characterization of the cancer cell line encyclopedia. Nature, 569(7757):503–508, May 2019. ISSN 0028-0836, 1476-4687. doi: 10.1038/s41586-019-1186-3. URL https://www.nature.com/articles/s41586-019-1186-3.
- Chandrasekaran et al. 2022 Srinivas Niranj Chandrasekaran, Beth A Cimini, Amy Goodale, Lisa Miller, Maria Kost-Alimova, Nasim Jamali, John G Doench, Briana Fritchman, Adam Skepner, Michelle Melanson, et al. Three million images and morphological profiles of cells treated with matched chemical and genetic perturbations. bioRxiv, pages 2022–01, 2022.
- De Teresa-Trueba et al. 2023 Irene De Teresa-Trueba, Sara K. Goetz, Alexander Mattausch, Frosina Stojanovska, Christian E. Zimmerli, Mauricio Toro-Nahuelpan, Dorothy W. C. Cheng, Fergus Tollervey, Constantin Pape, Martin Beck, Alba Diz-Muñoz, Anna Kreshuk, Julia Mahamid, and Judith B. Zaugg. Convolutional networks for supervised mining of molecular patterns within cellular context. Nature Methods, 20(2):284–294, February 2023. ISSN 1548-7091, 1548-7105. doi: 10.1038/s41592-022-01746-2. URL https://www.nature.com/articles/s41592-022-01746-2.
- Schiøtz et al. 2023 Oda Helene Schiøtz, Christoph JO Kaiser, Sven Klumpe, Dustin R Morado, Matthias Poege, Jonathan Schneider, Florian Beck, David P Klebl, Christopher Thompson, and Jürgen M Plitzko. Serial lift-out: sampling the molecular anatomy of whole organisms. Nature Methods, pages 1–9, 2023.
- Lundberg and Borner 2019 Emma Lundberg and Georg H. H. Borner. Spatial proteomics: a powerful discovery tool for cell biology. Nature Reviews Molecular Cell Biology, 20(5):285–302, May 2019. ISSN 1471-0072, 1471-0080. doi: 10.1038/s41580-018-0094-y. URL https://www.nature.com/articles/s41580-018-0094-y.
- Cho et al. 2022 Nathan H. Cho, Keith C. Cheveralls, Andreas-David Brunner, Kibeom Kim, André C. Michaelis, Preethi Raghavan, Hirofumi Kobayashi, Laura Savy, Jason Y. Li, Hera Canaj, James Y. S. Kim, Edna M. Stewart, Christian Gnann, Frank McCarthy, Joana P. Cabrera, Rachel M. Brunetti, Bryant B. Chhun, Greg Dingle, Marco Y. Hein, Bo Huang, Shalin B. Mehta, Jonathan S. Weissman, Rafael Gómez-Sjöberg, Daniel N. Itzhak, Loïc A. Royer, Matthias Mann, and Manuel D. Leonetti. OpenCell: Endogenous tagging for the cartography of human cellular organization. Science, 375(6585):eabi6983, March 2022. ISSN 0036-8075, 1095-9203. doi: 10.1126/science.abi6983. URL https://www.science.org/doi/10.1126/science.abi6983.
- Johnson et al. 2023 Graham T. Johnson, Eran Agmon, Matthew Akamatsu, Emma Lundberg, Blair Lyons, Wei Ouyang, Omar A. Quintero-Carmona, Megan Riel-Mehan, Susanne Rafelski, and Rick Horwitz. Building the next generation of virtual cells to understand cellular biology. Biophysical Journal, 122(18):3560–3569, September 2023. ISSN 00063495. doi: 10.1016/j.bpj.2023.04.006. URL https://linkinghub.elsevier.com/retrieve/pii/S0006349523002369.
- Li et al. 2024 Michelle M Li, Yepeng Huang, Marissa Sumathipala, Man Qing Liang, Alberto Valdeolivas, Ashwin N Ananthakrishnan, Katherine Liao, Daniel Marbach, and Marinka Zitnik. Contextualizing protein representations using deep learning on protein networks and single-cell data. Nature Methods, 2024.
- Russell et al. 2024 Andrew J. C. Russell, Jackson A. Weir, Naeem M. Nadaf, Matthew Shabet, Vipin Kumar, Sandeep Kambhampati, Ruth Raichur, Giovanni J. Marrero, Sophia Liu, Karol S. Balderrama, Charles R. Vanderburg, Vignesh Shanmugam, Luyi Tian, J. Bryan Iorgulescu, Charles H. Yoon, Catherine J. Wu, Evan Z. Macosko, and Fei Chen. Slide-tags enables single-nucleus barcoding for multimodal spatial genomics. Nature, 625(7993):101–109, January 2024. ISSN 0028-0836, 1476-4687. doi: 10.1038/s41586-023-06837-4. URL https://www.nature.com/articles/s41586-023-06837-4.
- Wik et al. 2021 Lotta Wik, Niklas Nordberg, John Broberg, Johan Björkesten, Erika Assarsson, Sara Henriksson, Ida Grundberg, Erik Pettersson, Christina Westerberg, Elin Liljeroth, Adam Falck, and Martin Lundberg. Proximity extension assay in combination with next-generation sequencing for high-throughput proteome-wide analysis. Molecular and Cellular Proteomics, 20:100168, 2021. ISSN 15359476. doi: 10.1016/j.mcpro.2021.100168. URL https://linkinghub.elsevier.com/retrieve/pii/S1535947621001407.
- Liu et al. 2023a Yang Liu, Marcello DiStasio, Graham Su, Hiromitsu Asashima, Archibald Enninful, Xiaoyu Qin, Yanxiang Deng, Jungmin Nam, Fu Gao, Pino Bordignon, Marco Cassano, Mary Tomayko, Mina Xu, Stephanie Halene, Joseph E. Craft, David Hafler, and Rong Fan. High-plex protein and whole transcriptome co-mapping at cellular resolution with spatial cite-seq. Nature Biotechnology, 41(10):1405–1409, October 2023a. ISSN 1087-0156, 1546-1696. doi: 10.1038/s41587-023-01676-0. URL https://www.nature.com/articles/s41587-023-01676-0.
- Yoshikawa et al. 2023 Naruki Yoshikawa, Kourosh Darvish, Mohammad Ghazi Vakili, Animesh Garg, and Alán Aspuru-Guzik. Digital pipette: open hardware for liquid transfer in self-driving laboratories. Digital Discovery, 2(6):1745–1751, 2023. doi: 10.1039/D3DD00115F. URL https://pubs.rsc.org/en/content/articlelanding/2023/dd/d3dd00115f.
- Dixit et al. 2016 Atray Dixit, Oren Parnas, Biyu Li, Jenny Chen, Charles P. Fulco, Livnat Jerby-Arnon, Nemanja D. Marjanovic, Danielle Dionne, Tyler Burks, Raktima Raychowdhury, Britt Adamson, Thomas M. Norman, Eric S. Lander, Jonathan S. Weissman, Nir Friedman, and Aviv Regev. Perturb-seq: Dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell, 167(7):1853–1866.e17, December 2016. ISSN 00928674. doi: 10.1016/j.cell.2016.11.038. URL https://linkinghub.elsevier.com/retrieve/pii/S0092867416316105.
- Binan et al. 2023 Loc Binan, Serwah Danquah, Vera Valakh, Brooke Simonton, Jon Bezney, Ralda Nehme, Brian Cleary, and Samouil L Farhi. Simultaneous CRISPR screening and spatial transcriptomics reveals intracellular, intercellular, and functional transcriptional circuits. bioRxiv, 2023.
- Dang et al. 2017 Chi V Dang, E Premkumar Reddy, Kevan M Shokat, and Laura Soucek. Drugging the ‘undruggable’ cancer targets. Nature Reviews Cancer, 17(8):502–508, 2017.
- Lieber et al. 2019 Toby Lieber, Swathi P Jeedigunta, Jonathan M Palozzi, Ruth Lehmann, and Thomas R Hurd. Mitochondrial fragmentation drives selective removal of deleterious mtDNA in the germline. Nature, 570(7761):380–384, 2019.
- Li et al. 2023b Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b.
- Liu et al. 2023b Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b. URL https://openreview.net/forum?id=w0H2xGHlkw.
- Chen et al. 2020 Jiakun Chen, Kira E. Poskanzer, Marc R. Freeman, and Kelly R. Monk. Live-imaging of astrocyte morphogenesis and function in zebrafish neural circuits. Nature Neuroscience, 23(10):1297–1306, October 2020. ISSN 1097-6256, 1546-1726. doi: 10.1038/s41593-020-0703-x. URL https://www.nature.com/articles/s41593-020-0703-x.
- Driess et al. 2023 Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. PaLM-e: An embodied multimodal language model. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 8469–8488. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/driess23a.html.
- Li et al. 2021 Jiaming Li, Zhenying Cai, Laura Pontano Vaites, Ning Shen, Dylan C Mitchell, Edward L Huttlin, Joao A Paulo, Brian L Harry, and Steven P Gygi. Proteome-wide mapping of short-lived proteins in human cells. Molecular Cell, 81(22):4722–4735, 2021.
- Radford et al. 2021 Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Zhu et al. 2024 Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=1tZbq88f27.
- Bavishi et al. 2023 Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. Introducing our multimodal models, 2023. URL https://www.adept.ai/blog/fuyu-8b.
- Kopp and Krämer 2021 Stefan Kopp and Nicole Krämer. Revisiting human-agent communication: The importance of joint co-construction and understanding mental states. Frontiers in Psychology, 12:580955, 2021.
- Huang et al. 2022b Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Tomas Jackson, Noah Brown, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models. In 6th Annual Conference on Robot Learning, 2022b. URL https://openreview.net/forum?id=3R3Pz5i0tye.
- Rafailov et al. 2023 Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
- Nascimento et al. 2023 Nathalia Nascimento, Paulo Alencar, and Donald Cowan. Self-adaptive large language model (llm)-based multiagent systems. In 2023 IEEE International Conference on Autonomic Computing and Self-Organizing Systems Companion (ACSOS-C), pages 104–109. IEEE, 2023.
- Lowe et al. 2017 Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems, 30, 2017.
- Hong et al. 2024 Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VtmBAGCN7o.
- Zhang et al. 2024a Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B. Tenenbaum, Tianmin Shu, and Chuang Gan. Building cooperative embodied agents modularly with large language models. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=EnXJfQqy0K.
- Liang et al. 2023b Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118, 2023b.
- Fu et al. 2023 Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. Improving language model negotiation with self-play and in-context learning from AI feedback. arXiv preprint arXiv:2305.10142, 2023.
- Mandi et al. 2023 Zhao Mandi, Shreeya Jain, and Shuran Song. Roco: Dialectic multi-robot collaboration with large language models. arXiv preprint arXiv:2307.04738, 2023.
- Saha et al. 2023 Swarnadeep Saha, Peter Hase, and Mohit Bansal. Can language models teach weaker agents? teacher explanations improve students via theory of mind. arXiv preprint arXiv:2306.09299, 2023.
- Williams et al. 2023 Ross Williams, Niyousha Hosseinichimeh, Aritra Majumdar, and Navid Ghaffarzadegan. Epidemic modeling with generative agents. arXiv preprint arXiv:2307.04986, 2023.
- Park et al. 2022 Joon Sung Park, Lindsay Popowski, Carrie Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Social simulacra: Creating populated prototypes for social computing systems. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, pages 1–18, 2022.
- Parisi et al. 2022 Aaron Parisi, Yao Zhao, and Noah Fiedel. Talm: Tool augmented language models. arXiv preprint arXiv:2205.12255, 2022.
- Schick et al. 2023 Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=Yacmpz84TH.
- Nakano et al. 2021 Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
- Shen et al. 2023 Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with chatGPT and its friends in hugging face. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=yHdTscY6Ci.
- Hu et al. 2023b Chenxu Hu, Jie Fu, Chenzhuang Du, Simian Luo, Junbo Zhao, and Hang Zhao. Chatdb: Augmenting llms with databases as their symbolic memory. arXiv preprint arXiv:2306.03901, 2023b.
- Coley et al. 2019 Connor W Coley, Dale A Thomas III, Justin AM Lummiss, Jonathan N Jaworski, Christopher P Breen, Victor Schultz, Travis Hart, Joshua S Fishman, Luke Rogers, Hanyu Gao, et al. A robotic platform for flow synthesis of organic compounds informed by AI planning. Science, 365(6453):eaax1566, 2019.
- Ahn et al. 2022 Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
- Ramesh et al. 2021 Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
- Hu et al. 2022 Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
- Qian et al. 2023 Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023.
- Zhou et al. 2023b Xuanhe Zhou, Guoliang Li, and Zhiyuan Liu. LLM as DBA. arXiv preprint arXiv:2308.05481, 2023b.
- Zhu et al. 2023 Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, et al. Ghost in the Minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv preprint arXiv:2305.17144, 2023.
- Neelakantan et al. 2022 Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, et al. Text and code embeddings by contrastive pre-training. arXiv preprint arXiv:2201.10005, 2022.
- Zhong et al. 2024a Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):19724–19731, Mar. 2024a. doi: 10.1609/aaai.v38i17.29946. URL https://ojs.aaai.org/index.php/AAAI/article/view/29946.
- Dettmers et al. 2023 Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=OUIFPHEgJU.
- Meng et al. 2022 Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. In Advances in Neural Information Processing Systems, 2022.
- Zhang et al. 2024b Ningyu Zhang, Yunzhi Yao, Bozhong Tian, Peng Wang, Shumin Deng, Mengru Wang, Zekun Xi, Shengyu Mao, Jintian Zhang, Yuansheng Ni, et al. A comprehensive study of knowledge editing for large language models. arXiv preprint arXiv:2401.01286, 2024b.
- Rana et al. 2023 Krishan Rana, Jesse Haviland, Sourav Garg, Jad Abou-Chakra, Ian Reid, and Niko Suenderhauf. Sayplan: Grounding large language models using 3d scene graphs for scalable robot task planning. In 7th Annual Conference on Robot Learning, 2023. URL https://openreview.net/forum?id=wMpOMO0Ss7a.
- Chiang et al. 2023 Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
- Li et al. 2023c Daliang Li, Ankit Singh Rawat, Manzil Zaheer, Xin Wang, Michal Lukasik, Andreas Veit, Felix Yu, and Sanjiv Kumar. Large language models with controllable working memory. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 1774–1793, Toronto, Canada, July 2023c. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.112. URL https://aclanthology.org/2023.findings-acl.112.
- Kojima et al. 2022 Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
- Liu et al. 2023c Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone. LLM+P: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477, 2023c.
- Dagan et al. 2023 Gautier Dagan, Frank Keller, and Alex Lascarides. Dynamic planning with a llm. arXiv preprint arXiv:2308.06391, 2023.
- Zhang et al. 2023c Zhuosheng Zhang, Yao Yao, Aston Zhang, Xiangru Tang, Xinbei Ma, Zhiwei He, Yiming Wang, Mark Gerstein, Rui Wang, Gongshen Liu, et al. Igniting language intelligence: The hitchhiker’s guide from chain-of-thought reasoning to language agents. arXiv preprint arXiv:2311.11797, 2023c.
- Zhong et al. 2024b Shanshan Zhong, Zhongzhan Huang, Shanghua Gao, Wushao Wen, Liang Lin, Marinka Zitnik, and Pan Zhou. Let’s think outside the box: Exploring leap-of-thought in large language models with creative humor generation. In The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2024b.
- Raman et al. 2022 Shreyas Sundara Raman, Vanya Cohen, Eric Rosen, Ifrah Idrees, David Paulius, and Stefanie Tellex. Planning with large language models via corrective re-prompting. In NeurIPS 2022 Foundation Models for Decision Making Workshop, 2022.
- Yao et al. 2023b Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b. URL https://openreview.net/forum?id=5Xc1ecxO1h.
- Wang et al. 2023c Yancheng Wang, Ziyan Jiang, Zheng Chen, Fan Yang, Yingxue Zhou, Eunah Cho, Xing Fan, Xiaojiang Huang, Yanbin Lu, and Yingzhen Yang. Recmind: Large language model powered agent for recommendation. arXiv preprint arXiv:2308.14296, 2023c.
- Zhou et al. 2023c Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H. Chi. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations, 2023c. URL https://openreview.net/forum?id=WZH7099tgfM.
- Wang et al. 2023d Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023d. URL https://openreview.net/forum?id=1PL1NIMMrw.
- Besta et al. 2024 Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17682–17690, 2024.
- Hao et al. 2023 Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://openreview.net/forum?id=VTWWvYtF1R.
- Madaan et al. 2023 Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=S37hOerQLB.
- Song et al. 2023 Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M Sadler, Wei-Lun Chao, and Yu Su. Llm-planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2998–3009, 2023.
- Chen et al. 2024 Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=KuPixIqPiq.
- Wang et al. 2023e Ruocheng Wang, Eric Zelikman, Gabriel Poesia, Yewen Pu, Nick Haber, and Noah D Goodman. Hypothesis search: Inductive reasoning with language models. arXiv preprint arXiv:2309.05660, 2023e.
- McCoy et al. 2023 R Thomas McCoy, Shunyu Yao, Dan Friedman, Matthew Hardy, and Thomas L Griffiths. Embers of autoregression: Understanding large language models through the problem they are trained to solve. arXiv preprint arXiv:2309.13638, 2023.
- Wu et al. 2023b Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks. arXiv preprint arXiv:2307.02477, 2023b.
- Nye et al. 2021 Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.
- Chen and Mueller 2023 Jiuhai Chen and Jonas Mueller. Quantifying uncertainty in answers from any language model via intrinsic and extrinsic confidence assessment. arXiv preprint arXiv:2308.16175, 2023.
- Tian et al. 2023 Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://openreview.net/forum?id=g3faCfrwm7.
- Kuhn et al. 2023 Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=VD-AYtP0dve.
- Shafer and Vovk 2008a Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(12):371–421, 2008a. URL http://jmlr.org/papers/v9/shafer08a.html.
- Shafer and Vovk 2008b Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(12):371–421, 2008b. URL http://jmlr.org/papers/v9/shafer08a.html.
- Perez et al. 2023 Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Kernion, James Landis, Jamie Kerr, Jared Mueller, Jeeyoon Hyun, Joshua Landau, Kamal Ndousse, Landon Goldberg, Liane Lovitt, Martin Lucas, Michael Sellitto, Miranda Zhang, Neerav Kingsland, Nelson Elhage, Nicholas Joseph, Noemi Mercado, Nova DasSarma, Oliver Rausch, Robin Larson, Sam McCandlish, Scott Johnston, Shauna Kravec, Sheer El Showk, Tamera Lanham, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Jack Clark, Samuel R. Bowman, Amanda Askell, Roger Grosse, Danny Hernandez, Deep Ganguli, Evan Hubinger, Nicholas Schiefer, and Jared Kaplan. Discovering language model behaviors with model-written evaluations. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 13387–13434, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.847. URL https://aclanthology.org/2023.findings-acl.847.
- Qi et al. 2024 Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=hTEGyKf0dZ.
- Wei et al. 2023 Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=jA235JGM09.
- Bommasani et al. 2023 Rishi Bommasani, Percy Liang, and Tony Lee. Holistic evaluation of language models. Annals of the New York Academy of Sciences, 1525(1):140–146, 2023. doi: 10.1111/nyas.15007. URL https://nyaspubs.onlinelibrary.wiley.com/doi/abs/10.1111/nyas.15007.
- Mialon et al. 2024 Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=fibxvahvs3.
- Srivastava et al. 2023 Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=uyTL5Bvosj.
- Huang et al. 2023b Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. Benchmarking large language models as AI research agents. ArXiv, abs/2310.03302, 2023b. URL https://api.semanticscholar.org/CorpusID:263671541.
- Liu et al. 2024 Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating LLMs as agents. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=zAdUB0aCTQ.
- Corsello et al. 2017 Steven M Corsello, Joshua A Bittker, Zihan Liu, Joshua Gould, Patrick McCarren, Jodi E Hirschman, Stephen E Johnston, Anita Vrcic, Bang Wong, Mariya Khan, et al. The drug repurposing hub: a next-generation drug library and information resource. Nature Medicine, 23(4):405–408, 2017.
- Cohen-Boulakia et al. 2017 Sarah Cohen-Boulakia, Khalid Belhajjame, Olivier Collin, Jérôme Chopard, Christine Froidevaux, Alban Gaignard, Konrad Hinsen, Pierre Larmande, Yvan Le Bras, Frédéric Lemoine, et al. Scientific workflows for computational reproducibility in the life sciences: Status, challenges and opportunities. Future Generation Computer Systems, 75:284–298, 2017.
- Lamprecht et al. 2020 Anna-Lena Lamprecht, Leyla Garcia, Mateusz Kuzak, Carlos Martinez, Ricardo Arcila, Eva Martin Del Pico, Victoria Dominguez Del Angel, Stephanie Van De Sandt, Jon Ison, Paula Andrea Martinez, et al. Towards fair principles for research software. Data Science, 3(1):37–59, 2020.
- Zitnik et al. 2019 Marinka Zitnik, Francis Nguyen, Bo Wang, Jure Leskovec, Anna Goldenberg, and Michael M. Hoffman. Machine learning for integrating data in biology and medicine: Principles, practice, and opportunities. Information Fusion, 50:71–91, 2019. ISSN 1566-2535. doi: 10.1016/j.inffus.2018.09.012. URL https://www.sciencedirect.com/science/article/pii/S1566253518304482.
- Office of Science and Technology Policy 2023 Office of Science and Technology Policy. AI bill of rights. https://www.whitehouse.gov/ostp/ai-bill-of-rights/, 2023. Accessed: 02/2024.
- Guha et al. 2023 Neel Guha, Christie Lawrence, Lindsey A Gailmard, Kit Rodolfa, Faiz Surani, Rishi Bommasani, Inioluwa Raji, Mariano-Florentino Cuéllar, Colleen Honigsberg, Percy Liang, et al. AI regulation has its own alignment problem: The technical and institutional feasibility of disclosure, registration, licensing, and auditing. George Washington Law Review, 11 2023. Forthcoming, Available at SSRN: https://ssrn.com/abstract=4634443.
- Yang et al. 2024 Ziyi Yang, Shreyas S Raman, Ankit Shah, and Stefanie Tellex. Plug in the safety chip: Enforcing constraints for llm-driven robot agents. In International Conference on Robotics and Automation, 2024.
- Chen et al. 2021 Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv:2107.03374, 2021.
- Liang et al. 2023c Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500. IEEE, 2023c.
- Schwartz et al. 2023 Sivan Schwartz, Avi Yaeli, and Segev Shlomov. Enhancing trust in llm-based ai automation agents: New considerations and future challenges. In International Joint Conference on Artificial Intelligence, 2023.