
TaskWeaver: A Code-First Agent Framework

Bo Qiao, Liqun Li*, Xu Zhang*, Shilin He*, Yu Kang, Chaoyun Zhang, Fangkai Yang, Hang Dong, Jue Zhang, Lu Wang, Minghua Ma, Pu Zhao, Si Qin, Xiaoting Qin, Chao Du, Yong Xu, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Microsoft
taskweaver@microsoft.com
* Equal Contribution
Abstract

Large Language Models (LLMs) have shown impressive abilities in natural language understanding and generation, leading to their widespread use in applications such as chatbots and virtual assistants. However, existing LLM frameworks face limitations in handling domain-specific data analytics tasks with rich data structures. Moreover, they struggle with flexibility to meet diverse user requirements. To address these issues, TaskWeaver is proposed as a code-first framework for building LLM-powered autonomous agents. It converts user requests into executable code and treats user-defined plugins as callable functions. TaskWeaver provides support for rich data structures, flexible plugin usage, and dynamic plugin selection, and leverages LLM coding capabilities for complex logic. It also incorporates domain-specific knowledge through examples and ensures the secure execution of generated code. TaskWeaver offers a powerful and flexible framework for creating intelligent conversational agents that can handle complex tasks and adapt to domain-specific scenarios. The code is open-sourced at https://github.com/microsoft/TaskWeaver/.

1 Introduction

Large Language Models (LLMs), such as GPT [22, 12], Claude [11], PaLM [10], Gemini [24], Llama [26], and Mixtral [16], have demonstrated remarkable capabilities in natural language understanding and generation. These models have been widely used in various applications, including chatbots, virtual assistants, and content-generation systems. There is a growing potential for LLMs to revolutionize the way humans interact with machines, providing a more natural and intuitive experience.

Agents, specifically those that utilize Large Language Models (LLMs) or other AI technologies, are regarded as autonomous entities that can plan tasks, observe their surroundings, and execute appropriate actions accordingly [27, 30]. Several existing frameworks, including Langchain [5], Semantic Kernel [8], Transformers Agent [9], Agents [35], AutoGen [29], XAgent [25], and JARVIS [4], have endeavored to utilize LLMs for general-purpose task-oriented conversations. Other frameworks such as Open Interpreter [7], Cradle [23], and UFO [32] focus on more specialized tasks such as controlling the computer system. A recent work [13] introduces a Data Interpreter for the multi-agent framework MetaGPT [14] to enhance its power in data analytics. These frameworks enable users to interact with LLM-powered agents by issuing natural language requests and receiving responses in return. Nevertheless, these frameworks possess limitations that constrain their efficacy in handling domain-specific scenarios and data analytics tasks.

One major limitation is that most existing frameworks lack native support for rich data structures. LLM-powered agents often need to work with complex data structures, such as nested lists, dictionaries, or data frames, for data analytics applications and many other business scenarios. However, many existing frameworks struggle to handle these structures efficiently, particularly when transferring information between chat rounds or across different plugins. In such cases, these frameworks either persist data to disk or encode complex structures as strings or JSON objects in the prompts. Although these approaches are functional, they become impractical and error-prone, particularly when dealing with large datasets.

Another limitation of existing approaches is the lack of a mechanism for incorporating domain knowledge. While these frameworks provide tools and examples for prompt engineering, they fail to offer a systematic way to embed domain-specific knowledge into the planning and code-generation process. Consequently, it is challenging to control the planning and code-generation process in accordance with specific domain requirements.

Another issue encountered in many existing frameworks is their inflexibility, which hinders the ability to meet the diverse requirements of users. Although plugins can address common needs, they may fall short when it comes to handling ad-hoc queries. Writing a separate plugin for each ad-hoc query is impractical. In these situations, it becomes necessary for the agent to be able to write custom code to execute the user’s query. Therefore, there is a need for a solution that seamlessly integrates plugin execution with custom code execution to address this issue.

To address these limitations, we propose TaskWeaver, a code-first framework for building LLM-powered autonomous agents. The standout feature of TaskWeaver is its ability to convert each user request into executable code, treating user-defined plugins as callable functions. TaskWeaver overcomes the limitations of existing frameworks by providing support for rich data structures, flexible plugin usage, and dynamic plugin selection. It leverages the coding capability of LLMs to implement complex logic and incorporates domain-specific knowledge through examples. Additionally, TaskWeaver has made considerable efforts towards the secure execution of generated code and provides an easy-to-use interface for developers.

In this paper, we present the design and implementation of TaskWeaver, along with several case studies that demonstrate its effectiveness in handling various tasks. Overall, TaskWeaver provides a powerful and flexible framework for building intelligent conversational agents that can handle complex tasks and adapt to domain-specific scenarios.

2 Motivation and Requirements

To illustrate the TaskWeaver approach, let’s consider a real-world use case – conducting anomaly detection on time series data stored in an SQL database. Our goal is to apply a specialized anomaly detection algorithm to this data, which requires two input columns: one of type timestamp and the other of type float. The expected conversation between the user and the AI assistant proceeds as follows:

Figure 1: Chat between the user and the assistant.

The example demonstrates common actions in a data analysis task, which include obtaining data, performing analysis on the data, and visualizing the results. By examining the steps in this use case, we can identify key requirements imposed by this example:

  • Plugin: TaskWeaver must support invoking custom plugins. In this example, plugins are needed to pull data from the database and implement the specialized anomaly detection algorithm.

  • Rich data structure: TaskWeaver must be capable of handling data in complex structures, such as pandas DataFrame, to perform advanced data processing actions. Data in rich structures should be easy to pass from one plugin to another.

  • Stateful execution: TaskWeaver engages in iterative interactions with the user, processing user inputs and executing tasks accordingly. The execution state is preserved throughout the entire conversation session across multiple chat rounds.

  • Reasoning and action (ReAct): TaskWeaver is unaware of the data schema stored in the database prior to reading it. To generate the anomaly detection code, TaskWeaver must first inspect the data schema and then pass the corresponding column names to the anomaly detection algorithm.

  • Response in natural language: TaskWeaver consistently responds to the user in human-readable natural language. Generally, the anomaly detection algorithm returns a DataFrame, but TaskWeaver needs to provide a summary of the execution result, such as “11 anomalies were detected”.

  • Code generation: TaskWeaver must generate code to accommodate ad-hoc user demands, which are not covered by the pre-defined plugins. In the example provided, TaskWeaver generates code to visualize the detected anomalies.

  • Incorporating domain knowledge: TaskWeaver should provide a systematic way to incorporate domain-specific knowledge. This helps LLMs plan better and make accurate tool calls, which in turn produces more reliable results, particularly in complex domains.

  • Persisting artifact: TaskWeaver should offer a means of saving results, such as DataFrames or images, to persistent storage. Users can download the artifacts via the provided links.

We have so far summarized the requirements from our motivating example. In the following section, we are going to describe the design of TaskWeaver.

Figure 2: The overview of TaskWeaver

3 An Overview of TaskWeaver

Fig. 2 presents an overview of TaskWeaver, which consists of two key roles: the Planner and the Code Interpreter (CI). The CI consists of a Code Generator (CG), and a Code Executor (CE). The Planner serves as the system’s entry point and interacts with the user. Its responsibilities include: (1) planning – breaking down the user’s request into subtasks and managing the execution process with self-reflection; and (2) responding – transforming the execution result into a human-readable response for the user.

The CI is responsible for generating code snippets for any given task and running them to obtain the execution result. Specifically, the CG generates code for a given subtask from the Planner, taking existing plugins into account so that the generated code can include function calls for specific tasks. The examples within the CG guide it, particularly for domain-specific tasks unfamiliar to the LLM. Lastly, the CE is responsible for executing the generated code and maintaining the execution state throughout the entire session.

Figure 3: Workflow of TaskWeaver

A centralized Memory module maintains the chat history of the current conversation session between the user and TaskWeaver’s internal roles (i.e., the Planner and Code Interpreter). In addition, the Memory also stores useful information for each role, such as the step-wise plans of the Planner, the thoughts and the generated code snippets of the Code Interpreter, etc. These data are usually referred to as short-term memory as they are only relevant to the current conversation. TaskWeaver also has its long-term memory that can be shared across conversations, which will be detailed in the following sections.

Recall that we have motivated our design with an example of pulling data from a database and applying a custom anomaly detection algorithm to the data. We now explain how a sub-task of it is accomplished in TaskWeaver. The workflow is illustrated in Fig. 3. Due to space limitations, the prompts shown in Fig. 3 are simplified and serve only as an illustration. The actual ones are much more complicated.

The initial step involves the Planner taking the user query, Code Interpreter (CI) description, and, if provided, planning examples to generate a plan. The CI description outlines its code generation and execution capabilities. To enhance the Planner’s effectiveness in task planning, the CI description lists the available plugins that are callable by the CI. The output of the Planner is a step-by-step plan, according to which the Planner phrases the queries and communicates with the CI. The first step (highlighted in Fig. 3) consists of pulling data from the database and describing the data schema.

The Code Generator (CG) prompt delineates its profile and competencies, providing comprehensive definitions of all the relevant plugins. This includes the function name, its description, the arguments it accepts, and what it returns. Additionally, code generation examples may be incorporated into the prompt to steer the code generation process. The output from the CG is a code snippet that executes the sql_pull_data plugin, retrieves the data into a DataFrame, and describes the data schema.
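To make the first step concrete, the following is a minimal sketch of the kind of snippet the CG might produce. Only the plugin name sql_pull_data and the columns date and value come from the text; the plugin signature, the in-memory SQLite table, the sample rows, and the describe_schema helper are assumptions for illustration, and plain lists of dicts stand in for a DataFrame to keep the sketch dependency-free.

```python
import sqlite3


# Hypothetical stand-in for the sql_pull_data plugin named in the paper.
# The signature and the in-memory SQLite backend are assumptions.
def sql_pull_data(query: str) -> list[dict]:
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute("CREATE TABLE metrics (date TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO metrics VALUES (?, ?)",
        [("2023-01-01", 1.0), ("2023-01-02", 1.2), ("2023-01-03", 9.7)],
    )
    rows = [dict(row) for row in conn.execute(query)]
    conn.close()
    return rows


def describe_schema(rows: list[dict]) -> dict[str, str]:
    """Map each column name to the Python type name of its first value."""
    if not rows:
        return {}
    return {col: type(val).__name__ for col, val in rows[0].items()}


# The kind of code the CG might emit for step 1 of the plan:
# call the plugin, keep the data in memory, and report the schema.
data = sql_pull_data("SELECT date, value FROM metrics ORDER BY date")
schema = describe_schema(data)
print(schema)
```

The schema, rather than the full data, is what needs to reach the Planner for it to decide how to map columns to plugin arguments in the next step.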

The Code Executor (CE)’s execution result is sent back to the Planner to determine the next step in the plan. In practice, the Planner may modify its original plan if the outcome differs from expectations. In our example, the execution result reveals two columns, namely date and value, in the DataFrame. For the next step, the Planner can either confirm with the user if these columns correspond to the two input parameters ts_col and val_col of the anomaly_detection plugin, or directly proceed to the third step, as it is reasonable to assume that date and value likely represent ts_col and val_col, respectively. Regardless, TaskWeaver must first retrieve the data and understand its schema before making a decision for the second step, which involves a self-reflection process.

As shown in this example, TaskWeaver incorporates a two-layer planning process during the handling of user requests. The first layer consists of the Planner generating a high-level plan outlining the steps required to fulfill the request. Subsequently, in each round, the CI must devise a plan, in terms of chain-of-thought and generated code, to execute the specified step.

3.1 Concepts

We introduce some important concepts in the TaskWeaver system.

Session

A new session is established once the user submits their initial request to TaskWeaver. This session is terminated if the user intentionally resets the conversation or when a predetermined expiration time is reached following the last interaction. Multiple sessions may run concurrently, serving different users.

Round

A round commences with TaskWeaver receiving a user request and concludes upon responding to the user. Subsequently, TaskWeaver awaits the user’s next input. Typically, a session consists of numerous rounds. Responses to the user may either involve a message after completing a specific task or TaskWeaver requesting further input to successfully carry out the task.

Post

Within a round, there can be several messages exchanged between the Planner and CI, each of which is referred to as a post. A post contains the text message, such as the request from the Planner to CI for code generation, and may also include other relevant information like the generated code and execution logs, referred to as attachments. Each attachment has a type and its content.

The in-memory data structure of Sessions, Rounds, and Posts is illustrated in Fig. 4.
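The nesting of these concepts can be sketched with a few dataclasses. This is an illustrative model only: the class and field names below (Attachment, Post, Round, Session, sender, receiver, and so on) are assumptions and are not taken from the TaskWeaver code base.

```python
from dataclasses import dataclass, field


@dataclass
class Attachment:
    type: str      # illustrative labels such as "code" or "execution_log"
    content: str


@dataclass
class Post:
    sender: str
    receiver: str
    message: str
    attachments: list[Attachment] = field(default_factory=list)


@dataclass
class Round:
    user_query: str
    posts: list[Post] = field(default_factory=list)


@dataclass
class Session:
    session_id: str
    rounds: list[Round] = field(default_factory=list)


# One session, one round, two posts exchanged between the Planner and the CI.
session = Session(session_id="demo")
first_round = Round(user_query="Pull data from the database and describe its schema")
first_round.posts.append(
    Post("Planner", "CodeInterpreter", "Pull the data and show the schema")
)
first_round.posts.append(
    Post(
        "CodeInterpreter",
        "Planner",
        "The data has two columns: date and value",
        attachments=[Attachment("code", "data = sql_pull_data(query)")],
    )
)
session.rounds.append(first_round)
```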

Example

It is common practice to use in-context learning for guiding LLMs in content generation, which involves adding examples to the prompt to demonstrate inputs and their corresponding outputs. In TaskWeaver, we have pre-defined two types of examples for the Planner and the CG, respectively. The examples for the Planner contain the user’s request and the desired steps of subtasks, while each example for the CG has the task description as input and the generated code snippet as output. All examples follow the Rounds and Posts format and are saved in YAML files.

Plugin

Plugins in TaskWeaver represent user-defined functions that can be invoked in the generated code snippets. Adding new plugins is the major way of extending TaskWeaver’s capabilities, especially in handling complex tasks that require domain-specific knowledge. In TaskWeaver, Plugins are attached to the session, and therefore, different sessions can have different sets of plugins.

Role

In TaskWeaver, a role is conceptualized as an object instance capable of participating in a conversation by implementing a ‘reply’ interface. Within this framework, we identify the Planner and the CI as two pre-defined roles integral to TaskWeaver’s functionality. Notably, TaskWeaver is designed with flexibility in mind, allowing for the incorporation of additional roles. These roles are dynamically identified and integrated into the system during runtime. A distinctive attribute of the Planner role is its ability to communicate directly with the user, setting it apart from other roles such as the CI, which are constrained to interact solely with the Planner. This hierarchical communication model ensures structured and organized interactions within TaskWeaver’s architecture.
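The hierarchical communication model can be sketched as follows. The only detail taken from the text is that a role implements a reply interface and that the Planner is the sole role talking to the user; the class names, the string-based message type, and the stub behavior are assumptions for illustration.

```python
from typing import Protocol


class Role(Protocol):
    """Anything that can take part in the conversation via reply()."""

    name: str

    def reply(self, message: str) -> str: ...


class EchoCodeInterpreter:
    """Stub CI: pretends to generate and run code for the given sub-task."""

    name = "CodeInterpreter"

    def reply(self, message: str) -> str:
        return f"executed: {message}"


class SimplePlanner:
    """Stub Planner: the only role that receives user input directly."""

    name = "Planner"

    def __init__(self, workers: dict[str, Role]) -> None:
        self.workers = workers

    def reply(self, message: str) -> str:
        # Forward the request to the CI, then phrase the answer for the user.
        result = self.workers["CodeInterpreter"].reply(message)
        return f"Done. {result}"


planner = SimplePlanner({"CodeInterpreter": EchoCodeInterpreter()})
response = planner.reply("count rows in a.csv")
```

Because other roles are reached only through the Planner, adding a new role in this sketch amounts to registering another object with a reply method in the workers mapping.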

Figure 4: Concurrent sessions serving different users in parallel.

4 Design Considerations

In this section, we discuss the design considerations.

4.1 Code-First Analysis Experience

Python has emerged as the de-facto language for data analysis, and TaskWeaver leverages this popularity by converting user requests into Python programs that run on dedicated processes. Users can issue natural language requests to manipulate the state of the Python process with TaskWeaver, which utilizes popular libraries like numpy, pandas, sklearn, etc. Unlike other frameworks that rely on text or file-based expressions, TaskWeaver utilizes native data structures such as pandas DataFrame. This makes it easy to perform tasks such as pulling data from a database, running machine learning algorithms (e.g., anomaly detection, classification or clustering), summarizing results, and visualizing analysis outcomes.

4.2 Restricted Code Generation

The TaskWeaver approach involves a process of code verification (before code execution) to ensure that the generated code conforms to pre-defined rules and does not violate any safety requirements. This is achieved through the use of configurable rules that can be customized to suit different scenarios. By doing so, users can ensure that the code does not import any forbidden packages or invoke any unsafe functions. TaskWeaver also supports a dedicated plugin-only mode, leveraging the function calling capability of LLMs, where only the plugin functions can be called, and all other generated code is denied.

4.3 Stateful Code Execution

When users make ad-hoc requests for data analysis, it often involves multiple iterations. As a result, TaskWeaver needs to maintain the state of code execution throughout the entire session. This is similar to programming in Python using a Jupyter Notebook, where users type code snippets in a sequence of cells and the program’s internal state progresses sequentially. However, in TaskWeaver, users use natural language instead of a programming language. TaskWeaver converts each user request into one or more code snippets in each round, depending on the specific plan.
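The notebook analogy can be illustrated with a short sketch in which successive snippets share one namespace. This is a simplification under stated assumptions: the StatefulExecutor class is hypothetical and uses Python's built-in exec, not the mechanism TaskWeaver actually uses to keep state.

```python
class StatefulExecutor:
    """Runs code snippets against one shared namespace, like notebook cells."""

    def __init__(self) -> None:
        self.namespace: dict = {}

    def run(self, code: str) -> None:
        # Each snippet sees the variables defined by the earlier ones.
        exec(code, self.namespace)


executor = StatefulExecutor()
executor.run("values = [1.0, 1.2, 9.7]")               # round 1: load data
executor.run("mean = sum(values) / len(values)")         # round 2: analyze
executor.run("above_mean = [v for v in values if v > mean]")  # round 3: filter
```

Each call corresponds to code generated for a later user request, and it can refer to variables created in earlier rounds without reloading or re-serializing the data.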

4.4 Intelligent Plan Decomposition

The Planner decomposes users’ requests into sub-tasks that are simple enough to be executed by the CI. We identified three types of dependencies between sub-tasks: sequential, interactive, and none. Sub-task B is sequentially dependent on sub-task A if B must run after A is done. An interactive dependency implies the same execution order but additionally requires intervention by a human or an LLM between the two sub-tasks. For example, suppose the user instructs TaskWeaver to read a file and follow the instructions in its content. The Planner splits the request into two steps: (1) read the file content; (2) follow the instructions. The LLM needs to complete the first step and read the content before carrying out the second step. The third type, none, means that the sub-tasks do not depend on each other and can be conducted in parallel. According to this definition, two steps where one sequentially depends on the other can be merged into one step for code generation because no intervention is required. In TaskWeaver, the Planner performs two-phase planning: it is required to reconsider the generated plan and merge sub-tasks with sequential dependencies. This prevents the model from splitting the request into overly fine-grained steps, which would prolong execution and incur unnecessary cost from repeated LLM calls.

4.5 Self-Reflection

TaskWeaver is designed with the capacity to rectify errors throughout the planning and code generation stages. When it detects that the outcomes of the preceding steps diverge from the anticipated results, the Planner can reassess and modify its plan, exploring alternative approaches. Furthermore, the CI is equipped to evaluate the results of code execution. Should the code execution process encounter an exception or the code fail to pass the verification, the CI can initiate a re-generation of the code, thereby attempting to correct it. This iterative process keeps task execution robust and adaptable, increasing the reliability and efficiency of the framework.

4.6 Scalable Plugin Usage

In TaskWeaver, plugins are specialized Python functions used to handle tasks that are either too complex or require specific domain knowledge. Because TaskWeaver can already handle general Python code generation, this reduces the number of plugins needed. Furthermore, TaskWeaver features dynamic plugin selection. This means that after a user request is received, only the plugins that are relevant to that request are selected from a pool of available plugins. This approach ensures that TaskWeaver uses the most appropriate tools for the task at hand without overloading the prompt with unnecessary functions.
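The idea of dynamic plugin selection can be sketched as a ranking of plugin descriptions against the request. The text above does not specify how relevance is computed, so the word-overlap score below is purely an assumption chosen to keep the sketch self-contained; the select_plugins function, the plugin pool, and the send_email plugin are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_plugins(request: str, plugins: dict[str, str], top_k: int = 2) -> list[str]:
    """Return up to top_k plugin names whose descriptions share words with the request."""
    request_tokens = tokenize(request)
    scored = []
    for name, description in plugins.items():
        overlap = len(request_tokens & tokenize(description))
        if overlap > 0:
            scored.append((overlap, name))
    # Highest overlap first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_k]]


# Hypothetical pool: plugin name -> short description.
PLUGIN_POOL = {
    "sql_pull_data": "pull data from a sql database into a table",
    "anomaly_detection": "detect anomalies in time series data",
    "send_email": "send an email message to a recipient",
}

selected = select_plugins("detect anomalies in the database time series", PLUGIN_POOL)
```

Only the selected plugins would then have their definitions included in the code generation prompt, which keeps the prompt short as the pool grows.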

4.7 Incorporating Domain Knowledge

One way to incorporate domain knowledge is by defining custom plugins, as discussed previously. However, for domain-specific tasks, it can be challenging for the LLM to generate the correct code to call the plugins or to make a good plan. To address this, we have introduced an interface in TaskWeaver that allows users to guide the system for such difficult tasks. TaskWeaver enables users to configure examples to teach the LLM how to respond to certain requests. For instance, a conversation history containing step-by-step thoughts and request/response sections can be used as an example. There are two types of examples in TaskWeaver: one is used for planning and the other for code generation. By using examples, TaskWeaver can incorporate domain-specific knowledge and improve the LLM’s ability to generate accurate plans and code for difficult tasks.

4.8 Security and Safety

Executing freely generated code can introduce security risks. For example, a malicious user may ask TaskWeaver to open a file containing security keys, delete system files, or terminate a process. A typical way to mitigate the security risks is to include certain statements in the prompts to the LLM, which is implemented in TaskWeaver. To further prevent these malicious behaviors, TaskWeaver allocates a separate worker process running inside a session-associated Docker container. This architectural choice ensures that each worker process operates independently, isolated from processes of the host OS as well as from other sessions, which limits the damage that malicious code can cause.

4.9 Usability

Existing LLM frameworks such as Langchain [5] make it easy to build proof-of-concept demos. However, building a reliable system with these frameworks can be time-consuming due to the large number of components they provide (like a large box of Lego bricks). One of the main goals of TaskWeaver is ease of use. Typically, users only need to customize the plugins to get started. For more difficult tasks, users can customize the examples for code generation and planning. To aid users, TaskWeaver includes various tools, such as a tool to convert Python functions into plugins and a tool to save existing conversations as examples. These tools make it easier for users to customize the system. Another key feature of TaskWeaver is its support for multi-tenancy. The system implements a session manager to isolate different user sessions, making it easy to serve multiple users as a service.

4.10 Cost Effectiveness

The cost of calling LLMs can be significant, and TaskWeaver addresses this issue by letting different roles be configured with different LLMs. For instance, GPT-3.5 is much cheaper than GPT-4. Therefore, for simpler tasks, we may use a cheaper model to reduce the overall cost.

5 TaskWeaver in Detail

We now explain each module in its own subsection.

5.1 Planner

The Planner serves as a crucial component in the TaskWeaver system, where it decomposes requests from users into several sub-tasks and orchestrates capabilities within TaskWeaver to complete the task and report back to the users. From a high-level perspective, the Planner functions as the entry point and controller, managing the entire system. As illustrated in Figure 2, the Planner communicates bidirectionally with each component, sending queries and receiving responses.

As shown in Fig. 3, a typical working routine of the Planner is as follows. The Planner first receives a query from users and then decides to decompose it into multiple sub-tasks. These sub-tasks essentially form an “Initial Plan”, which is generated based on the knowledge of LLMs or enhanced by domain-specific “Examples”. After drafting the initial plan, the Planner is requested to refine it by considering the dependencies among the sub-tasks, as discussed in Sec. 4.4, in a chain-of-thought manner. The Planner may merge multiple sub-tasks into one in its “Final Plan”. Table 1 shows two examples of the Planner’s initial plan and final plan with respect to the user’s request. In the first example, the initial plan has 4 steps, and the first 3 steps are merged into one in the final plan. In the second example, the final plan is the same as the initial plan because the Planner needs to read the file content before it can determine the next step.

User request: Load data from a.csv, show the column names, and count the number of rows
  Initial Plan:
    1. load a.csv
    2. extract and display the column names <sequentially depends on 1>
    3. count the number of rows <sequentially depends on 1>
    4. report the information to the user <interactively depends on 2,3>
  Final Plan:
    1. load a.csv, extract the column names, and count the number of rows
    2. report the information to the user

User request: Read file manual.txt and follow the instructions in it
  Initial Plan:
    1. read manual.txt and show its content
    2. follow the instructions according to the file content <interactively depends on 1>
    3. report the result to the user <interactively depends on 2>
  Final Plan:
    1. read manual.txt and show its content
    2. follow the instructions according to the file content
    3. report the result to the user

Table 1: Two examples of the initial plans and the corresponding final plans.
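The merging rule behind Table 1 can be made explicit with a small sketch. In TaskWeaver the refinement is carried out by the LLM when it reconsiders its plan, so the deterministic function below is not the framework's method; it is an assumed formalization in which a step joins the current group only when it depends sequentially on steps already in that group. The Step class and merge_sequential function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    index: int
    text: str
    depends_on: list[int] = field(default_factory=list)
    dependency: str = "none"  # "sequential", "interactive", or "none"


def merge_sequential(steps: list[Step]) -> list[list[int]]:
    """Group step indices; a step joins the current group only if it depends
    sequentially on steps that are all already in that group."""
    groups: list[list[int]] = []
    for step in steps:
        can_merge = (
            bool(groups)
            and step.dependency == "sequential"
            and bool(step.depends_on)
            and all(dep in groups[-1] for dep in step.depends_on)
        )
        if can_merge:
            groups[-1].append(step.index)
        else:
            groups.append([step.index])
    return groups


# First example from Table 1: steps 1-3 collapse into one final step.
csv_plan = [
    Step(1, "load a.csv"),
    Step(2, "extract and display the column names", [1], "sequential"),
    Step(3, "count the number of rows", [1], "sequential"),
    Step(4, "report the information to the user", [2, 3], "interactive"),
]

# Second example from Table 1: interactive dependencies prevent any merge.
manual_plan = [
    Step(1, "read manual.txt and show its content"),
    Step(2, "follow the instructions according to the file content", [1], "interactive"),
    Step(3, "report the result to the user", [2], "interactive"),
]

csv_groups = merge_sequential(csv_plan)
manual_groups = merge_sequential(manual_plan)
```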

After the plan is finalized, the Planner takes action by assigning each sub-task, with phrased queries, to ask the CI to generate a code snippet, leveraging in-domain plugins when necessary. The execution results are sent back to the Planner. Following the ReAct (reasoning and act) design pattern, upon observing the execution results, the Planner may update its plan, request additional information from users, and so on. The process is repeated for subsequent sub-tasks until the entire plan is completed.

When addressing domain-specific scenarios where the LLM’s own knowledge is insufficient, the Planner can incorporate external knowledge with “Examples”. These examples are tailored by scenario developers according to their usage requirements. One example is essentially the chat history between the User, the Planner, and the Code Interpreter, including the plans of the Planner. The Examples could be saved from an online conversation for the purpose of stabilizing the planning process for future requests, or they could be manually prepared. A real planning Example is provided in Appendix B.

5.2 Code Generator (CG)

The Code Generator is designed to utilize LLMs to automatically synthesize a Python code snippet based on an incoming request. From a high-level view, the CG combines the benefits of both the plugin system and the code interpreter: it can invoke plugins and generate additional code. Inside the CG, plugins and examples customized by users are leveraged to generate the code.

  1. Plugin: In general, plugins can take on various forms such as a web API call, a software module, a customized algorithm, or a deep learning model. Regardless of the form, all variants can be invoked by a function call. Therefore, we have encapsulated each plugin as a Python function within the Code Generator for seamless invocation.

  2. Example: To help LLMs adapt to domain-specific scenarios, users can supply examples that demonstrate the desired behavior in context. This ensures that the generated code aligns with the specific requirements of the user’s domain. A real code generation Example is provided in Appendix C.

The CG has the ability to generate code that exclusively calls upon plugins, code that does not use any plugins, or a combination of both. The code that doesn’t depend on plugins is specifically tailored to handle requests that cannot be met using only existing plugins. In the motivating example presented in Section 2, there is a plugin called ‘anomaly_detection’ that can identify anomalies in data. In this scenario, the CG would first invoke the plugin through a single line of code. Subsequently, the CG would generate multiple lines of additional code to visualize the detected anomalies, as requested by the user.

The knowledge of LLMs is often limited, particularly in domain-specific scenarios. In many cases, there are in-domain tools better suited for completing a specific task. Therefore, the Plugin system can enhance LLMs by enabling them to leverage existing tools. The Plugin system consists of the following two components:

  1. Plugin Schema: LLMs need to understand the capabilities of a plugin, including its arguments and return values. This information is embedded in the plugin schema file (in YAML format) and could be customized by users. The plugin schema comprises the name, metadata, plugin description, arguments, and return values. For each argument and return value, users should provide its name, type, and description. The plugin schema is then supplied to LLMs as part of the prompt to generate the code.

  2. Plugin Implementation: The plugin implementation defines how each plugin is executed using Python in Code Executor, where the arguments and return values should correspond with the Plugin Schema. Various plugins can be implemented in Python code, such as a domain-specific algorithm, a software module, a deep learning model or a Web API call. It is important to note that LLMs do not need to know the plugin implementation details to generate code. On the contrary, the Plugin Schema is used for code generation.

Appendix D shows an example of the plugin schema and implementation.
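As a rough illustration of how the two components relate, the sketch below pairs a schema with an implementation and renders the schema into prompt text. It is not the format from Appendix D: the schema is written as a Python dict instead of YAML, the key names, the render_schema helper, the rows argument, and the toy standard-deviation rule are all assumptions. Only the plugin name anomaly_detection and the arguments ts_col and val_col come from the paper.

```python
import statistics

# Hypothetical schema, written as a dict rather than the YAML file the paper describes.
ANOMALY_DETECTION_SCHEMA = {
    "name": "anomaly_detection",
    "description": "Flag values that deviate strongly from the series mean.",
    "args": [
        {"name": "rows", "type": "list[dict]", "description": "input records"},
        {"name": "ts_col", "type": "str", "description": "timestamp column name"},
        {"name": "val_col", "type": "str", "description": "value column name"},
    ],
    "returns": [
        {"name": "anomalies", "type": "list[dict]", "description": "flagged records"},
    ],
}


def render_schema(schema: dict) -> str:
    """Turn a schema into function-style text that could be placed in a prompt."""
    args = ", ".join(f"{a['name']}: {a['type']}" for a in schema["args"])
    rets = ", ".join(r["type"] for r in schema["returns"])
    return f"{schema['name']}({args}) -> {rets}  # {schema['description']}"


def anomaly_detection(rows: list[dict], ts_col: str, val_col: str) -> list[dict]:
    """Toy implementation: flag rows whose value lies more than one population
    standard deviation from the mean. The LLM never sees this body."""
    ordered = sorted(rows, key=lambda row: row[ts_col])
    values = [row[val_col] for row in ordered]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [row for row in ordered if abs(row[val_col] - mean) > std]


prompt_line = render_schema(ANOMALY_DETECTION_SCHEMA)
sample = [
    {"date": "2023-01-01", "value": 1.0},
    {"date": "2023-01-02", "value": 1.2},
    {"date": "2023-01-03", "value": 9.7},
]
anomalies = anomaly_detection(sample, ts_col="date", val_col="value")
```

The separation means the implementation can change freely as long as its arguments and return values still match what the schema advertises.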

Even with explicit instructions added to the prompt, such as restricting the call to plugin functions, prohibiting local file system modifications, or disallowing certain package installations, the LLM may still generate code that disregards these instructions. This necessitates a post-verification process to ensure the generated code is safe to execute. Following code generation, the CG parses the code into an Abstract Syntax Tree (AST) and examines it line by line for any violations. If violations are detected, an error message is reported, and a new code snippet is generated. The CG will retry this process several times; if violations persist, the CG will notify the Planner of its failure to generate compliant code.
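A minimal sketch of AST-based post-verification is shown below, using Python's standard ast module. The rule set (blocked modules and blocked function names) and the verify_code function are illustrative assumptions and do not reflect TaskWeaver's actual configurable rules. The check only catches direct imports and calls by bare name, so it is far from a complete sandbox.

```python
import ast


def verify_code(code: str, blocked_modules: set[str], blocked_calls: set[str]) -> list[str]:
    """Return a list of rule violations found in `code` (empty if compliant)."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"line {exc.lineno}: syntax error: {exc.msg}"]

    violations = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # Covers "import os" and "import os.path".
            for alias in node.names:
                if alias.name.split(".")[0] in blocked_modules:
                    violations.append(f"line {node.lineno}: import of '{alias.name}' is not allowed")
        elif isinstance(node, ast.ImportFrom):
            # Covers "from os import remove".
            if node.module and node.module.split(".")[0] in blocked_modules:
                violations.append(f"line {node.lineno}: import from '{node.module}' is not allowed")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Covers calls by bare name such as eval(...).
            if node.func.id in blocked_calls:
                violations.append(f"line {node.lineno}: call to '{node.func.id}' is not allowed")
    return violations


BLOCKED_MODULES = {"os", "subprocess"}
BLOCKED_CALLS = {"eval", "exec"}

unsafe = "import os\nresult = eval('1 + 1')\n"
safe = "values = [1, 2, 3]\ntotal = sum(values)\n"

unsafe_report = verify_code(unsafe, BLOCKED_MODULES, BLOCKED_CALLS)
safe_report = verify_code(safe, BLOCKED_MODULES, BLOCKED_CALLS)
```

The returned messages are the kind of feedback that could be handed back to the code generator when a new snippet is requested.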

Code auto-correction shares similarities with the post-verification process. If the generated code fails to execute, the exception message is reported to the CG, prompting it to retry generating the code. The primary difference is that post-verification operates within the CG, while code auto-correction relies on the CE to run the code and report errors. Notably, if the execution fails, the Planner can directly redirect the message to the CG instead of synthesizing the report via the LLM. TaskWeaver permits the code to regenerate up to a maximum of three times.
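The auto-correction loop can be sketched as follows. The generate callable stands in for an LLM call and is entirely hypothetical, as are the function names; the sketch also treats the limit of three as the total number of attempts, which is one reading of the text.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # assumed interpretation of the limit stated in the paper


def generate_with_retries(
    task: str,
    generate: Callable[[str, Optional[str]], str],
) -> tuple[bool, str, int]:
    """Generate code, run it, and feed any exception back into the next attempt.

    Returns (succeeded, last_code, attempts_used)."""
    feedback: Optional[str] = None
    code = ""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        code = generate(task, feedback)
        try:
            exec(code, {})  # fresh namespace per attempt, for simplicity
            return True, code, attempt
        except Exception as exc:
            # The error message becomes the feedback for the next generation.
            feedback = f"{type(exc).__name__}: {exc}"
    return False, code, MAX_ATTEMPTS


def fake_generator(task: str, feedback: Optional[str]) -> str:
    """Stand-in for the LLM: the first draft forgets to define `values`."""
    if feedback is None:
        return "total = sum(values)"
    return "values = [1, 2, 3]\ntotal = sum(values)"


def always_broken(task: str, feedback: Optional[str]) -> str:
    """Stand-in for an LLM that never produces working code."""
    return "raise ValueError('still broken')"


succeeded, final_code, attempts = generate_with_retries("sum the values", fake_generator)
failed, _, failed_attempts = generate_with_retries("sum the values", always_broken)
```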

5.3 Code Executor (CE)

The Code Executor (CE) receives the code generated by the CG, collects the dependent modules and plugin definitions, executes the code, preserves context information such as logs, and returns an execution result to the Planner. In TaskWeaver, we implement the CE on top of the Python Jupyter kernel. To prevent interference between different sessions, the CE maintains a separate Jupyter process for each session. Our current implementation supports two execution modes: local and container. In local mode, the Jupyter kernel is launched as a local process, while in container mode, the kernel runs inside a Docker container. When a session concludes, the Jupyter process is terminated. The execution result, which is returned to the Planner, contains the following parts:

  • Return Code: 0 if the code executed successfully, and 1 otherwise.

  • Logs: Logs are generated in two ways: (1) the stdout/stderr output of the program, and (2) log messages recorded using a logging utility within the plugins.

  • Output: This is the output of the Jupyter cell running the generated code.

  • Artifacts: The generated code or the plugin calls may produce artifacts such as a CSV file or an image. The user can download the artifacts via the provided URLs.

If the CE fails to execute the code from the CG, it reports the error logs to the CG, which then attempts to revise the code to fix the issue. All the error information and the failed code are kept in the CG’s conversation history so that the CG is aware of the full execution history and can precisely understand the state of the CE.
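To make the structure of the execution result concrete, the sketch below models its four parts as a small data class and renders them as a feedback message. The class name ExecutionResult, its field names, and the helper format_feedback are hypothetical; only the four parts themselves and the wording of the status line come from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExecutionResult:
    """Hypothetical container mirroring the four parts listed above."""

    return_code: int  # 0 on success, 1 otherwise
    logs: List[str] = field(default_factory=list)  # stdout/stderr and plugin log messages
    output: str = ""  # output of the Jupyter cell
    artifacts: List[str] = field(default_factory=list)  # URLs of generated files

    @property
    def succeeded(self) -> bool:
        return self.return_code == 0


def format_feedback(result: ExecutionResult) -> str:
    """Render the result as a message that could be appended to the CG's history."""
    status = "succeeded" if result.succeeded else "failed"
    lines = [f"The execution of the generated python code above has {status}"]
    if result.logs:
        lines.append("During execution, the following messages were logged:")
        lines.extend(result.logs)
    if result.output:
        lines.append(f"The output is: {result.output}")
    return "\n".join(lines)
```

Keeping the failed code and a message of this kind in the conversation history is what lets the next generation attempt see exactly what went wrong.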

5.4 Experiences and Personalization

In practical scenarios, when a user tasks TaskWeaver with solving a complex problem, the system may initially falter. However, with repeated attempts and additional instructions from the user, TaskWeaver can eventually find a solution. A challenge arises when a user presents a similar or identical problem at a later time: TaskWeaver struggles to provide the correct solution promptly because it lacks the capability to remember past experiences. To address this, we have introduced a feature known as experience memory within the Memory module of TaskWeaver.

Through this mechanism, a user can command TaskWeaver to record the chat history. The system then distills ‘experience tips’ from this history, which encapsulate actionable insights about what to do, or what not to do, when handling similar requests. These insights are then stored in an ‘experience pool’. When TaskWeaver encounters similar requests in the future, it draws upon this pool to inform its strategy for planning and code generation. The retrieved experiences are integrated into the prompts used by the Planner and the Code Generator (CG), thereby enhancing TaskWeaver’s response to subsequent, similar requests.

The utility of the experience memory extends beyond merely aiding in the resolution of challenging problems; it also serves to capture and incorporate user preferences into interactions with the agent. For instance, if a user exhibits a predilection for plotting charts with specific color schemes, this preference is gleaned from the chat history and preserved as an experience tip. Consequently, such personalized preferences are automatically considered in relevant future tasks.

6 Expansion to Multi-Agent Architecture

This section elucidates the rationale and methodologies for transitioning TaskWeaver into a multi-agent system architecture. The shift towards a multi-agent paradigm confers multiple advantages:

  • Modularity: It is advantageous to decompose a complex system into a consortium of agents, where each is tasked with a discrete set of functions. This modularity enhances the manageability and maintainability of the system.

  • Extensibility: The multi-agent architecture allows for the facile addition of new functionalities. One can introduce new agents with the requisite capabilities into the existing framework without necessitating alterations to the core codebase.

There are two principal strategies for incorporating TaskWeaver within a multi-agent environment, as depicted in Figure 5:

  1. Agent Collaboration via Plugins or Roles: As shown in Figure 5(a), one strategy involves a TaskWeaver-powered agent invoking other agents through plugins, or encapsulating the functionality of existing agents within newly defined roles. The choice between implementing a plugin or a role is influenced by the specific function provided by the external agent.

  2. Integration into an Existing Framework: The alternative strategy, illustrated in Figure 5(b), embeds TaskWeaver-enhanced agents into a pre-existing multi-agent framework. This integration may require the establishment of a coordination mechanism to govern the interactions among the agents.

Figure 5: TaskWeaver in a multi-agent environment.

7 Evaluation

The evaluation of an LLM-based agent’s performance can be challenging. Current evaluation methods often treat the LLM agent as a function that maps input data to output data. When assessing the agent’s performance on a multi-step task, the evaluation process resembles a chain of calls to a stateful function. Typically, the agent’s output is compared to a ground truth or a reference output to judge its effectiveness. As the agent’s output is in natural language, evaluation is commonly conducted by matching keywords or phrases in the output to the ground truth.

However, this evaluation method has limitations due to its rigid nature. It may struggle to effectively evaluate long and complex outputs, especially when matching keywords is not sufficient. Handling different formats, such as dates or numbers, can pose challenges for the evaluation method. Additionally, the method should ideally exhibit a level of understanding similar to that of a human, allowing for contextual comprehension and interpretation of the output. For instance, when different agents are tasked with the same objective, they may exhibit varying behaviors while still producing correct outputs.

The example in Table 2 illustrates this point:

Human: What is the weather today?
Agent 1: It is sunny today in New York.
Human: What is the weather today?
Agent 2: Do you want to know the weather in New York today?
Human: Yes.
Agent 2: It is sunny today.
Table 2: Two agents answering the same question.

Compared to Agent 1, Agent 2 asks for confirmation before providing the answer, which requires more interaction with the user. Both agents nevertheless provide the correct answer to the question. If the evaluation method treats the agent as a function, however, it may not accommodate these different behaviors and may judge Agent 2 as incorrect (as its first response does not match the ground truth, e.g., "sunny").

This leads us to propose a more adaptable evaluation approach that introduces two new roles: the Examiner and the Judge. In this approach, for each test case, the Examiner is initially provided with the task description and assumes the responsibility of supervising the conversation with the evaluation target, i.e., the agent. The Examiner has the authority to ask the agent questions and must ensure that the conversation stays aligned with the task at hand. Additionally, the evaluation target is permitted to seek clarification on the task by posing questions to the Examiner. Notably, the Examiner is solely responsible for providing the task description and is prohibited from offering any hints or solutions.

Once the evaluation target presents a solution, the Examiner concludes the conversation and forwards the solution to the Judge for evaluation against the ground truth. This method stands in contrast to the traditional evaluation approach, as it effectively mitigates the limitations previously mentioned.
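The Examiner and Judge protocol can be outlined as a short control loop. The sketch below is only an illustration under stated assumptions: every role is passed in as a plain callable, whereas in practice the Examiner and the Judge would be LLM-backed; the function evaluate_case, its parameters, and the turn limit are hypothetical.

```python
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]  # (speaker, message) pairs


def evaluate_case(
    task: str,
    ground_truth: str,
    agent: Callable[[str], str],
    examiner: Callable[[str, str], str],
    is_solution: Callable[[str], bool],
    judge: Callable[[str, str], bool],
    max_turns: int = 5,  # assumption: a turn limit to guarantee termination
) -> Tuple[bool, Transcript]:
    """Run one test case and return (passed, transcript)."""
    transcript: Transcript = []
    message = task  # the Examiner opens with the task description only
    for _ in range(max_turns):
        transcript.append(("examiner", message))
        reply = agent(message)
        transcript.append(("agent", reply))
        if is_solution(reply):
            # The Examiner ends the conversation; the Judge compares the
            # solution against the ground truth.
            return judge(reply, ground_truth), transcript
        # Otherwise the Examiner answers the clarification, without hints.
        message = examiner(task, reply)
    return False, transcript
```

With this loop, an agent that first asks "Do you want to know the weather in New York today?" and then answers "It is sunny today." is judged on its final solution rather than on its first response.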

7.1 Datasets

Eval-Cases

TaskWeaver includes a set of test cases specifically designed to verify that it meets our design goals. These test cases cover several aspects, such as plugin usage, code generation, plan decomposition, reasoning and action, stateful execution, security, etc. The test cases also examine common agent skills like web searching and document retrieval. There are a total of 23 test cases.

DS-1000

DS-1000, introduced by Lai et al. [18], is a code generation benchmark tailored to assess the capability of Large Language Models (LLMs) in generating code for data science questions gathered from StackOverflow. Each test case in the benchmark comprises a problem description followed by a sample code snippet that requires completion. The sample code can be of two types: either requiring completion at the end of the snippet or requiring insertion within the existing code. Furthermore, each question is linked to a Python package, such as Pandas, Numpy, Scipy, Tensorflow, or Matplotlib, as an evaluation dependency. Notably, to heighten the evaluation’s complexity, the creators of DS-1000 [18] intentionally modified the original questions in various ways.

Upon reviewing DS-1000, it becomes evident that its original design did not specifically cater to evaluating an agent’s performance. Although agents could potentially respond to the task of completing the missing code, their role would be limited to that of a chatbot. To address this limitation, we have undertaken a transformation of the test cases to render them more suitable for evaluating an agent’s performance. Firstly, all test cases have been converted into insertion test cases. Additionally, we have modified the task to involve filling in the blanks within code snippets, executing the code, and subsequently reporting the completed code. To facilitate the code execution requirement, we have filtered out a subset of test cases where the given sample code is not executable. Specifically, test cases with a problem_id greater than 817 have been disregarded, as a majority of cases in this subset contain code snippets that cannot be run. (The sample code of these cases typically contains expressions like df = load_data(), which represent a data loading function whose actual implementation is missing.) The remaining 816 test cases serve as the basis for evaluating the agent’s performance. An example test case after transformation is presented in Appendix E.

InfiAgent-DABench

InfiAgent-DABench, introduced by Hu et al. [15], is a benchmark designed to assess agents’ performance on data analytics tasks. The benchmark comprises a total of 258 test cases, each accompanied by an input file in CSV format. Additionally, each test case presents one or more questions related to the data within the file. In our evaluation with TaskWeaver, we have slightly adapted each case so that it begins with loading the data file, followed by answering the questions based on the loaded data. An example test case after transformation is presented in Appendix F.

DSEval

DSEval was proposed by Zhang et al. [33] to evaluate the performance of data science agents. It consists of four benchmarks, namely Exercise, SO, LeetCode, and Kaggle. In total, there are 294 problem sets, where each problem set is a Python file containing multiple problems and each problem has a user input question and the groundtruth code. In evaluation, DSEval will execute both groundtruth code and the code generated by an agent, and then compare the execution output, the Python namespace, etc. We follow the design of DSEval framework with some slight changes to the evaluation code to make it fit for TaskWeaver. The Pass Rate (%PASS) is the number of problems passed divided by all problems in the benchmark, and is used as the metric to assess the performance.

7.2 Overview of Evaluation Result

We conducted evaluations on all the test cases from the datasets mentioned earlier. Each test case awards one or several points upon successful completion. As an agent provides solutions, we count the points earned and compute a normalized score ranging from 0 to 1 to represent the test case’s final score. We put TaskWeaver to the test using various LLMs as the underlying model. For each LLM, we performed at least two runs of all test cases and calculated the mean score, which we present as the LLM’s overall performance score. The summarized results are shown in the table below.

Benchmarks GPT3.5 GPT4
Eval-Cases 0.42 0.87
DS-1000 0.40 0.60
InfiAgent-DABench 0.70 0.88
DSEval 0.36 0.72
Table 3: Evaluation Results on Benchmarks
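The scoring scheme described above can be written down in a few lines. This is a sketch under an assumption: we take the overall score to be the mean over runs of the per-run mean normalized score, which is one plausible reading of the procedure. The function names are hypothetical.

```python
from statistics import mean
from typing import List, Tuple

# One run is a list of (points earned, maximum points) pairs, one per test case.
Run = List[Tuple[float, float]]


def normalized_score(points_earned: float, max_points: float) -> float:
    """Scale the points earned on one test case to the range [0, 1]."""
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    return points_earned / max_points


def overall_score(runs: List[Run]) -> float:
    """Average the normalized scores within each run, then across runs."""
    per_run = [
        mean([normalized_score(earned, maximum) for earned, maximum in run])
        for run in runs
    ]
    return mean(per_run)
```

For example, two runs over two test cases with results [(1, 1), (0, 2)] and [(1, 1), (2, 2)] have per-run means of 0.5 and 1.0, giving an overall score of 0.75.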

The performance of GPT-4 consistently surpasses that of GPT-3.5, underscoring its enhanced capabilities in comprehending problems, strategizing task execution, and generating code. When evaluated across three external benchmark datasets, both GPT-3.5 and GPT-4 achieve their highest scores on the InfiAgent-DABench. This superior performance can be attributed to TaskWeaver’s proficiency in tasks that involve data loading followed by analytical operations. However, when it comes to the DS-1000 dataset, which is derived from StackOverflow queries, the models face challenges. The ambiguity of the questions and the complexity of evaluating answers, particularly those involving DataFrames, make accurate assessment difficult. This suggests that while GPT-4’s advancements are notable, there is still room for improvement in processing and generating precise answers from less structured and clear data sources.

7.3 Evaluations with the Eval-Cases in TaskWeaver

The detailed evaluation results for the 23 test cases are depicted in Table 4. Although these cases were crafted to assess compliance with our predefined requirements, it is important to note that their successful completion by TaskWeaver is not unequivocally assured. This is attributed to various factors, including the inherent variability in LLM performance. A notable observation from our study is that the performance of a given LLM endpoint is not consistent, exhibiting fluctuations over time and across different host machines. Such variability poses a significant challenge for applications based on LLMs.

In Table 4, the ‘Score’ column enumerates the points accrued across various test cases when utilizing GPT3.5 and GPT4. Meanwhile, the ‘Normalized Score’ column provides a proportional representation of these scores, scaled relative to the maximum possible points for each case. On average, the normalized scores for GPT3.5 and GPT4 stand at 0.42 and 0.87, respectively, indicating a substantial performance differential between the two models.

ID Case Name Score (GPT3.5) Score (GPT4) Normalized Score (GPT3.5) Normalized Score (GPT4)
1 web_search 2 2 1 1
2 web_search_calc 0 3 0 1
3 echo 0 2 0 1
4 sample_code 2 2 1 1
5 shopping_plan 0 2.5 0 0.71
6 save_file 0 5 0 1
7 context_length 0 10 0 0.67
8 rag 1 1 1 1
9 stock_forecasting 3 1 0.75 0.25
10 delete_files 0 1 0 1
11 get_secret_key 1 1 1 1
12 planner_consolidation 0 2 0 0.67
13 stateful 1 1 1 1
14 anomaly_detection 0 4 0 0.67
15 data_processing 0 3 0 1
16 list_files 1 0 1 0
17 auto_plugin_selection 0 2 0 1
18 run_in_container 1 1 1 1
19 file_chain 0 1 0 1
20 response_format 0 1 0 1
21 command_line 1 1 1 1
22 calc_mean 0 1 0 1
23 plugin_only 1 1 1 1
Average 0.61 2.11 0.42 0.87
Table 4: Combined results from GPT-3.5 and GPT-4

Upon detailed examination of Table 4, it becomes evident that GPT3.5 surpasses GPT4 in only two instances, specifically in test cases #9 and #16. In test case #9, which involves predicting stock prices, both GPT3.5 and GPT4 successfully downloaded historical price data. However, only GPT3.5 managed to accurately predict future prices within the evaluation period. For test case #16, which requires listing files in the current working directory, GPT4 did not execute the task, citing an inability to access the local file system, while GPT3.5 completed the task without issue.

7.4 Evaluations with the DS-1000 dataset

Following the transformation process in Section 7.1, the DS-1000 dataset comprises 816 individual test cases. These cases are categorized across five widely used libraries dedicated to data analytics tasks, with the distribution illustrated in Fig. 6. To modify the complexity of the original test cases, the developers of DS-1000 introduced various perturbations. The perturbation types and their respective frequencies within the test cases are depicted in Fig. 7.

Figure 6: Percentage of test cases in different libraries.
Figure 7: Percentage of test cases in different perturbation types.

Figure 8 presents the normalized scores for various libraries when processed by two different Large Language Models (LLMs), specifically GPT4 and GPT3.5. On average, GPT4 attains a normalized score of 0.6, surpassing the average score of 0.4 achieved by GPT3.5. An analysis of performance across the five libraries indicates a consistent pattern for both models, with Matplotlib recording the highest normalized score and Pandas the lowest. In light of the lower scores for Pandas, we conducted a closer examination of the cases where the models underperformed. A common issue identified was the challenge in accurately determining the correctness of Pandas test cases, which often require a comparison between two DataFrames. Such comparisons are prone to various errors, including discrepancies in indexes or data types. Nevertheless, these mismatches might not always signify actual failures in a practical context, as they are generally simple to rectify and might not pose a real problem.

Figure 8: Normalized scores of test cases in different libraries.
Figure 9: Normalized scores of test cases in different perturbation types.

Figure 9 displays the normalized scores associated with various perturbation types. It is observable that the performance of GPT3.5 deteriorates as the complexity of the perturbations escalates, which corroborates our initial predictions. Conversely, GPT4’s performance does not mirror that of GPT3.5, particularly in the context of Semantic perturbations. This divergence can be attributed to GPT4’s enhanced semantic comprehension abilities, which enable it to handle such perturbations more adeptly than GPT3.5, suggesting that these modifications do not pose the same level of difficulty for more advanced models.

7.5 Evaluations on the InfiAgent-DABench dataset

All test cases in InfiAgent-DABench involve analyzing the data loaded from a CSV file and then answering one or more questions. Figure 10 shows the distribution of the number of scoring points of the questions in this dataset. Most of the test cases have fewer than 3 questions. We conducted the evaluations with the test cases and collected the scoring points gained by the agent, one scoring point for each question. Then, we calculated the normalized score for each test case, i.e., a value between 0 and 1. Figure 11 shows the normalized scores of test cases with different scoring points. The average normalized scores of GPT4 and GPT3.5 are 0.88 and 0.70, respectively.

Figure 10: Number of test cases with different scoring points.
Figure 11: Normalized scores of test cases with different scoring points.

7.6 Evaluations on DSEval dataset

Dataset   Model    %PASS  %PASS w/o PE  %PASS w/o IV  %PASS w/o PE & IV
Leetcode  GPT-3.5  42.50  60.00         42.50         60.00
Leetcode  GPT-4    75.00  77.50         75.00         77.50
SO        GPT-3.5  37.13  46.04         42.08         50.99
SO        GPT-4    70.79  77.23         73.76         80.20
Kaggle    GPT-3.5  4.80   6.31          5.81          7.32
Kaggle    GPT-4    35.35  41.67         44.19         50.51
Exercise  GPT-3.5  20.32  25.67         21.93         27.27
Exercise  GPT-4    63.64  75.40         66.84         78.61
Table 5: Comparison of GPT-3.5 and GPT-4 results on DSEval Benchmarks.

In this experiment, we conducted a comparative study of GPT-3.5 and GPT-4 across the datasets of the DSEval benchmark, revealing significant enhancements in GPT-4’s performance. %PASS denotes the pass rate, where a problem passes only if TaskWeaver’s execution result is exactly the same as that of the groundtruth code. Comparing the executions of two programs is often hard: even when the agent has fulfilled the user’s request, its final results are not necessarily identical to, or directly comparable with, the groundtruth results. Two common errors lead to this situation: Presentation Errors (PE) and Intact Violations (IV). Presentation Errors manifest when the output is nearly correct but is flawed in terms of formatting or the method of presentation. For instance, an agent might neglect to capitalize a column name as required or mistakenly direct results to the console rather than placing them within the designated cell outputs. Intact Violations occur when the solution closely approaches correctness but fails to maintain the data’s original state, often seen when intermediate computations are necessary and the agent alters the original dataset unnecessarily.

The comparative outcomes, as delineated in Table 5, underscore the substantial strides made by GPT-4 in diminishing the incidence of Presentation Errors (PE) and Intact Violations (IV) across all examined datasets. Notably, within the Leetcode dataset, there was a pronounced increase in the correctness score from 42.50 for GPT-3.5 to 75.00 for GPT-4. Moreover, there was a notable enhancement in the metric representing results unmarred by PE or IV, escalating from 60.00 to 77.50. This pattern of improvement was consistent across additional datasets, including Stack Overflow (SO). While the DSEval benchmark offers a framework for performance evaluation by furnishing ground truth code, it is not without limitations; the rigidity of automated comparisons can lead to undervaluation of the AI’s performance, as it may fail to recognize solutions that, from a human perspective, have effectively fulfilled the task requirements. This highlights the need for a more nuanced evaluation approach that can accurately reflect the practical utility of AI-generated code.

8 Case Studies

In this section, we demonstrate how to use TaskWeaver for practical tasks:

  • Anomaly detection based on data pulled from a database

  • Stock price forecasting

8.1 Task 1: Anomaly Detection

We aim to identify anomalies within a time series dataset stored in an SQL database (sqlite3). To accomplish this goal, we need to integrate the two plugins for data retrieval and time series anomaly detection:

  • pull_data_sql(nl_query) → sql, dataframe: Given a natural language query, pull data from a database and return the result in a pandas DataFrame, along with the generated SQL query.

  • anomaly_detection → dataframe with anomalies, results description: This plugin utilizes a straightforward 3-sigma algorithm to identify any abnormal data points in a time series. These data points deviate from the mean value by more than three standard deviations.

Note: There is no need to provide plugins for general tasks, such as reading or writing files, as TaskWeaver is capable of generating the necessary code for these tasks. This is an advantage of TaskWeaver that reduces the development effort to bootstrap applications.
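The 3-sigma rule used by the anomaly_detection plugin can be sketched in plain Python. This is not the plugin's implementation, which operates on a DataFrame: the function name three_sigma_anomalies is hypothetical, it works on a plain sequence of values, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def three_sigma_anomalies(values: Sequence[float]) -> List[Tuple[int, float]]:
    """Return (index, value) pairs lying more than 3 standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return []  # a constant series has no outliers
    return [
        (index, value)
        for index, value in enumerate(values)
        if abs(value - mu) > 3 * sigma
    ]
```

Note that a single extreme point inflates both the mean and the standard deviation, so with very short series no point can exceed the threshold; the rule is best suited to reasonably long series.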

Fig. 12 in Appendix A illustrates the detailed plan generated by the Planner of TaskWeaver:

  1. Instruct CodeInterpreter to pull data from the time_series table in the database.

  2. Confirm the columns in which anomalies should be detected.

  3. Instruct CodeInterpreter to detect anomalies on the pulled data.

  4. Report the detected anomalies to the user.

In the first step, the Code Interpreter invoked the pull_data_sql plugin function and automatically filled the required parameters.

from typing import Tuple
import pandas as pd

query = "SELECT * FROM time_series"
df, description = sql_pull_data(query)
df, description

After obtaining the time series data from the database, TaskWeaver asked the user to provide the column names as additional information to proceed with anomaly detection. The user then provided the "ts" and "val" column names to detect anomalies. With the additional information confirmed, TaskWeaver proceeded to execute the third step, i.e., detecting anomalies using the anomaly_detection plugin.

time_col_name = 'ts'
value_col_name = 'val'

anomaly_df, anomaly_description = anomaly_detection(df, time_col_name, value_col_name)
anomaly_df, anomaly_description

In the end, 11 anomalies were detected and a response was composed for the user.

8.2 Task 2: Stock Price Forecasting

The user requested TaskWeaver to forecast the Nasdaq 100 (QQQ) index prices for the next seven days with the ARIMA model. The detailed execution process is shown in Fig.13 in Appendix A. Here is the plan generated by the TaskWeaver Planner:

  1. Instruct CodeInterpreter to obtain historical QQQ stock price data.

  2. Preprocess the data.

  3. Train the ARIMA model.

  4. Forecast the stock price for the next 7 days.

  5. Report the predicted stock prices to the user.

After receiving the first subtask from the Planner, the Code Interpreter proceeded to generate a code snippet for the first step. The Code Interpreter thought to itself, "ProgramApe (internal code name) will generate a code snippet to load QQQ historical data using pandas_datareader library." The resulting code snippet is as follows:

import pandas_datareader as pdr
import datetime
start_date = datetime.datetime.now() - datetime.timedelta(days=365)
end_date = datetime.datetime.now()
qqq_data = pdr.get_data_yahoo('QQQ', start=start_date, end=end_date)
qqq_data

Nonetheless, the execution encountered a failure and the following error information was reported:

The execution of the generated python code above has failed
During execution, the following messages were logged:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[1], line 7
4 start_date = datetime.datetime.now() - datetime.timedelta(days=365)
5 end_date = datetime.datetime.now()
----> 7 qqq_data = pdr.get_data_yahoo(’QQQ’, start=start_date, end=end_date)
8 qqq_data
File \path\to\python\Lib\site-packages\pandas_datareader\data.py:80, in
get_data_yahoo(*args, **kwargs)
79 def get_data_yahoo(*args, **kwargs):
---> 80 return YahooDailyReader(*args, **kwargs).read()
...

As a result, the Code Interpreter had to modify the code to resolve this issue.

# Import yfinance or install it if it's not available
try:
    import yfinance as yf
except ImportError:
    !pip install yfinance
    import yfinance as yf

# Load the historical price data for QQQ
qqq_data = yf.download('QQQ')
qqq_data

In the aforementioned code, we can observe that TaskWeaver employed an alternative approach to search for stock price data using the yfinance library. This time, the historical QQQ stock price data was successfully downloaded and the results were forwarded to the Planner for further instructions.

The Planner received the downloaded stock price data and proceeded with the following steps of preprocessing and ARIMA model training. Ultimately, TaskWeaver utilized the fitted ARIMA model to predict the stock prices for the next 7 days.

9 Design Verification

In this section, we use a set of tasks to verify that the TaskWeaver meets our requirements and design considerations.

9.1 Planning

Task 1: ReAct

ReAct [31] merges Reasoning and Acting for LLMs, providing a comprehensive paradigm for generating verbal reasoning traces and actions to accomplish a given task. By leveraging this approach, the system can reason in real time to create, update, and adapt plans while observing the external environment, incorporating additional information into the reasoning process.

To evaluate the ReAct capabilities of TaskWeaver, we adopted an approach inspired by the Challenge Task presented in [2]. This involved generating a collection of N files with the naming convention file1.txt, file2.txt, …, fileN.txt. Each file i contained the instruction "read file{i+1}.txt in the same directory to get the further information." The goal was for TaskWeaver to start by reading file1.txt and progress sequentially through the subsequent files. The final file in the sequence contained instructions to present a specific key to the user.
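A test fixture of this kind is straightforward to generate. The sketch below builds such a chain of files and includes a small reference walker for checking that the chain is well formed; the function names and the exact wording of the final file are assumptions made for illustration.

```python
import re
import tempfile
from pathlib import Path


def build_file_chain(directory: Path, n: int, key: str) -> None:
    """Create file1.txt ... fileN.txt, each pointing to the next; the last holds the key."""
    for i in range(1, n):
        (directory / f"file{i}.txt").write_text(
            f"read file{i + 1}.txt in the same directory to get the further information."
        )
    # Assumed wording for the final instruction.
    (directory / f"file{n}.txt").write_text(f"Show the key to user.\nThe key is {key}.")


def follow_chain(directory: Path, start: str = "file1.txt") -> str:
    """Reference walker: follow the instructions and return the final file's content."""
    current = start
    while True:
        text = (directory / current).read_text()
        match = re.match(r"read (\S+\.txt) in the same directory", text)
        if match is None:
            return text
        current = match.group(1)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_file_chain(Path(tmp), n=3, key="12345")
        print(follow_chain(Path(tmp)))
```

The walker plays the role the agent is expected to fill: it cannot know the length of the chain in advance and must decide its next action from each observation.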

The results of the evaluation are as follows:

Human: read the file_A.txt and follow the instructions in it
......
>>> [PLANNER->CODEINTERPRETER]
Please read the content of the file file_A.txt
......
>>> [CODEINTERPRETER->PLANNER]
The execution of the generated python code above has succeeded
The stdout is:
read file_B.txt in the same directory to get the further information.
......
>>> [PLANNER->CODEINTERPRETER]
Please read the content of the file file_B.txt
......
>>> [CODEINTERPRETER->PLANNER]
The execution of the generated python code above has succeeded
The stdout is:
read file_C.txt in the same directory to get the further information.
......
>>> [PLANNER->CODEINTERPRETER]
Please read the content of the file file_C.txt
>>> [CODEINTERPRETER->PLANNER]
The execution of the generated python code above has succeeded
The result of above Python code after execution is: Show the key to user.
The key is 12345.
......
>>> [PLANNER->USER]
The key is 12345.

In the depicted process, the Planner component of TaskWeaver initially broke down the task into two steps and initiated the execution of the first step. This initial phase involved reading the contents of file_A.txt. Subsequently, the Code Interpreter received the Planner’s request and generated a Python code snippet designed to extract the content from the specified file. The output of the executed code contained the instruction "read file_B.txt in the same directory to get the further information."

Upon receiving this instruction, the Planner updated the plan, incorporating the directive to load the next file in the sequence. This iterative process continued until TaskWeaver successfully navigated through the entire chain of files, culminating in the retrieval of the final message contained in the last file. This concluding message held instructions to present the key to the user, signifying the completion of the task.

Task 2: Plan Decomposition

The Planner is capable of breaking down a task into multiple subtasks and delegating them to the Code Interpreter for execution one by one. However, given the cost and latency of LLMs, it is not practical to invoke an LLM many times, particularly for simple tasks. For this reason, we devised a two-phase planning approach that mandates a review of the generated initial plan in order to merge sub-tasks with sequential dependencies (refer to Sec. 4.4 for more details).

As an illustration, we tasked TaskWeaver with computing the mean and standard deviation of the ’Count’ column in the file "/sample_data/demo_data.csv" and subsequently verifying the presence of any values exceeding 3 standard deviations from the mean. The initial plan generated by the Planner is as follows:

  • 1. Load the data file

  • 2. Calculate the mean and std of the ’Count’ column <sequentially depends on 1>

  • 3. Check if there are any values larger than 3 std from the mean <sequentially depends on 2>

  • 4. Report the result to the user <interactively depends on 3>

It is noteworthy that the first and second steps can be combined into a single step since they can be accomplished in a single snippet of code without any interventions. As a result, the Planner merged them into one: “Instruct CodeInterpreter to load the data file and calculate the mean and std of the ’Count’ column.” By doing so, we were able to reduce the number of LLM calls from 6 to 3 for the data loading and the mean/std value calculation.

9.2 Coding and Execution

Task 1: Plugin-Only Mode

In TaskWeaver, we offer a plugin-only mode, which exclusively permits the invocation of plugin functions and forbids the generation of arbitrary code. To validate this feature, we activated the plugin-only mode in the configuration file and instructed TaskWeaver to find out the current date and time. Since TaskWeaver can only call plugin functions in this mode, the Planner declined the request.

Human: generate and execute python code to get the current time
......
>>> [PLANNER->USER]
I'm sorry, but as a Planner, I do not have the capability to execute Python code. You can run the provided code in your local Python environment to get the current time.
......
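One way such a restriction could be checked is a static pass over the generated code that accepts only calls to known plugin functions. The text does not describe the mechanism at this point, so the following is a sketch under that assumption; the function name `find_plugin_only_violations` and the plugin set are hypothetical.

```python
import ast


def find_plugin_only_violations(code: str, allowed_plugins: set) -> list:
    """Return the reasons why `code` breaks an assumed plugin-only policy."""
    violations = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            violations.append("import statement")
        elif isinstance(node, ast.Call):
            # Only bare-name calls to known plugins are accepted.
            is_plugin_call = (
                isinstance(node.func, ast.Name) and node.func.id in allowed_plugins
            )
            if not is_plugin_call:
                violations.append(f"non-plugin call: {ast.unparse(node.func)}")
    return violations


plugins = {"anomaly_detection"}  # hypothetical plugin registry
ok = find_plugin_only_violations("result = anomaly_detection(df, 'ts', 'val')", plugins)
bad = find_plugin_only_violations(
    "import datetime\nprint(datetime.datetime.now())", plugins
)
print(ok)
print(bad)
```

An allow-list is used here because it fails closed: anything that is not explicitly a plugin call is rejected.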

Task 2: Stateful Execution

The Code Interpreter is stateful, meaning it maintains execution states and variables within the same session. For example, we initially asked TaskWeaver to display the column names of ./sample_data/demo_data.csv. Subsequently, we instructed TaskWeaver to execute an unrelated task, such as “generate 10 random numbers.” Afterward, we requested the mean value of the “Count” column in the previously loaded data. TaskWeaver identified the need to use data from the earlier chat round and delivered an appropriate response.
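The idea of statefulness can be sketched with a session object that executes every snippet against one shared namespace. This is only an illustration of the concept, not TaskWeaver's executor; the class `StatefulSession` and the toy data are assumptions.

```python
class StatefulSession:
    """Keeps one namespace alive across snippets, similar in spirit to a notebook kernel."""

    def __init__(self):
        self.namespace = {}

    def run(self, code: str) -> None:
        # Every snippet executes against the same dictionary, so variables persist.
        exec(code, self.namespace)


session = StatefulSession()
# Round 1: load some data (toy rows instead of the CSV file).
session.run("rows = [{'Count': 4}, {'Count': 6}, {'Count': 8}]")
# Round 2: an unrelated task.
session.run("import random\nnumbers = [random.random() for _ in range(10)]")
# Round 3: reuse the data loaded in round 1.
session.run("mean_count = sum(r['Count'] for r in rows) / len(rows)")
print(session.namespace["mean_count"])
```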

Task 3: Auto Correction

The Code Interpreter can make mistakes while generating Python code, causing execution to fail. To address this, we prompt the Code Interpreter to revise its code based on the reported error information. We asked TaskWeaver to calculate the mean value of sample_data/demo_data.csv. The Code Interpreter initially generated incorrect code because it did not know the data schema.

import pandas as pd

data_file_path = '../../../sample_data/demo_data.csv'
df = pd.read_csv(data_file_path)
mean_value = df.mean()
mean_value

After revision, the code automatically restricts the computation to the columns with numerical data types and calculates their mean values successfully.

import pandas as pd

# Load the data file
data_file_path = '../../../sample_data/demo_data.csv'
df = pd.read_csv(data_file_path)

# Calculate the mean value of the loaded data
mean_value = df.mean(numeric_only=True)
mean_value
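The revise-on-error loop described above can be sketched as follows. This is a simplified illustration, not the framework's implementation: `run_with_auto_correction` is a hypothetical helper, and `stub_revise` stands in for the LLM call that would receive the failing code and the error message.

```python
from typing import Callable, Dict, Tuple


def run_with_auto_correction(
    code: str, revise: Callable[[str, str], str], max_attempts: int = 3
) -> Tuple[Dict, int]:
    """Execute `code`; on failure, ask `revise` for a new version and retry."""
    for attempt in range(1, max_attempts + 1):
        namespace: Dict = {}
        try:
            exec(code, namespace)
            return namespace, attempt
        except Exception as exc:
            # The error text is what would be fed back to the code generator.
            error_message = f"{type(exc).__name__}: {exc}"
            code = revise(code, error_message)
    raise RuntimeError(f"code still failing after {max_attempts} attempts")


def stub_revise(code: str, error_message: str) -> str:
    # Hypothetical stand-in for the LLM: returns a fixed corrected snippet.
    return (
        "values = [1, 2, 'n/a', 3]\n"
        "numeric = [v for v in values if isinstance(v, (int, float))]\n"
        "mean_value = sum(numeric) / len(numeric)\n"
    )


# The first attempt fails with a TypeError because of the non-numeric entry.
first_attempt = "values = [1, 2, 'n/a', 3]\nmean_value = sum(values) / len(values)\n"
namespace, attempts = run_with_auto_correction(first_attempt, stub_revise)
print(namespace["mean_value"], attempts)
```

Capping the number of attempts keeps a persistently wrong generator from looping forever.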

9.3 Safety

Task 1: Preventing File Deletion and Secret Key Leakage

To ensure the safety and security of the execution environment, we have developed a restricted list that blocks certain sensitive operations and can be customized by developers. When asked to delete a file in the system folder or to retrieve the secret key from global environment variables, TaskWeaver declines to execute these tasks, as they are deemed high-risk operations.
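As a rough sketch of how a restricted list might be applied, the snippet below scans generated code for imports of blocked modules before execution. The contents of the list and the function name are assumptions made for illustration, and a static check of this kind can be circumvented (for example via `__import__`), so it would complement rather than replace an isolated execution environment.

```python
import ast

# Assumed example of a developer-customizable restricted list.
BLOCKED_MODULES = {"os", "shutil", "subprocess"}


def find_restricted_imports(code: str, blocked_modules=BLOCKED_MODULES) -> list:
    """Return the blocked top-level modules that `code` tries to import."""
    findings = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            modules = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [(node.module or "").split(".")[0]]
        else:
            continue
        findings.extend(m for m in modules if m in blocked_modules)
    return findings


delete_file = "import os\nos.remove('/etc/hosts')"
read_secret = "from os import environ\nprint(environ['SECRET_KEY'])"
harmless = "import math\nprint(math.sqrt(2))"
print(find_restricted_imports(delete_file))
print(find_restricted_imports(read_secret))
print(find_restricted_imports(harmless))
```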

10 Related Work

LLM and Prompt Engineering

Recent advancements in natural language processing have been driven by large language models (LLMs) [34] such as GPT [12], GPT-4 [21], Palm [10], and Llama [26]. These models have not only revolutionized the field of natural language processing, but also how humans interact with machines through applications such as ChatGPT. LLMs are pre-trained on a vast amount of text data and then fine-tuned with reinforcement learning from human feedback (RLHF) and Instruction Fine-Tuning (IFT) [12] to improve their response quality. To improve the performance of LLMs on reasoning and decision-making tasks, various prompt engineering methods have been proposed, including Chain-of-Thought (CoT) [28], zero-shot-CoT [17], and ReAct [31]. Some of these approaches have also been applied in TaskWeaver to enhance its performance.

Agent

Recently, LLM-based agents have gained increasing attention. The fundamental concept is to utilize LLMs as the core controller to make human-like decisions by observing the environment, planning, and taking actions [6]. There are generally two types of agent systems: single-agent and multi-agent systems. Single-agent systems, such as AutoGPT [1] and LangChain Agents [5], focus on planning, observing, and acting within the capability of a single agent. Multi-agent systems, by contrast, concentrate on having multiple agents work collaboratively. Typical examples include BabyAGI [3], MetaGPT [14], AutoGen [29], CAMEL [19], and Multi-agent Debate (MAD) [20]. TaskWeaver is a single-agent framework that focuses on converting user requests into code, even for plugin calls.

11 Conclusion

In this paper, we introduced TaskWeaver, a code-first framework for building LLM-powered autonomous agents that addresses the limitations of existing frameworks in handling rich data structures, incorporating domain knowledge, and offering flexibility. TaskWeaver’s standout feature is its ability to convert user requests into executable code while treating user-defined plugins as callable functions. This approach enables the seamless integration of plugin execution with custom code execution, catering to the diverse requirements of users and providing a more intuitive user experience. We presented the design and implementation of TaskWeaver, highlighting its support for complex data structures, flexible plugin usage, and intelligent task planning. We also demonstrated TaskWeaver’s ability to leverage the coding capability of LLMs to implement complex logic and incorporate domain-specific knowledge through examples. Furthermore, we discussed the efforts made towards the secure execution of generated code and the provision of an easy-to-use interface for developers. Through various case studies, we showcased the effectiveness of TaskWeaver in handling different tasks. Overall, TaskWeaver offers a powerful and flexible solution for building intelligent conversational agents. As LLMs continue to evolve and improve, TaskWeaver can facilitate more advanced and sophisticated applications.

References

  • [1] Autogpt. Available at: https://github.com/Significant-Gravitas/AutoGPT. Accessed on [11/22/2023].
  • [2] Autogpt challenge. Available at: https://github.com/Significant-Gravitas/AutoGPT/blob/master/docs/content/challenges/memory/challenge_a.md. Accessed on [11/22/2023].
  • [3] Babyagi. Available at: https://github.com/yoheinakajima/babyagi. Accessed on [11/22/2023].
  • [4] Jarvis. Available at: https://github.com/microsoft/JARVIS. Accessed on [11/22/2023].
  • [5] Langchain. Available at: https://www.langchain.com/. Accessed on [11/22/2023].
  • [6] Llm powered autonomous agents. Available at: https://lilianweng.github.io/posts/2023-06-23-agent/. Accessed on [11/22/2023].
  • [7] Openinterpreter. Available at: https://github.com/OpenInterpreter/open-interpreter. Accessed on [05/08/2024].
  • [8] Semantic kernel. Available at: https://github.com/microsoft/semantic-kernel. Accessed on [11/22/2023].
  • [9] Transformers agents. Available at: https://huggingface.co/docs/transformers/transformers_agents. Accessed on [11/22/2023].
  • [10] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  • [11] Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Benjamin Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. A general language assistant as a laboratory for alignment. CoRR, abs/2112.00861, 2021.
  • [12] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • [13] Sirui Hong, Yizhang Lin, Bang Liu, Bangbang Liu, Binhao Wu, Danyang Li, Jiaqi Chen, Jiayi Zhang, Jinlin Wang, Li Zhang, Lingyao Zhang, Min Yang, Mingchen Zhuge, Taicheng Guo, Tuo Zhou, Wei Tao, Wenyi Wang, Xiangru Tang, Xiangtao Lu, Xiawu Zheng, Xinbing Liang, Yaying Fei, Yuheng Cheng, Zongze Xu, and Chenglin Wu. Data interpreter: An llm agent for data science, 2024.
  • [14] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
  • [15] Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Qianli Ma, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, Jianbo Yuan, Jiwei Li, Kun Kuang, Yang Yang, Hongxia Yang, and Fei Wu. Infiagent-dabench: Evaluating agents on data analysis tasks, 2024.
  • [16] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mixtral of experts, 2024.
  • [17] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
  • [18] Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-Tau Yih, Daniel Fried, Sida Wang, and Tao Yu. Ds-1000: A natural and reliable benchmark for data science code generation. ArXiv, abs/2211.11501, 2022.
  • [19] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large scale language model society. arXiv preprint arXiv:2303.17760, 2023.
  • [20] Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118, 2023.
  • [21] OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023.
  • [22] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding with unsupervised learning. OpenAI Blog, 2018.
  • [23] Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, Börje F. Karlsson, Bo An, and Zongqing Lu. Towards General Computer Control: A Multimodal Agent For Red Dead Redemption II As A Case Study. arXiv preprint arXiv:2403.03186, 2024.
  • [24] Gemini Team. Gemini: A family of highly capable multimodal models, 2024.
  • [25] XAgent Team. Xagent: An autonomous agent for complex task solving, 2023.
  • [26] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • [27] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432, 2023.
  • [28] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  • [29] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. 2023.
  • [30] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864, 2023.
  • [31] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
  • [32] Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, and Qi Zhang. UFO: A UI-Focused Agent for Windows OS Interaction. arXiv preprint arXiv:2402.07939, 2024.
  • [33] Yuge Zhang, Qiyang Jiang, Xingyu Han, Nan Chen, Yuqing Yang, and Kan Ren. Benchmarking data science agents, 2024.
  • [34] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
  • [35] Wangchunshu Zhou, Yuchen Eleanor Jiang, Long Li, Jialong Wu, Tiannan Wang, Shi Qiu, Jintian Zhang, Jing Chen, Ruipu Wu, Shuai Wang, et al. Agents: An open-source framework for autonomous language agents. arXiv preprint arXiv:2309.07870, 2023.

Appendix A Case Study Results

Figure 12: Case Study - Task 1: Anomaly Detection

Figure 13: Case Study - Task 2: Stock Price Forecasting

Appendix B Planning Example

The following is a planning example for the user query of counting the rows of a CSV file. The example consists of 4 posts:

  1. This is the post from the user containing the user query.

  2. This is the post from the Planner to the Code Interpreter. The post contains 3 attachments that are the initial plan, the plan (final), and the current step in the plan.

  3. This is the post from the Code Interpreter to the Planner on the execution result.

  4. This is the post from the Planner to the User, i.e., the response after successfully conducting the query.

Together, these 4 posts constitute one round of conversation between the Planner and the User.

user_query: count the rows of /home/data.csv
post_list:
- message: count the rows of /home/data.csv
  send_from: User
  send_to: Planner
  attachment_list:
- message: Please load /home/data.csv and count the rows
  send_from: Planner
  send_to: CodeInterpreter
  attachment_list:
  - type: init_plan
    content: |-
      1. load the data file
      2. count the rows of the loaded data <sequential depend on 1>
      3. report the result to the user <interactive depend on 2>
  - type: plan
    content: |-
      1. instruct CI to load the file and count the rows
      2. report the result to the user
  - type: current_plan_step
    content: 1. instruct CI to load the file and count the rows
- message: Load successfully and there are 100 rows
  send_from: CodeInterpreter
  send_to: Planner
  attachment_list:
- message: The file is loaded and there are 100 rows
  send_from: Planner
  send_to: User
  attachment_list:
  - type: init_plan
    content: |-
      1. load the data file
      2. count the rows of the loaded data <sequential depend on 1>
      3. report the result to the user <interactive depend on 2>
  - type: plan
    content: |-
      1. instruct CI to load the file and count the rows
      2. report the result to the user
  - type: current_plan_step
    content: 2. report the result to the user
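The relationship between init_plan and plan in this example can be mimicked with a small deterministic rule: a step annotated as sequentially dependent joins the group of the step it depends on, while an interactively dependent step starts a new group. In TaskWeaver the merge is performed by the Planner through prompting, so the sketch below only illustrates the rule; the function name and the parsing regex are assumptions.

```python
import re

STEP_PATTERN = re.compile(
    r"(\d+)\.\s*(.*?)\s*(?:<(sequential|interactive) depend on (\d+)>)?$"
)


def merge_sequential_steps(init_plan: list) -> list:
    """Group steps so that a '<sequential depend on N>' step joins the group of step N."""
    groups = []          # each group would become one Code Interpreter instruction
    group_of_step = {}   # step number -> index into `groups`
    for line in init_plan:
        match = STEP_PATTERN.match(line.strip())
        if match is None:
            raise ValueError(f"unrecognized plan step: {line!r}")
        number, text = int(match.group(1)), match.group(2)
        kind, parent = match.group(3), match.group(4)
        if kind == "sequential" and int(parent) in group_of_step:
            index = group_of_step[int(parent)]
            groups[index].append(text)
        else:
            groups.append([text])
            index = len(groups) - 1
        group_of_step[number] = index
    return groups


init_plan = [
    "1. load the data file",
    "2. count the rows of the loaded data <sequential depend on 1>",
    "3. report the result to the user <interactive depend on 2>",
]
merged = merge_sequential_steps(init_plan)
print(merged)
```

The two resulting groups correspond to the two steps of the final plan in the example above.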

Appendix C Code Generation Example

The following is a code generation example for the user query of generating 10 random numbers. This example contains 2 posts:

  1. This is the post from the user containing the user query.

  2. This is the post from the Code Interpreter to the Planner including thoughts, the generated code, and the verification and execution results.

user_query: generate 10 random numbers
post_list:
- message: generate 10 random numbers
  send_from: Planner
  send_to: CodeInterpreter
  attachment_list: []
- message: The random numbers are 0.2, 0.4, 0.6, ...
  send_from: CodeInterpreter
  send_to: Planner
  attachment_list:
  - type: thought
    content: CI will generate 10 random numbers using np.random.
  - type: python
    content: |-
      import numpy as np
      random_numbers = np.random.rand(10)
      random_numbers
  - type: verification
    content: CORRECT
  - type: code_error
    content: No error is detected.
  - type: execution_status
    content: SUCCESS
  - type: execution_result
    content: The random numbers are 0.2, 0.4, 0.6, ...

Appendix D Plugin Example

The following is an example of the specification of a plugin named ‘anomaly_detection’. It has 3 input arguments (df, ts_col, and val_col), and 2 return values (df and description).

name: anomaly_detection
description: >-
  anomaly_detection function identifies anomalies from an input
  DataFrame of time series. It will add a new column "Is_Anomaly",
  where each entry will be marked with "True"
  if the value is an anomaly or "False" otherwise.

parameters:
  - name: df
    type: DataFrame
    required: true
    description: >-
      the input data from which we can identify the anomalies
      with the 3-sigma algorithm.
  - name: ts_col
    type: str
    required: true
    description: name of the column that contains the datetime
  - name: val_col
    type: str
    required: true
    description: name of the column that contains the numeric values.

returns:
  - name: df
    type: DataFrame
    description: >-
      This DataFrame extends the input DataFrame with a newly-added
      column "Is_Anomaly" containing the anomaly detection result.
  - name: description
    type: str
    description: a string describing the anomaly detection results.
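Since plugins are treated as callable functions, a specification like this one can be read as a function signature. The sketch below renders such a signature from a dictionary that mirrors part of the specification; the rendering format and the helper `render_signature` are assumptions for illustration and not necessarily how TaskWeaver presents plugins to the LLM.

```python
# Dictionary mirroring the name, parameters, and returns of the specification above.
spec = {
    "name": "anomaly_detection",
    "parameters": [
        {"name": "df", "type": "DataFrame"},
        {"name": "ts_col", "type": "str"},
        {"name": "val_col", "type": "str"},
    ],
    "returns": [
        {"name": "df", "type": "DataFrame"},
        {"name": "description", "type": "str"},
    ],
}


def render_signature(spec: dict) -> str:
    """Render a plugin specification as a one-line Python-style signature."""
    params = ", ".join(f"{p['name']}: {p['type']}" for p in spec["parameters"])
    returns = ", ".join(r["type"] for r in spec["returns"])
    return f"def {spec['name']}({params}) -> Tuple[{returns}]: ..."


signature = render_signature(spec)
print(signature)
```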

The following shows the Python implementation of the anomaly_detection plugin.

import pandas as pd
from pandas.api.types import is_numeric_dtype


def __call__(self, df: pd.DataFrame, ts_col: str, val_col: str):
    # Parse the timestamp column; give up if it cannot be converted.
    try:
        df[ts_col] = pd.to_datetime(df[ts_col])
    except Exception:
        print("Time column is not datetime")
        return

    # Coerce the value column to float if it is not already numeric.
    if not is_numeric_dtype(df[val_col]):
        try:
            df[val_col] = df[val_col].astype(float)
        except ValueError:
            print("Value column is not numeric")
            return

    # 3-sigma rule: flag values outside mean +/- 3 * std.
    mean, std = df[val_col].mean(), df[val_col].std()
    cutoff = std * 3
    l, u = mean - cutoff, mean + cutoff
    df["Is_Anomaly"] = df[val_col].apply(lambda x: x < l or x > u)
    anomaly_count = df["Is_Anomaly"].sum()
    desc = f"There are {anomaly_count} anomalies in the data"

    return df, desc

Appendix E DS-1000 Test Case Example

The following list shows a transformed test case from the DS-1000 dataset. We put the original problem description inside the <TASK DESCRIPTION> block and added a static header to the test case that explains the task of completing the sample code, running it, and finally presenting the code back.

The task is to complete the sample code described in the <TASK DESCRIPTION> block below. Complete the code, run it successfully, and finally present the code back. Please "copy and paste" the following task description in your request to ensure that the task description is correct and complete.

<TASK DESCRIPTION>
# Problem
I have the following DataFrame:
    Col1  Col2  Col3  Type
0      1     2     3     1
1      4     5     6     1
2      7     8     9     2
3     10    11    12     2
4     13    14    15     3
5     16    17    18     3


The DataFrame is read from a CSV file. All rows which have Type 1 are on top, followed by the rows with Type 2, followed by the rows with Type 3, etc.
I would like to shuffle the order of the DataFrame rows according to a list.
For example, give a list [2, 4, 0, 3, 1, 5] and desired result should be:
    Col1  Col2  Col3  Type
2      7     8     9     2
4     13    14    15     3
0      1     2     3     1
3     10    11    12     2
1      4     5     6     1
5     16    17    18     3
...


How can I achieve this?


# Solution
The following is the solution code to the problem statement provided above.
You must complete the code by filling in the missing parts between `### SOLUTION START` and `### SOLUTION END`.
You must keep any code outside of `### SOLUTION START` and `### SOLUTION END` untouched.
Once you have completed the code, run it to check if your solution is correct.
Make sure you keep `### SOLUTION START` and `### SOLUTION END` along with your solution code.


```python

import pandas as pd
import numpy as np


df = pd.DataFrame({'Col1': [1, 4, 7, 10, 13, 16],
                   'Col2': [2, 5, 8, 11, 14, 17],
                   'Col3': [3, 6, 9, 12, 15, 18],
                   'Type': [1, 1, 2, 2, 3, 3]})
List = np.random.permutation(len(df))
### SOLUTION START
result = ... # put solution in this variable
### SOLUTION END
```

</TASK DESCRIPTION>
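Because the agent is asked to present the completed code back with the markers intact, the solution can be recovered from its reply by looking between them. The helper below is a hypothetical sketch of such an extraction step, not part of the benchmark tooling described here; the completion `df.iloc[List]` is just one plausible answer used as sample input.

```python
import re


def extract_solution(code: str) -> str:
    """Return the text between the SOLUTION START and SOLUTION END markers."""
    match = re.search(
        r"### SOLUTION START\n(.*?)\n?### SOLUTION END", code, flags=re.DOTALL
    )
    if match is None:
        raise ValueError("solution markers not found")
    return match.group(1).strip()


# Sample reply fragment with one plausible completion filled in.
completed = (
    "List = np.random.permutation(len(df))\n"
    "### SOLUTION START\n"
    "result = df.iloc[List]\n"
    "### SOLUTION END\n"
)
solution = extract_solution(completed)
print(solution)
```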

Appendix F InfiAgent-DABench Test Case Example

The following list shows an example of the transformed test case of the InfiAgent-DABench dataset. Each test case involves loading a CSV file and then asking one or more questions concerning the data in the file. The transformation is quite straightforward, only adding the Task section to the original description of the problem.

# Task
Load the file test_ave.csv and answer the following questions.

# Question
Calculate the mean fare paid by the passengers.

# Constraints
Calculate the mean fare using Python built-in statistics module or appropriate statistical method in pandas. Rounding off the answer to two decimal places.

# Format
@mean_fare[mean_fare_value] where "mean_fare_value" is a floating-point number rounded to two decimal places.
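An answer in the `@name[value]` format above can be pulled out of a free-text response with a regular expression. The following is a minimal sketch; `parse_answers` is a hypothetical helper, and the response string with its value 34.65 is made up purely to exercise the parser.

```python
import re


def parse_answers(response: str) -> dict:
    """Collect every '@name[value]' pair found in the response text."""
    return {
        name: value
        for name, value in re.findall(r"@(\w+)\[([^\]]*)\]", response)
    }


# Made-up response text; the number is not a real benchmark result.
answers = parse_answers("The mean fare is @mean_fare[34.65].")
mean_fare = float(answers["mean_fare"])
print(answers, mean_fare)
```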