SwissNYF: Tool Grounded LLM Agents for Black Box Setting

Somnath Sendhil Kumar¹ Dhruv Jain¹ Eshaan Agarwal¹ Raunak Pandey¹
¹ Intelligence Group, IIT (BHU), Varanasi

Abstract

While Large Language Models (LLMs) have demonstrated enhanced capabilities in function-calling, these advancements primarily rely on accessing the functions’ responses. This methodology is practical for simpler APIs but faces scalability issues with irreversible APIs that significantly impact the system, such as a database deletion API. Similarly, processes requiring extensive time for each API call and those necessitating forward planning, like automated action pipelines, present complex challenges. Furthermore, scenarios often arise where a generalized approach is needed because algorithms lack direct access to the specific implementations of these functions or to the secrets required to use them. Traditional tool planning methods are inadequate in these cases, compelling the need to operate within black-box environments. In contrast to their performance in tool manipulation, LLMs excel in black-box tasks such as program synthesis. Therefore, we harness the program synthesis capabilities of LLMs to strategize tool usage in black-box settings, ensuring solutions are verified prior to implementation. We introduce TOPGUN, an approach that leverages program synthesis for black-box tool planning, accompanied by SwissNYF, a comprehensive suite that integrates black-box algorithms for planning and verification tasks, addressing the aforementioned challenges and enhancing the versatility and effectiveness of LLMs in complex API interactions. The public code for SwissNYF is available at https://github.com/iclr-dummy-user/SwissNYF

1 Introduction

Figure 1: Illustration of the different settings in which an LLM may be required to manipulate tools.

Significant advancements in Large Language Models (LLMs) like GPT (Radford et al. (2018); Radford et al. (2019); Brown et al. (2020); Achiam et al. (2023)) and PaLM (Chowdhery et al. (2023); Anil et al. (2023)) have demonstrated profound abilities in reasoning and following instructions over an extensive array of tasks Huang & Chang (2023). The recent shift towards leveraging LLMs to interact with external tools for addressing complex real-world challenges marks a significant area of interest (Hao et al. (2023); Zhang et al. (2023a); Zhuang et al. (2023b); Yang et al. (2023); Schick et al. (2023); Lu et al. (2023a)). In addressing intricate problems, autonomous agents powered by LLMs combine LLMs with various external tools (APIs), crafting solutions that require a sequence of intermediate reasoning steps (Schick et al. (2023); Lu et al. (2023a); Patil et al. (2023); Qin et al. (2023)). When presented with a problem, these agents’ primary objective is to identify and execute a series of API function calls sequentially, leading to a coherent solution. These approaches are ineffective when queries lack transparency or when the APIs are irreversible.

We coin the term "black-box" settings in the context of tool planning as scenarios where the outcomes of an API or tool are not observable. This framework is especially pertinent in systems where using certain APIs poses risks, such as those causing inconsistencies by deleting or updating database entries, canceling jobs, or performing similar operations. It is also relevant where API experimentation incurs high costs or where APIs require considerable time to execute, making it challenging to observe and interpret their outcomes. We present a taxonomy of such systems (Fig. 1) with three branches:

1. White Box Systems: In these settings, planners can invoke the API, receive responses, access the source code and understand its complex logic. This access enables the system to navigate complex inputs, intricacies and use cases efficiently.
2. Gray Box Systems: Planners in these environments have descriptions of the tools at their disposal and the capability to call the API and receive responses. The system’s planning relies solely on the limited descriptions provided and the responses for each tool.
3. Black Box Systems: In the most challenging scenarios, planners are confined to tool descriptions without access to actual tool outputs. Here, the planner must decipher the dynamics of each tool based solely on its description, making it a particularly demanding task to formulate responses to queries.

The Zhuang et al. (2023a) and Qin et al. (2023) methods excel in straightforward scenarios where an agent can iterate over tools to identify the optimal path, yet they lack efficiency and necessitate extensive exploration. Approaches like Yao et al. (2022) and Parisi et al. (2022), subsets of this exploratory paradigm, offer enhanced efficiency yet frequently falter due to their constrained directionality in tool search, making them suitable predominantly for straightforward API challenges. In contrast, the Zhang et al. (2023b) approach is efficient regarding API execution costs by constraining the number of calls. However, it omits any form of verification for its proposed trajectory, diminishing its precision in practical applications.

These methodologies in tool application present a trade-off between accuracy and computational overhead. While generally unsuitable for black-box settings, the Reverse Chain approach Zhang et al. (2023b) exhibits potential for adaptation within such frameworks. On the other hand, program synthesis-based algorithms have been instrumental in enhancing reasoning and decision-making capabilities within LLMs, offering a more naturally associative decision-making process than that afforded by mere text. Chain of Code Li et al. (2023) and Program of Thoughts Chen et al. (2022) are good examples of using code generation to improve decision-making for answering general open-domain questions. Several works have also strengthened the reasoning capability of LLMs using code: ToRA Gou et al. (2023), code-based self-verification with the GPT-4 code interpreter Zhou et al. (2023b), and PAL Gao et al. (2023) exploit code interpreters for zero-shot verified solving, substantially surpassing few-shot learning benchmarks by enabling semi-verification of proposed solutions.

However, works like Paranjape et al. (2023), which employs code synthesis for tool usage, are restricted by their limited toolset and by the scalability challenge of requiring extensive human feedback and intervention from an expert who is familiar with the whole toolset. Similarly, works such as Xu et al. (2023), which deploys language models for real-time code generation and command execution within controlled environments, are limited by their narrow tool range and a deficit in generalizability. State-of-the-art approaches on the HumanEval Chen et al. (2021) and HumanEval-X Zheng et al. (2023) code generation datasets, such as Reflexion Shinn et al. (2023) and LATS Zhou et al. (2023a), iterate on code based on interpreter outputs and reflect over them, but have yet to be explored in other domains associated with LLMs.

To bridge these gaps, we introduce the TOPGUN (Tool Orchestration and Program synthesis for Generalizing over UNknown systems) framework, which unifies code generation, reasoning, and strategic tool planning designed for complex tasks. TOPGUN also verifies the execution plans and does so with exceptional efficiency in API cost, effectively addressing the limitations of preceding models.

Key contributions of our work are summarized as follows:

1. To the best of our knowledge, we are the first to coin the term Black Box setting for API usage, and we develop a suite to encourage the development of algorithms for such scenarios.
2. We leverage the program synthesis capabilities of Large Language Models (LLMs) to augment their efficacy in tool usage substantially, showcasing a notable enhancement in performance.
3. We present a robust and cost-efficient framework for scalable solutions across a wide array of open-domain queries, even when faced with limited knowledge of user data/tools. It is also publicly hosted to demonstrate the same (https://swiss-nyf.azurewebsites.net/).

This paper details our methodology and its evaluation by first providing background on tool planning (Section 2.1) and code generation using LLMs (Section 2.2), followed by the individual components of the pipeline (Section 3). Our evaluation is divided into two segments: first, we undertake a gray box evaluation (Section 4.1) across principal datasets, and subsequently we examine a black box setting (Section 4.2). For the latter, we have curated a bespoke dataset employing ToolBench prompts, intentionally adjusting the dataset to include only limited documentation of widely used libraries. This adjustment aims to validate the generalizability of our approach. Additionally, we compare our methodology with a tailored variant of the Reverse Chain method to examine performance disparities.

2 Preliminaries

2.1 Problem Formulation

Tool planning within the context of a Large Language Model (LLM), denoted as $\rho$, involves leveraging a selection of tools from a pool of $n$ candidate tools in the corpus $\mathcal{C}$, represented as $\mathcal{C}=\{t_{1},t_{2},\ldots,t_{n}\}$, to effectively address a user’s query $q$. The primary goal is to formulate a plan, known as the Solution Trajectory $St$, for the orchestration of these tools. The Solution Trajectory $St$, which outlines the sequential execution of tools, is crafted to directly address the query $q$. The LLM agent, or planner $\mathcal{G}$, is responsible for planning or generating $St$ from $\mathcal{C}$, formalized as $St\leftarrow\mathcal{G}(q,\rho,\mathcal{C})$. This process ensures a structured and coherent response strategy, aligning the tools’ capabilities with the query’s specific requirements.
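As a rough illustration of this formulation, the sketch below models the corpus $\mathcal{C}$, the query $q$, and the Solution Trajectory $St$ as plain Python types. The names `Tool`, `Step`, and `plan` are hypothetical and are not taken from SwissNYF, and the stand-in model is assumed to answer with a comma-separated list of tool names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Tool:
    """One entry t of the corpus C (hypothetical structure)."""
    name: str
    description: str


@dataclass(frozen=True)
class Step:
    """One element of the Solution Trajectory St."""
    tool: str


# Hypothetical model interface: prompt text in, completion text out.
LLM = Callable[[str], str]


def plan(query: str, llm: LLM, corpus: List[Tool]) -> List[Step]:
    """St <- G(q, rho, C): ask the model for an ordered list of tool names."""
    known = {tool.name for tool in corpus}
    prompt = f"Query: {query}\nTools: {sorted(known)}"
    names = [name.strip() for name in llm(prompt).split(",")]
    # Keep only tools that exist in the corpus, preserving the proposed order.
    return [Step(tool=name) for name in names if name in known]
```

The planner only ever sees names and descriptions, never tool outputs, which is the constraint the black box setting imposes.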

2.2 Code Generation

The integration of Reflexion Shinn et al. (2023) with a Large Language Model (LLM) $\rho$ and a Python interpreter $\mathcal{I}$ has significantly advanced coding tasks by enabling iterative code refinement. This approach leverages feedback $\mathcal{F}$ to iteratively address exceptions and enhance the initial code output $c$, guided by test cases dynamically generated by $\rho$ itself. This ensures comprehensive verification and refinement within a Function Call module, leading to a finalized code $c_{n}$. This methodology improves code quality through automated verification. The process of iterative code generation is denoted in Eq. 1.

$$\begin{gathered}c_{i}\leftarrow\rho(q,\textit{feedback}_{i-1},c_{i-1})\\ \textit{output}\leftarrow\mathcal{I}(c_{i})\\ \textit{feedback}_{i},\textit{verified}\leftarrow\mathcal{F}(\textit{output})\end{gathered}\tag{1}$$
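The loop in Eq. 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Reflexion implementation: the `LLM` callable is a hypothetical stand-in for $\rho$, feedback is reduced to the exception message, and the model-generated test cases mentioned above are not modeled.

```python
from typing import Callable, Optional, Tuple

# Hypothetical model interface: (query, previous feedback, previous code) -> new code.
LLM = Callable[[str, Optional[str], Optional[str]], str]


def interpret(code: str) -> Tuple[bool, str]:
    """I followed by F: run the code and turn any exception into textual feedback."""
    try:
        exec(code, {})  # fresh namespace for every attempt
        return True, ""
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"


def refine(query: str, llm: LLM, max_iters: int = 5) -> Tuple[Optional[str], bool]:
    """Iterate until the code runs cleanly or the iteration budget is spent."""
    code, feedback = None, None
    for _ in range(max_iters):
        code = llm(query, feedback, code)     # c_i <- rho(q, feedback_{i-1}, c_{i-1})
        verified, feedback = interpret(code)  # feedback_i, verified <- F(I(c_i))
        if verified:
            return code, True
    return code, False
```

The `max_iters` bound is an addition of this sketch so that the loop always terminates; Eq. 1 itself does not specify a stopping rule.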

3 SwissNYF

3.1 Overview

In this section, we introduce SwissNYF, a suite that enables LLM-based agents to efficiently navigate the action space to identify a valid solution for problem-solving in a black box scenario. SwissNYF is composed of five major components: Function Signature Generation $\mathcal{P}$, Corpus & Retriever $\mathcal{C}$, Planner $\mathcal{G}$, Verifier $\mathcal{F}$, and Parser $\mathcal{K}$, as shown in Fig. 2. We explain individual components of the pipeline in the subsequent subsections.

3.2 Function Signature Generation

Function signatures, conceptualized as pseudo APIs, serve to emulate the behaviour of real API functions based on given tool descriptions. This emulation is crucial for two primary reasons in our tool planning methodology: firstly, they act as stand-ins for actual API calls, thereby enabling LLMs to plan and execute tasks with higher efficiency; secondly, they are treated as pre-defined functions, facilitating the transformation of tool augmentation into a task akin to code generation, using these pseudo functions. These function signatures are distinguished by their docstrings and an example return object that aligns with the tool description, equipping the planner with the necessary means to effectively address user queries. In the context of our SwissNYF implementation, we have adopted a straightforward yet effective method for generating these function signatures, termed CodeSynth. The efficacy of this approach is further analyzed in 4.3.

3.2.1 CodeSynth

For a given set of tool descriptions $t\in\mathcal{T}$ , we direct the Large Language Model (LLM) $\rho$ to generate pseudo-function implementations, denoted as $\hat{t}$ . Our primary objective is to ensure that the arguments and return types of these pseudo-functions remain consistent with their descriptions. Additionally, we craft detailed docstrings for each pseudo-function to facilitate subsequent processes. A critical aspect of CodeSynth is the inclusion of an example return value, which is designed to mimic all potential operations the returned object might undergo during the verification process. The output generated by CodeSynth is illustrated in Fig. 3(a). Moreover, the code generation facilitated by this block benefits from validation through Reflexion, as outlined in Eq. 1. Ultimately, the methodologies applied within CodeSynth can be encapsulated in Algo. 1.

Algorithm 1 $\mathcal{P}$: CodeSynth

Input: $\rho$: large language model; $T$: tool descriptions; $\mathcal{I}$: Python interpreter; $\mathcal{F}(\mathcal{I})$: Reflexion feedback of $\mathcal{I}$; $\mathcal{C}$: empty corpus of pseudo tools
for $t=1,2,\cdots,T$ do
    $\hat{t}_{0}\leftarrow\rho(t)$ // Pseudo code
    $feedback_{0},\ verified\leftarrow\mathcal{F}(\mathcal{I}(\hat{t}_{0}))$
    while not verified do
        $\hat{t}_{i}\leftarrow\rho(t,feedback_{i-1},\hat{t}_{i-1})$
        $feedback_{i},\ verified\leftarrow\mathcal{F}(\mathcal{I}(\hat{t}_{i}))$
    Update $\mathcal{C}\leftarrow\hat{t}_{n}$ // Update Corpus
Output: A corpus of verified pseudo functions $\mathcal{C}$

Utilizing the Function Calling module alongside the Interpreter, we test the pseudo-functions against test cases designed to reflect real-world usage. This helps keep the test cases representative of actual function usage and allows us to gather detailed feedback on the pseudo-functions’ behaviour. Such feedback is vital for the iterative improvement of the pseudo-functions, enhancing their reliability and applicability in practical settings. Prompts for CodeSynth are documented in A.1.
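To make the notion of a pseudo-function concrete, the following sketch shows what one might look like for an invented weather tool, together with a cheap structural check. Both `get_weather` and `check_pseudo_function` are hypothetical and are not taken from the SwissNYF code base; the check is far weaker than the Reflexion-based validation described above.

```python
import inspect


def get_weather(city: str, units: str = "metric") -> dict:
    """Return the current weather for a city.

    Args:
        city: Name of the city to look up.
        units: Either "metric" or "imperial".

    Returns:
        dict with the keys "city", "temperature", and "conditions".
    """
    # Example return value only: no real API is called.
    return {"city": city, "temperature": 21.5, "conditions": "clear"}


def check_pseudo_function(fn) -> bool:
    """Check that a pseudo-function has a docstring and a fully annotated signature."""
    sig = inspect.signature(fn)
    annotated = all(
        p.annotation is not inspect.Parameter.empty for p in sig.parameters.values()
    )
    has_return = sig.return_annotation is not inspect.Signature.empty
    return bool(fn.__doc__) and annotated and has_return
```

The example return value is what lets downstream plan code index into the result (for instance `get_weather("Varanasi")["temperature"]`) during verification without touching the real tool.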

3.3 Corpus and Retriever

The function signatures, crucial components of our methodology, are systematically stored within a corpus for future utilization by any planning system. This corpus facilitates the indexing of tool descriptions, enabling the precise retrieval of the most appropriate tool based on the index. Notably, the literature documents several advanced retrieval systems designed for this purpose, demonstrating exceptional accuracy. These include ToolBench IR Qin et al. (2023), APIRetriever Zan et al. (2022), Instructor-XL Su et al. (2022), and GEAR Lu et al. (2023b). Our framework incorporates these retrievers, with Instructor-XL set as the default option, owing to its proven efficacy. Furthermore, we are actively exploring the integration of AnyTool’s Hierarchical API Retriever Du et al. (2024), anticipating significant enhancements to our tool retrieval capabilities. This strategic inclusion of multiple retrievers ensures our system remains versatile and effective in identifying the most suitable tools for a given task, aligning with the latest advancements in retrieval technology.
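The retrieval step can be illustrated with a deliberately simple stand-in. The sketch below ranks tools by Jaccard overlap between query tokens and description tokens; it is not Instructor-XL or any of the retrievers cited above, which rely on learned embeddings, and the function names are hypothetical.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: Dict[str, str], k: int = 2) -> List[str]:
    """Return the names of the k tools whose descriptions best overlap the query."""
    query_tokens = tokenize(query)

    def score(name: str) -> float:
        doc_tokens = tokenize(corpus[name])
        union = query_tokens | doc_tokens
        return len(query_tokens & doc_tokens) / len(union) if union else 0.0

    # Sort by descending score, then by name so that ties are deterministic.
    return sorted(corpus, key=lambda name: (-score(name), name))[:k]
```

Swapping `score` for a cosine similarity over embeddings would keep the same interface while moving closer to the retrievers the framework actually ships with.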

3.4 Planner

We have implemented two planning approaches in our framework. The first leverages a modified Reverse Chain Zhang et al. (2023b) to support multiple end function calls by decomposing tasks into subtasks and creating sub-trees with the original reverse chain technique. The second, TOPGUN, is our proposed code-driven planning algorithm, designed for speed, efficiency, consistency, and accuracy, especially in black box scenarios. TOPGUN offers a streamlined alternative to traditional planning methods, optimizing for complex system navigation and task execution with greater reliability and cost-effectiveness.

3.4.1 TOPGUN

Algorithm 2 $\mathcal{G}$: TOPGUN

Input: $q$: query; $\rho$: large language model; $T$: tool descriptions; $\mathcal{I}$: Python interpreter; $\mathcal{F}(\mathcal{I})$: Reflexion feedback of $\mathcal{I}$; $\mathcal{C}$: empty corpus of pseudo tools; $\mathcal{P}$: CodeSynth; $\mathcal{K}$: parser
Initialize $\mathcal{\hat{T}}\leftarrow\mathcal{P}(\rho,T,\mathcal{I},\mathcal{F},\mathcal{C})$ // Pseudo tools
$c_{0}\leftarrow\rho(q,\mathcal{\hat{T}},\mathcal{C})$ // Code for query
$feedback_{0},\ verified\leftarrow\mathcal{F}(\mathcal{I}(c_{0},\mathcal{\hat{T}}))$ // Verify with pseudo tools
while not verified do
    $c_{i}\leftarrow\rho(q,\mathcal{\hat{T}},\mathcal{C},feedback_{i-1},c_{i-1})$
    $feedback_{i},\ verified\leftarrow\mathcal{F}(\mathcal{I}(c_{i},\mathcal{\hat{T}}))$
$St\leftarrow\mathcal{K}(c_{n})$ // Solution Trajectory
Output: A solution trajectory $St$ and $c_{n}$, the code for execution and evaluation

TOPGUN, an acronym for Tool Orchestration and Program synthesis for Generalizing over UNknown systems, redefines the approach to addressing user queries $q$ by framing the challenge as a code generation task. Making the pseudo-functions $\mathcal{\hat{T}}$ available to TOPGUN as callable functions enables the agent to construct an accurate sequence of function calls $c_{0}\leftarrow\rho(q,\mathcal{\hat{T}},\mathcal{C})$, as depicted in Fig. 3(b). Leveraging Reflexion, detailed in Eq. 1, the framework iteratively refines responses to the query. The synthesis of these components into the complete algorithm is presented in Algo. 2, which showcases TOPGUN’s capability to navigate through various solution paths. Unlike traditional traversal-based techniques, TOPGUN capitalizes on the inherent code-generation capabilities of LLMs, facilitating a more direct and efficient solution process. This distinction not only enhances efficacy by pinpointing issues with precision but also ensures adaptability in black box scenarios, while also optimizing performance in gray box settings. A detailed pipeline overview with TOPGUN in place is given in Fig. 3(b), with prompts documented in A.1.
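The verification step of this idea can be sketched as follows: candidate plan code is executed in a namespace that contains only pseudo tools, so a mistaken call surfaces as an exception without any real API being touched. The two tools and the `verify` helper are hypothetical examples, and the feedback here is simply the exception message rather than a Reflexion-style reflection.

```python
from typing import Tuple


def search_flights(origin: str, destination: str) -> list:
    """Pseudo tool: return example flights (no real API call)."""
    return [{"flight_id": "XX100", "price": 120.0}]


def book_flight(flight_id: str) -> dict:
    """Pseudo tool: return an example booking confirmation."""
    return {"flight_id": flight_id, "status": "confirmed"}


PSEUDO_TOOLS = {"search_flights": search_flights, "book_flight": book_flight}


def verify(plan_code: str) -> Tuple[bool, str]:
    """Run plan_code with only the pseudo tools in scope; return (verified, feedback)."""
    namespace = dict(PSEUDO_TOOLS)
    try:
        exec(plan_code, namespace)
        return True, ""
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
```

A plan that calls a tool absent from the corpus fails with a `NameError`, and that message is the kind of signal the planner can use when regenerating the code in the next iteration.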

3.5 Verifier

Verification is closely linked to the functionality of the Planner $\mathcal{G}$, relying on both the nature of $\mathcal{G}$’s output and its ability to incorporate feedback. Although verification initially serves as a preparatory step prior to parsing, it also plays a crucial role in refining outputs by providing feedback that $\mathcal{G}$ can use for subsequent iterations.

In our framework, we leverage Reflexion Shinn et al. (2023), detailed in Eq. 1 and depicted in Algo. 2, to seamlessly integrate verification and feedback within the TOPGUN methodology. This eliminates the requirement for an additional function call module, concentrating instead on directly executing code pertinent to the user query. This approach is illustrated in Fig. 5, providing a visual representation of the concept.

3.6 Parser

The Parser $\mathcal{K}$, akin to the Verifier $\mathcal{F}$, is intrinsically dependent on the Planner $\mathcal{G}$ for its functionality. Its pivotal output is a well-defined Solution Trajectory $St$, mapping out the sequence of tool applications devised to address the query. In employing the Reverse Chain technique, our methodology involves synthesizing individual sub-trees into a singular, comprehensive tree through the capabilities of LLM $\rho$. The process’s efficacy is markedly improved by the judicious reuse of elements from the individual trees during their amalgamation.

Conversely, for the TOPGUN methodology, we adopt the established Abstract Syntax Tree (AST) paradigm Fischer et al. (2007) to segment the program into fundamental function calls, alongside specifying their arguments and return values. This segmentation is instrumental in constructing a systematic series of tool invocations. This series, denoted as $St$, is formalized as $St\leftarrow\mathcal{K}(c_{n})$.
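A minimal version of such AST-based parsing can be written with Python's standard `ast` module. The sketch below is an assumption about how the segmentation could work, not the SwissNYF parser: it records only calls to known tool names and their argument source text, and it relies on `ast.unparse`, which requires Python 3.9 or later.

```python
import ast
from typing import List, Set, Tuple


class CallCollector(ast.NodeVisitor):
    """Collect calls to known tools in the order they would be evaluated."""

    def __init__(self, tool_names: Set[str]) -> None:
        self.tool_names = tool_names
        self.steps: List[Tuple[str, List[str]]] = []

    def visit_Call(self, node: ast.Call) -> None:
        # Visit nested calls first: arguments are evaluated before the outer call.
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id in self.tool_names:
            args = [ast.unparse(arg) for arg in node.args]
            args += [f"{kw.arg}={ast.unparse(kw.value)}" for kw in node.keywords]
            self.steps.append((node.func.id, args))


def extract_trajectory(code: str, tool_names: Set[str]) -> List[Tuple[str, List[str]]]:
    """St <- K(c_n): turn verified plan code into an ordered list of tool calls."""
    collector = CallCollector(tool_names)
    collector.visit(ast.parse(code))
    return collector.steps
```

Because the tree is walked without executing anything, the order is only meaningful for straight-line plans; loops and branches would need extra handling that this sketch omits.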

The entire pipeline, as depicted in Figure 3, emerges from the integration of various components designed to effectively address user queries through the strategic orchestration of tools within the SwissNYF framework.

Table 1: Win rate (%) of different candidate and reference models over the G1 set

Candidate	Reference	G1-Instruction	G1-Tool	G1-Category
T.LLaMA ReACT	ChatGPT ReACT	45.0	42.0	47.5
T.LLaMA DFSDT	ChatGPT ReACT	55.0	55.3	54.5
T.LLaMA DFSDT+Ret	ChatGPT ReACT	62.3	59.0	55.0
ChatGPT DFSDT	ChatGPT ReACT	60.5	62.0	57.3
GPT4 ReACT	ChatGPT ReACT	60.0	58.8	63.5
GPT4 DFSDT	ChatGPT ReACT	67.5	67.8	66.5
GPT4 TOPGUN	ChatGPT ReACT	88.192	87.46	87.15
GPT4 TOPGUN	ChatGPT DFSDT	78.49	77.55	76.24
GPT4 TOPGUN	T.LLaMA ReACT	86.72	82.94	80.80
GPT4 TOPGUN	T.LLaMA DFSDT	81.75	75.51	73.81
GPT4 TOPGUN	T.LLaMA DFSDT+Ret	80.35	77.11	75.39
GPT4 TOPGUN	GPT4 ReACT	82.996	79.956	77.633
GPT4 TOPGUN	GPT4 DFSDT	82.065	73.69	71.14

4 Experiments

Tool planning datasets, while diverse, often fall short in supporting multi-turn and multi-call dialogues, as seen in works by Schick et al. (2023) and Tang et al. (2023), and lack precise evaluation metrics, complicating thorough assessments. Even comprehensive datasets like ToolBench by Qin et al. (2023) struggle with aligning to black-box settings, presenting significant challenges for evaluating tool planning in such scenarios.

Our evaluation employs the ToolBench benchmark Qin et al. (2023) and a specially curated dataset for uncharted codebases, assessed in both gray box (4.1) and black box (4.2) settings. We benchmark our TOPGUN approach against existing methods using win rate, token count, and success rate. Additionally, we scrutinize CodeSynth’s ($\mathcal{P}$) impact on the Planner’s ($\mathcal{G}$) performance and independently evaluate its ability to generate effective function signatures, acting as pseudo functions, detailed in Section 4.3.

Table 2: Win rate (%) of different candidate and reference models over the G2 and G3 sets, and the average over all sets

Candidate	Reference	G2-Instruction	G2-Category	G3-Instruction	Average
T.LLaMA ReACT	ChatGPT ReACT	50.8	41.8	55.0	47.0
T.LLaMA DFSDT	ChatGPT ReACT	68.5	58.0	69.0	60.0
T.LLaMA DFSDT+Ret	ChatGPT ReACT	68.5	60.8	73.0	63.1
ChatGPT DFSDT	ChatGPT ReACT	72.0	64.8	69.0	64.3
GPT4 ReACT	ChatGPT ReACT	65.8	60.3	78.0	64.0
GPT4 DFSDT	ChatGPT ReACT	73.3	63.3	84.0	70.4
GPT4 TOPGUN	ChatGPT ReACT	87.59	78.78	90.05	86.54
GPT4 TOPGUN	ChatGPT DFSDT	81.63	73.07	85.26	78.71
GPT4 TOPGUN	T.LLaMA ReACT	86.24	77.71	93.23	84.61
GPT4 TOPGUN	T.LLaMA DFSDT	78.31	71.80	89.47	78.44
GPT4 TOPGUN	T.LLaMA DFSDT+Ret	83.07	72.92	87.82	79.44
GPT4 TOPGUN	GPT4 ReACT	78.61	73.75	93.68	80.27
GPT4 TOPGUN	GPT4 DFSDT	73.92	71.35	79.25	78.59

4.1 Gray Box Evaluation

To assess the performance of TOPGUN and compare it with other gray box methodologies such as ReACT and DFSDT, we maintain the integrity of our pipeline while adapting the evaluation process to incorporate actual functions in place of pseudo functions within the output solution trajectory. This approach effectively leaves our black box pipeline intact while converting it into a gray box evaluation framework. The necessity of responses and final answers for evaluation purposes has led us to adopt this hybrid strategy. In practical scenarios, this mirrors the process where a generalist planner delivers a strategy to the client, who then substitutes pseudo-function implementations with their real functions for execution. For this evaluation, we employ ToolBench, as detailed by Qin et al. (2023), and conduct our analysis across all problem categories provided in the dataset. Further elaboration on the precise evaluation methodology and the application of ToolBench is documented in A.2.
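The substitution of real functions for pseudo functions can be pictured as running the same plan code against two different tool bindings. The sketch below uses an invented `add_entry` tool and an in-memory list as a stand-in for a client's real system; none of these names come from ToolBench or SwissNYF.

```python
from typing import Callable, Dict, List

DATABASE: List[str] = []  # stand-in for a client's real, stateful system


def pseudo_add_entry(name: str) -> dict:
    """Pseudo tool used while planning: returns an example value, changes nothing."""
    return {"name": "example", "stored": True}


def real_add_entry(name: str) -> dict:
    """Real tool used at execution time: actually mutates state."""
    DATABASE.append(name)
    return {"name": name, "stored": True}


def run_plan(plan_code: str, tools: Dict[str, Callable]) -> dict:
    """Execute plan code with the given tool bindings and return its namespace."""
    namespace = dict(tools)
    exec(plan_code, namespace)
    return namespace


PLAN = "result = add_entry('alice')"
```

Calling `run_plan(PLAN, {"add_entry": pseudo_add_entry})` exercises the plan without side effects, while binding `real_add_entry` under the same name executes the identical code against the stateful system.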

Results: Win rate comparisons for ToolLLaMA-ReACT, ToolLLaMA-DFSDT, ChatGPT-DFSDT, GPT4-DFSDT, and GPT4-TOPGUN against ChatGPT-ReACT, and for GPT4-TOPGUN against each of the other methods, are summarized in Tables 1 and 2, with averages taken from 7 runs per model pair. TOPGUN surpassed ReACT and DFSDT in all categories, achieving average win rates of 80.27% against GPT4-ReACT, 78.59% against GPT4-DFSDT, and 86.54% against ChatGPT-ReACT. Measured against the common ChatGPT-ReACT reference, this is an improvement of 22.54 and 16.14 percentage points over GPT4-ReACT (64.0%) and GPT4-DFSDT (70.4%), respectively. These results highlight TOPGUN’s superior ability to create tool plans that align with preference evaluation criteria across various conditions.

4.2 Black Box Evaluation

Utilizing the Data Generation pipeline from Qin et al. (2023), we constructed a black-box scenario dataset featuring 36 LLaMa-Hub LlamaIndex (2023) tools and unique functions from private libraries. Following Zan et al. (2022), we converted Pandas and Numpy into Monkey and BeatNum packages, renaming all internal functions and structures to test planner generalizability without LLM prior knowledge. This dataset, detailed at A.1, focuses on accuracy of the solution trajectory, with each query designed for a single correct path. After manual annotation, it comprises 100 queries and 162 tools, with samples and TOPGUN outcomes at A.3.2 and A.5.2.

Results: The black-box evaluation, featuring TOPGUN and a revised Reverse Chain, utilizes $\mathcal{P}$ function signatures for a comprehensive black-box methodology. TOPGUN surpasses Reverse Chain and is also compared with GPT4-DFSDT and GPT4-ReACT, which are run within gray box evaluations, with the comparison focusing on output trajectories. Success rates, derived from exact trajectory matches with the ground truth and averaged over ten iterations, are documented in Table 3. Figure 6 details the average token usage for each algorithm per query, underscoring TOPGUN’s effectiveness and efficiency in generating precise and economical tool plans in black-box scenarios, and demonstrating its adaptability across diverse datasets.
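The success-rate metric described above reduces to an exact-match comparison per query, averaged over runs. The sketch below shows one way to compute it; the function names and the representation of a trajectory as a list of tool names are assumptions of this example, not the evaluation script used for Table 3.

```python
from typing import List, Sequence


def success_rate(predicted: Sequence[Sequence[str]], gold: Sequence[Sequence[str]]) -> float:
    """Percentage of queries whose predicted trajectory equals the gold one exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same queries")
    if not gold:
        return 0.0
    hits = sum(list(p) == list(g) for p, g in zip(predicted, gold))
    return 100.0 * hits / len(gold)


def mean_success_rate(runs: List[Sequence[Sequence[str]]], gold: Sequence[Sequence[str]]) -> float:
    """Average the per-run success rate over several independent runs."""
    return sum(success_rate(run, gold) for run in runs) / len(runs)
```

Exact matching is strict: a trajectory that uses the right tools in the wrong order counts as a failure, which fits a dataset in which each query is designed to have a single correct path.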

Note: A black-box evaluation using ToolBench is infeasible, as ToolEval’s metrics, such as pass rate and win rate, rely on intermediate tool responses and the final answer.

4.3 CodeSynth Evaluation

To assess the quality of function signatures produced by CodeSynth, we adopt neuro-symbolic representations, as proposed by Parisotto et al. (2017) and Nye et al. (2021). These representations aim to capture the abstract semantic essence of a given program, aligning well with our objectives. Our evaluation spans the Python subset of HumanEval-X Zheng et al. (2023) and the MBPP Austin et al. (2021) dataset. Inspired by the semantic probing model introduced by Ma et al. (2023), we construct semantic representations of both synthesized pseudo functions and ground truth code. Utilizing the tree-sitter Brunsfeld et al. (2024) package, we form the Abstract Syntax Tree and compute the F1 score exclusively on the Function Definition block, excluding the body block. Hence, the final metric directly reflects our objective with CodeSynth. Function signature examples synthesized with the HumanEval-X dataset are given in Appendix A.4.1.
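A signature-only F1 score of this kind can be approximated as follows. This sketch substitutes Python's built-in `ast` module for tree-sitter, and the choice of features (function name, argument names, annotations, return annotation) is an assumption, since the exact node set that is scored is not specified in this section.

```python
import ast
from collections import Counter


def signature_features(source: str) -> Counter:
    """Collect features from the first function definition, ignoring its body."""
    fn = next(n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.FunctionDef))
    feats: Counter = Counter({("name", fn.name): 1})
    for arg in fn.args.args:
        feats[("arg", arg.arg)] += 1
        if arg.annotation is not None:
            feats[("arg_type", arg.arg, ast.unparse(arg.annotation))] += 1
    if fn.returns is not None:
        feats[("returns", ast.unparse(fn.returns))] += 1
    return feats


def signature_f1(predicted: str, reference: str) -> float:
    """F1 between the signature features of a pseudo function and the reference."""
    pred, ref = signature_features(predicted), signature_features(reference)
    overlap = sum((pred & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Two functions with identical signatures but different bodies score 1.0 under this measure, which matches the intent of scoring only the definition block.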

Results: We evaluate CodeSynth across multiple reflection cycles, tracking the F1 score for each cycle to illustrate consistent enhancements in function signature quality, as depicted in Table 4. CodeSynth significantly improved F1-scores on both HumanEval-X and MBPP datasets, achieving a perfect score of 1.0 by the fifth iteration from initial scores of 0.844 and 0.912, respectively. These findings highlight CodeSynth’s ability to produce function signatures closely resembling the semantics of the target function.

Table 3: Comparison of methodologies in Black Box Setting

Method	Success Rate
GPT4-TOPGUN	70.58
GPT4-DFSDT	61.45
GPT4-ReACT	45.45
GPT4-ReverseChain	43.75

Table 4: CodeSynth Evaluation for analyzing Reflexion’s improvement on Function Signature’s AST (F1 score by maximum Reflexion iteration)

Dataset	@1	@2	@3	@4	@5
HumanEval-X	0.844	0.894	0.965	0.983	1.00
MBPP	0.912	0.963	0.994	1.00	1.00

5 Conclusion

In this work, we address the challenge of tool planning in black-box settings, where direct access to API calls and their implementations is not feasible, raising concerns about cost efficiency and privacy in API interactions. We introduce SwissNYF, a comprehensive framework designed to equip Large Language Models (LLMs) with the ability to navigate these scenarios effectively. Central to SwissNYF is function signature generation, which allows the planner to rely on tool descriptions, circumventing the need for actual API executions. We further introduce TOPGUN, a code-driven planning approach leveraging LLMs’ code generation capabilities to offer a robust solution for black-box environments. Our evaluation across various toolsets and settings demonstrates the superior performance of our methodology against traditional tool planning strategies, validating its effectiveness and reliability. Through SwissNYF and TOPGUN, we establish an emerging paradigm in tool planning. We envision SwissNYF as a central hub for black-box tool usage, encouraging future advancements in developing strategies for black-box scenarios and moving towards efficient, privacy-conscious tool planning in LLM-enhanced applications.

References

Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Anil et al. (2023) Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Brunsfeld et al. (2024) Max Brunsfeld, Andrew Hlynskyi, Amaan Qureshi, Patrick Thomson, Josh Vera, Phil Turnbull, Timothy Clem, Douglas Creager, Andrew Helwer, dundargoc, Rob Rix, Daumantas Kavolis, Hendrik van Antwerpen, Michael Davis, Ika, Tuan-Anh Nguyen, Amin Yahyaabadi, Stafford Brunk, Matt Massicotte, and George Fraser. tree-sitter/tree-sitter: v0.21.0-pre-release-1, 2024. URL https://doi.org/10.5281/zenodo.10638807.
Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
Chen et al. (2022) Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022.
Chowdhery et al. (2023) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
Du et al. (2024) Yu Du, Fangyun Wei, and Hongyang Zhang. Anytool: Self-reflective, hierarchical agents for large-scale api calls. arXiv preprint arXiv:2402.04253, 2024.
Fischer et al. (2007) Gregor Fischer, J Lusiardi, and J Wolff Von Gudenberg. Abstract syntax trees-and their role in model driven software development. In International Conference on Software Engineering Advances (ICSEA 2007), pp. 38–38. IEEE, 2007.
Gao et al. (2023) Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. In International Conference on Machine Learning, pp. 10764–10799. PMLR, 2023.
Gou et al. (2023) Zhibin Gou, Zhihong Shao, Yeyun Gong, Yujiu Yang, Minlie Huang, Nan Duan, Weizhu Chen, et al. Tora: A tool-integrated reasoning agent for mathematical problem solving. arXiv preprint arXiv:2309.17452, 2023.
Hao et al. (2023) Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings. arXiv preprint arXiv:2305.11554, 2023.
Huang & Chang (2023) Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Findings of the Association for Computational Linguistics: ACL 2023, pp. 1049–1065, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.67. URL https://aclanthology.org/2023.findings-acl.67.
Li et al. (2023) Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, and Brian Ichter. Chain of code: Reasoning with a language model-augmented code emulator. arXiv preprint arXiv:2312.04474, 2023.
LlamaIndex (2023) LlamaIndex. Llamahub, 2023. URL https://web.archive.org/web/20231229215448/https://llamahub.ai/.
Lu et al. (2023a) Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. arXiv preprint arXiv:2304.09842, 2023a.
Lu et al. (2023b) Yining Lu, Haoping Yu, and Daniel Khashabi. Gear: Augmenting language models with generalizable and efficient tool resolution. arXiv preprint arXiv:2307.08775, 2023b.
Ma et al. (2023) Wei Ma, Mengjie Zhao, Xiaofei Xie, Qiang Hu, Shangqing Liu, Jie Zhang, Wenhan Wang, and Yang Liu. Are code pre-trained models powerful to learn code syntax and semantics?, 2023.
Nye et al. (2021) Maxwell Nye, Yewen Pu, Matthew Bowers, Jacob Andreas, Joshua B. Tenenbaum, and Armando Solar-Lezama. Representing partial programs with blended abstract semantics. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=mCtadqIxOJ.
Paranjape et al. (2023) Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro. Art: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014, 2023.
Parisi et al. (2022) Aaron Parisi, Yao Zhao, and Noah Fiedel. Talm: Tool augmented language models. arXiv preprint arXiv:2205.12255, 2022.
Parisotto et al. (2017) Emilio Parisotto, Abdel rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. Neuro-symbolic program synthesis. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=rJ0JwFcex.
Patil et al. (2023) Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive apis. arXiv preprint arXiv:2305.15334, 2023.
Qin et al. (2023) Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023.
Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Schick et al. (2023) Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.
Shinn et al. (2023) Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Su et al. (2022) Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A Smith, Luke Zettlemoyer, and Tao Yu. One embedder, any task: Instruction-finetuned text embeddings. arXiv preprint arXiv:2212.09741, 2022.
Tang et al. (2023) Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, and Le Sun. Toolalpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv preprint arXiv:2306.05301, 2023.
Xu et al. (2023) Yiheng Xu, Hongjin Su, Chen Xing, Boyu Mi, Qian Liu, Weijia Shi, Binyuan Hui, Fan Zhou, Yitao Liu, Tianbao Xie, et al. Lemur: Harmonizing natural language and code for language agents. arXiv preprint arXiv:2310.06830, 2023.
Yang et al. (2023) Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. Gpt4tools: Teaching large language model to use tools via self-instruction, 2023.
Yao et al. (2022) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
Zan et al. (2022) Daoguang Zan, Bei Chen, Zeqi Lin, Bei Guan, Yongji Wang, and Jian-Guang Lou. When language model meets private library. arXiv preprint arXiv:2210.17236, 2022.
Zhang et al. (2023a) Beichen Zhang, Kun Zhou, Xilin Wei, Wayne Xin Zhao, Jing Sha, Shijin Wang, and Ji-Rong Wen. Evaluating and improving tool-augmented computation-intensive math reasoning, 2023a.
Zhang et al. (2023b) Yinger Zhang, Hui Cai, Yicheng Chen, Rui Sun, and Jing Zheng. Reverse chain: A generic-rule for llms to master multi-api planning. arXiv preprint arXiv:2310.04474, 2023b.
Zheng et al. (2023) Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, et al. Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 5673–5684, 2023.
Zhou et al. (2023a) Andy Zhou, Kai Yan, Michal Shlapentokh-Rothman, Haohan Wang, and Yu-Xiong Wang. Language agent tree search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406, 2023a.
Zhou et al. (2023b) Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, et al. Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification. arXiv preprint arXiv:2308.07921, 2023b.
Zhuang et al. (2023a) Yuchen Zhuang, Xiang Chen, Tong Yu, Saayan Mitra, Victor Bursztyn, Ryan A. Rossi, Somdeb Sarkhel, and Chao Zhang. Toolchain*: Efficient action space navigation in large language models with a* search, 2023a.
Zhuang et al. (2023b) Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. Toolqa: A dataset for llm question answering with external tools, 2023b.

Appendix A Appendix

A.1 Prompts

CodeSynth prompt for function signature generation

You are a Python code assistant that can generate a
pseudo-Python function given the name, description,
and arguments.

function name: {}
function description: {}

You have to generate a pseudo-Python function that only
contains docstring and a return example object for the
above-given information. Use dummy examples as return
objects.

Maintain the return datatype. Docstring contains Args and
Returns. Maintain the arguments typing. The arguments are
optional and should be assigned relevant default values
according to their return type.

Only generate the def function itself as instructed above,
no typing imports or other code is needed.
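
The two `{}` slots in the prompt above are filled with a tool's name and description. The sketch below shows one way this could be done, together with the kind of pseudo function the prompt asks for. The helper `build_codesynth_prompt`, the shortened template, and the `add` tool are illustrative assumptions, not part of SwissNYF.

```python
# Shortened copy of the CodeSynth template; only the two slots matter here.
CODESYNTH_TEMPLATE = (
    "You are a Python code assistant that can generate a pseudo-Python "
    "function given the name, description, and arguments.\n\n"
    "function name: {}\n"
    "function description: {}\n"
)


def build_codesynth_prompt(name: str, description: str) -> str:
    """Hypothetical helper: fill the template for one tool."""
    return CODESYNTH_TEMPLATE.format(name, description)


# What a conforming pseudo function might look like: typed arguments with
# default values, a docstring with Args and Returns, and a dummy return.
def add(a: int = 0, b: int = 0) -> int:
    """
    Args:
        a (int): First addend.
        b (int): Second addend.

    Returns:
        int: The total value of a and b.
    """
    return 0  # dummy return object, not the real result


prompt = build_codesynth_prompt("add", "Return the total value of a and b.")
```

The pseudo function deliberately ignores its inputs: in a black-box setting only its signature and return type are needed for planning.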

TOPGUN prompt for code-based plan generation


You are a Python code assistant. Today, you are challenged
to generate a Python code for executing a query. You will
be given a list of pseudo functions that you will use in
your Python code to help you in solving the query correctly.

Understand the query properly and use the required
function to solve it.

We have the following pseudo functions:
=====
{}
=====

Let’s start

If the query is {}
Return the python code to execute it with the help of given
functions. Do not use double quotes; only use single quotes.
Always have to the code within ‘‘‘python\n<--Your Code-->\n‘‘‘
Always remember if a function is to input or output an object
assume the object to be a string.
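
Because the prompt above asks for the plan to be wrapped in a fenced python block, the plan has to be extracted from the model response before it can be checked. A minimal sketch of that extraction follows; the function name `extract_plan` and the sample response are assumptions made for illustration.

```python
import re

# Three backticks, built indirectly so this snippet can itself sit in a fence.
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)\n" + FENCE, re.DOTALL)


def extract_plan(response: str) -> str:
    """Return the body of the first fenced python block in a response."""
    match = CODE_BLOCK.search(response)
    if match is None:
        raise ValueError("no fenced python block found in response")
    return match.group(1)


# Made-up model response used only to exercise the helper.
response = "Here is the plan:\n" + FENCE + "python\nresult = add(1, 4)\n" + FENCE
plan = extract_plan(response)
```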

Function Call Prompt for verification

You are a Python code assistant. You are given a function.
For the given function, write an executable function call
using dummy argument values.

Provided Libraries: {}

Details of the provided library can only be fetched using
the query engine tool, feel free to use it.

-You can import the required classes from one of the provided
 libraries, according to the function arguments and documentation.
-If any library is not provided, ignore any imports.
-Do not import {} function for which you generate the
 function call.
-Do not generate any unnecessary import statements.
-No print statements are needed.
-Always have to code within ‘‘‘python\n<--Your Code-->\n‘‘‘

Example:

Given Function:
  def add(a: int, b: int) -> int:
      '''
      Given integers a and b,
      return the total value of a and b.
      '''
      return a + b

Function Call:
  a = 1
  b = 4
  add(a, b)

The function name is: {}
The function description is: {}
The Function is: {}
Function Call:
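
One plausible way to use the generated function call for verification is to execute it against the pseudo function in a fresh namespace and record any exception. The sketch below assumes this design; `verify_call` is a hypothetical name and the actual SwissNYF harness may differ.

```python
def verify_call(pseudo_function_src: str, call_src: str):
    """Run a pseudo function and a generated call in an isolated namespace.

    Returns (True, None) on success, or (False, message) when either
    snippet fails to compile or raises at run time.
    """
    namespace = {}
    try:
        exec(pseudo_function_src, namespace)
        exec(call_src, namespace)
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, None


pseudo = "def add(a: int, b: int) -> int:\n    '''Return a + b.'''\n    return 0\n"
good_call = "a = 1\nb = 4\nadd(a, b)\n"
bad_call = "add(1)\n"  # missing the second argument

ok, error = verify_call(pseudo, good_call)
failed_ok, failed_error = verify_call(pseudo, bad_call)
```

The error message from a failed call is the kind of feedback that the self-reflection prompt can consume.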

Self-Reflection Prompt

You are a Python code assistant. You will be given your last
Python code implementation, and an error in your last
implementation will be provided. Taking the error into
account, refactor your Python code.

Use the query engine to export the information needed
to resolve.

Always have to code within ‘‘‘python\n<--Your Code-->\n‘‘‘

Previous python code implementation: {}
Self-reflection: {}

Refactored Python code:
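
The self-reflection prompt above implies a loop: run the code, feed any error back to the model, and retry. A minimal sketch of such a loop is shown below, with a stub in place of the LLM call. The names `reflect_until_valid` and `stub_refactor` and the retry limit are assumptions.

```python
from typing import Callable, Optional, Tuple


def run(code: str) -> Optional[str]:
    """Execute code; return None on success or an error string on failure."""
    try:
        exec(code, {})
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def reflect_until_valid(
    code: str,
    refactor: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, bool]:
    """Retry with refactored code until it runs or the budget is spent."""
    for _ in range(max_rounds):
        error = run(code)
        if error is None:
            return code, True
        code = refactor(code, error)  # previous code plus error go to the model
    return code, run(code) is None


def stub_refactor(previous_code: str, error: str) -> str:
    """Stub standing in for the LLM: patches the one known failure."""
    return previous_code.replace("undefined_name", "'ok'")


final_code, valid = reflect_until_valid("value = undefined_name", stub_refactor)
```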

CodeSynth prompt for function signature generation on PrivateEval

You are a Python code assistant that can generate a pseudo
Python function given its name, description, and arguments.

function name: {}
function description: {}
Provided Libraries: {}

Always remember to import the required classes from one of the
provided library, according to the function arguments and the
provided documentation.

Documentation is to be fetched using the query engine tool.

If any library is not provided, ignore any imports.

The function arguments and returns are clearly defined in the
function description. Use as provided in the description.

You have to generate a pseudo-Python function that only contains
docstring and a dummy return object matching the actual return
datatype. No need to use the provided arguments. Just return a
dummy object that matches the actual return datatype of the
function.

Maintain the actual return datatype in the return object.
Docstring contains Args and Returns. Maintain the arguments
typing.

Only generate the def function as instructed above; no typing
imports or other code is needed.

Always have to the code within ‘‘‘python\n<--Your Code-->\n‘‘‘

Pseudo Function:

TOPGUN prompt for code-based plan generation on ToolBench

You are a Python code assistant. Today, you are challenged
to generate a Python code for executing a query. You will
be given a list of pseudo functions that you will use in
your Python code to help you in solving the query correctly.
Understand the query properly and use the required function
to solve it.

We have the following pseudo functions:
=====
{}
=====

You have to make sure to follow the below guardrails:
 - Do not use double quotes; only use single quotes.
 - You are not allowed to define any functions; you must
   always use the given functions in the code.
 - If in case you end up creating a function, please
   remember to have a decorator named @update_traverse_dict
   on them.
 - Do not create a main function script and using
   ’if __name__ == "__main__"’ is strictly prohibited.
 - Always have to the code within ‘‘‘python\n<--Your Code-->\n‘‘‘
 - Always remember to use .get() to fetch values from a
   dictionary or a JSON.
 - Always remember to replace the values in .get() of the
   generated code with a value that matches the description of
   its key and dictionary whose argument it is. Use your
   world knowledge to replace the value with a
   good, real example.
   Example:
     contact = company_info.get('contact_number', '999991999')
     name = company_info.get('name', 'ryanair')

   Remember to Keep the values inside single quotes ' '.
 - This is also required when accessing the value of the list
   use try: except: and in except use a value that matches
   the description of the output.
 - Never use print statements. The user can use the variables
   in the code to infer the code.

You have to remember the following to solve the query:
 - Always remember if a function is to input or output an
   object assumes an object to be a string.
 - Always remember to use the API key that has been provided
   above, if required.

If the query is {}
Return the Python code to execute it with the help of the given
pseudo functions.
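
The guardrails above can be illustrated with a short plan. Since pseudo functions return dummy objects, `.get()` defaults and guarded list access keep a plan executable even when a key or element is missing. The prompt names the `@update_traverse_dict` decorator but does not define it, so the implementation below is purely an assumption (it records each call), as are `TRAVERSE_LOG`, `get_company_info`, and the fallback flight code.

```python
import functools

# Hypothetical trace store; the real structure is not given in the paper.
TRAVERSE_LOG = []


def update_traverse_dict(func):
    '''Assumed behaviour: record the name and arguments of each call.'''
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        TRAVERSE_LOG.append({'name': func.__name__, 'args': args, 'kwargs': kwargs})
        return result
    return wrapper


@update_traverse_dict
def get_company_info(name: str) -> dict:
    '''Pseudo function whose dummy return lacks a contact number.'''
    return {'name': name}


company_info = get_company_info('ryanair')

# .get() with realistic defaults, as the guardrails require
contact = company_info.get('contact_number', '999991999')
name = company_info.get('name', 'ryanair')

# Guarded list access with a fallback matching the expected output
flights = []
try:
    first_flight = flights[0]
except IndexError:
    first_flight = 'FR1234'
```

The sketch catches `IndexError` specifically, which is slightly stricter than the bare `except:` the prompt describes.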

Prompt for query generation for PrivateEval

You will be provided with several tools, tool descriptions, all of
each tool’s available API functions, the descriptions of these API
functions, and the parameters required for each API function. Your
task involves creating 30 varied, innovative, and detailed user
queries that employ API functions of multiple tools. For instance,
given three tools ‘azure speech’, ‘wikipedia’, and ‘google search’:
‘azure speech’ has API functions ’speech_to_text’ and
’text_to_speech’, ‘wikipedia’ has API functions ’search_data’ and
’read_search_data’, ‘google search’ has API functions
‘google_search’ and ‘read_google_search’. Your query should
articulate something akin to: ‘I recently found a banana with red
spots inside. Which plant disease is this? Can you find an Wikipedia
article on this and read it out to me.’ This query exemplifies how
to utilize API calls of all the given tools. A query that uses API
calls of only one tool will not be accepted. Additionally, you must
incorporate the input parameters required for each API call. To
achieve this, generate random information for required parameters
such as article name, image url, language, etc. For instance, don’t
merely say ‘example image url’, provide the exact link to a image.
Don’t just mention ‘language’, specify en, fr, it, etc. Don’t refer
to ‘dish’, use a real dish such as ‘lasagna’ instead. The first
twenty of the thirty queries should be very specific. Each single
query should combine API calls of different tools in various ways
and include the necessary parameters. Note that you shouldn’t ask
‘which API to use’, rather, simply state your needs that can be
addressed by these APIs. You should also avoid asking for the
input parameters required by the API call, but instead directly
provide the parameters in your query. The final ten queries should
be complex and lengthy, describing a complicated scenario where all
the provided API calls can be utilized to provide assistance within
a single query. You should first think about possible related API
combinations, then give your query. Related APIs are APIs that can
be used for a given query; those related APIs have to strictly come
from the provided API names. For each query, there should be
multiple related APIs; for different queries, overlap of related
APIs should be as little as possible. Deliver your response in
this format: [Query1: ...., ‘related apis’: [[tool name, api name],
[tool name, api name], [tool name, API name]...],Query2: ......,
‘related apis’:[[tool name,api name],[tool name, api name],
[tool name, api name]...] ,Query3: ......, ‘related apis’:
[[tool name, api name], [tool name, api name],
[tool name, api name]...], ...]

A.2 ToolBench for Gray Box Evaluation

ToolBench is a diverse benchmark spanning over 16k APIs across 49 categories from RapidAPI Hub. It provides three sets of instructions for evaluating tool augmentation: (1) single-tool instructions (I1), (2) intra-category multi-tool instructions (I2), and (3) intra-collection multi-tool instructions (I3). This rich set of APIs and instructions makes it a suitable testbed for our pipeline. ToolBench also proposes ToolEval, which defines the evaluation procedure for these instructions. ToolEval uses ChatGPT to compute two metrics: (1) Pass Rate, the proportion of instructions completed within a limited budget; and (2) Win Rate, obtained by asking a ChatGPT evaluator to choose between two solution paths. We use Win Rate to compare TOPGUN with other gray-box approaches such as DFSDT and ReAct. ToolEval uses a tree-based representation of the responses to generate solution paths, which are then compared to calculate the win rate.

ToolEval response representation

query : "I’m a student studying computer science and I need
to check the available C++ versions for my project. Can you
fetch the versions using the CPP Code Compiler API?
Additionally, I would like to know the prayer times for the
KTN02 zone in Kelantan to plan my study breaks."

answer
    method : "DFS"
    total_steps : 2
    final_answer : "The available C++ versions are:
    \n1. GCC 5.3.0\n2. Zapcc 5.0.0\n3. GCC 7.2.0\n4.
    GCC 8.1.0\n5. GCC 9.1.0\n6. GCC 11.1.0 The prayer
    times for the KTN02 zone in Kelantan are as follows:\n1.
    Imsak: 05:44:00\n2. Subuh: 05:54:00\n3. Syuruk: 07:08:00\n4.
    Zohor: 13:22:00\n5. Asar: 16:46:00\n6. Maghrib: 19:32:00\n7.
    Isyak: 20:42:00"

answer_details
    role : "tool"
    message
        name : "c_versions_for_cpp_code_compiler"
        arguments : "{}"
        response : "{"error": "", "response": "{’0’: {’id’: 0,
        ’name’: ’GCC 5.3.0’}, ’1’: {’id’: 1, ’name’:
        ’Zapcc 5.0.0’}, ’2’: {’id’: 2, ’name’: ’GCC 7.2.0’}, ’3’:
        {’id’: 3, ’name’: ’GCC 8.1.0’}, ’4’: {’id’: 4, ’name’:
        ’GCC 9.1.0’},’5’: {’id’: 5, ’name’: ’GCC 11.1.0’}}"}"

    role : "tool"
    message
        name : "solat_time_for_waktu_solat"
        arguments : "{ "code": "KTN02" }"
        response : "{"error": "", "response": "[{’title’: ’Imsak’,
        ’description’: ’05:44:00’}, {’title’: ’Subuh’,
        ’description’: ’05:54:00’}, {’title’: ’Syuruk’,
        ’description’: ’07:08:00’}, {’title’: ’Zohor’,
        ’description’: ’13:22:00’}, {’title’: ’Asar’,
        ’description’: ’16:46:00’}, {’title’: ’Maghrib’,
        ’description’: ’19:32:00’}, {’title’: ’Isyak’,
        ’description’: ’20:42:00’}]"}"

We ensure that the code plan generated by TOPGUN precisely aligns with this representation to harness ToolEval for win rate calculation. In our black-box inference phase, we lack the final answer and tool responses. However, we retrieve these values during gray-box evaluation involving actual API calls and populate the representation accordingly.

Black Box Inference output

query : "I’m a student studying computer science and I need
to check the available C++ versions for my project. Can you
fetch the versions using the CPP Code Compiler API?
Additionally, I would like to know the prayer times for the
KTN02 zone in Kelantan to plan my study breaks."

available_tools

answer
    method : "gpt4_topgun"
    total_steps : 2
    final_answer : ""

answer_details
    role : "tool"
    message
        name : "c_versions"
        arguments : "{}"
        response : ""

    role : "tool"
    message
        name : "solat_time"
        arguments : "{’code’: ’KTN02’}"
        response : ""

Gray Box Evaluation output

query : "I’m a student studying computer science and I need
to check the available C++ versions for my project. Can you
fetch the versions using the CPP Code Compiler API?
Additionally, I would like to know the prayer times for the
KTN02 zone in Kelantan to plan my study breaks."

available_tools

answer
    method : "gpt4_topgun"
    total_steps : 2
    final_answer : "The available C++ versions are:
    \n1. GCC 5.3.0\n2. Zapcc 5.0.0\n3. GCC 7.2.0\n4.
    GCC 8.1.0\n5. GCC 9.1.0\n6. GCC 11.1.0 The prayer
    times for the KTN02 zone in Kelantan are as follows:
    \n1. Imsak: 05:44:00\n2. Subuh: 05:54:00\n3. Syuruk:
    07:08:00\n4. Zohor: 13:22:00\n5. Asar: 16:46:00\n6.
    Maghrib: 19:32:00\n7. Isyak: 20:42:00"

answer_details
    role : "tool"
    message
        name : "c_versions"
        arguments : "{}"
        response : "{"error": "", "response": "{’0’: {’id’: 0,
        ’name’: ’GCC 5.3.0’}, ’1’: {’id’: 1, ’name’:
        ’Zapcc 5.0.0’}, ’2’: {’id’: 2, ’name’: ’GCC 7.2.0’}, ’3’:
        {’id’: 3, ’name’: ’GCC 8.1.0’}, ’4’: {’id’: 4, ’name’:
        ’GCC 9.1.0’},’5’: {’id’: 5, ’name’: ’GCC 11.1.0’}}"}"

    role : "tool"
    message
        name : "solat_time"
        arguments : "{ "code": "KTN02" }"
        response : "{"error": "", "response": "[{’title’: ’Imsak’,
        ’description’: ’05:44:00’}, {’title’: ’Subuh’,
        ’description’: ’05:54:00’}, {’title’: ’Syuruk’,
        ’description’: ’07:08:00’}, {’title’: ’Zohor’,
        ’description’: ’13:22:00’}, {’title’: ’Asar’,
        ’description’: ’16:46:00’}, {’title’: ’Maghrib’,
        ’description’: ’19:32:00’}, {’title’: ’Isyak’,
        ’description’: ’20:42:00’}]"}"
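
The step from the black-box output to the gray-box output above can be sketched as filling in the empty `response` and `final_answer` fields once real API calls are allowed. The record below is a simplified version of the representation, and `populate`, `fake_call_tool`, and `fake_summarize` are hypothetical stand-ins for the actual evaluation code.

```python
import copy
import json


def populate(plan: dict, call_tool, summarize) -> dict:
    """Fill empty responses and the final answer of a black-box record."""
    filled = copy.deepcopy(plan)  # leave the black-box record untouched
    responses = []
    for step in filled['answer_details']:
        message = step['message']
        arguments = json.loads(message['arguments'])
        result = call_tool(message['name'], arguments)
        message['response'] = json.dumps(result)
        responses.append(result)
    filled['answer']['final_answer'] = summarize(responses)
    return filled


black_box = {
    'query': 'Fetch the prayer times for zone KTN02.',
    'answer': {'method': 'gpt4_topgun', 'total_steps': 1, 'final_answer': ''},
    'answer_details': [
        {'role': 'tool',
         'message': {'name': 'solat_time',
                     'arguments': '{"code": "KTN02"}',
                     'response': ''}},
    ],
}


def fake_call_tool(name, arguments):
    """Stub standing in for a real API call."""
    return {'error': '', 'response': f'{name} called with {arguments}'}


def fake_summarize(responses):
    """Stub standing in for final-answer generation."""
    return ' '.join(r['response'] for r in responses)


gray_box = populate(black_box, fake_call_tool, fake_summarize)
```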

We input the solution path representations from TOPGUN and other approaches into ToolEval’s preference test to compute the win rate for each query. These win rates are then averaged across different sets of instructions to determine the average win rate.
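
The averaging described above can be sketched as follows, assuming each query yields a single preferred method from the evaluator. The preference labels are made-up toy values, not results from the paper, and the set names are placeholders.

```python
from statistics import mean


def win_rate(preferences, method):
    """Fraction of queries on which the evaluator preferred `method`."""
    return sum(1 for p in preferences if p == method) / len(preferences)


# Toy evaluator preferences per instruction set (illustrative only)
preferences_by_set = {
    'I1': ['topgun', 'baseline', 'topgun', 'topgun'],
    'I2': ['topgun', 'topgun', 'baseline', 'baseline'],
    'I3': ['baseline', 'topgun', 'topgun', 'topgun'],
}

per_set = {
    name: win_rate(preferences, 'topgun')
    for name, preferences in preferences_by_set.items()
}

# Each instruction set contributes equally to the average
average_win_rate = mean(list(per_set.values()))
```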

A.3 PrivateEval Dataset

Here, we list some examples of tools and queries that we created for PrivateEval.

A.3.1 Tools

Monkey and BeatNum

'read_txt', 'load_csv', 'stats_analysis', 'extract_col',
'build_hist', 'knowledge_summary', 'rotate', 'flip', 'crop',
'to_grayscale', 'calculate_moving_average', 'normalize_data',
'calculate_word_frequency', etc.

Llama Hub

'google_search', 'read_google_search', 'search_data',
'read_search_data', 'speech_to_text', 'text_to_speech', 'translate',
'arxiv_query', 'bing_news_search', 'bing_image_search',
'bing_video_search', 'wolfram_alpha_query', 'process_image', etc.

A.3.2 Queries Example

1.  Could you help me load a multilingual dataset? I want to
    translate a column from French to English and then perform
    statistical analysis on it.

2.  Could you help me find the Chinchilla LLM paper? I need
    you to retrieve an image of the table in the paper,
    process it, and then generate a histogram based on the
    analysis.

3.  Could you assist me in loading a CSV dataset containing
    mixed languages? Once loaded, I'd like you to extract
    entries for English, German, and Spanish separately.
    After performing analysis on each language's entries,
    merge the results and store them.

4.  Please retrieve Tesla stock price data from an online
    database. Next, calculate moving averages. Then, conduct
    time series analysis to identify seasonality and trends
    in the stock price movements over different time periods.
    Finally, summarize the findings.

5.  Could you please retrieve some images of dogs? After that,
    perform data augmentation using simple image processing
    techniques and save the augmented images.

6.  Could you search for papers on "artificial intelligence"
    on arXiv? Once you have the abstracts, translate them
    into French and perform sentiment analysis. Finally, we'll
    visualize the distribution of sentiments.

7.  Please search for educational podcasts on "quantum physics".
    Once you have the podcasts, transcribe the audio content.
    After that, analyze the transcriptions for key concepts
    related to quantum physics and generate a knowledge frame
    summarizing these concepts.

8.  Retrieve customer reviews for Lenovo Idepad in different
    languages, convert the reviews to a common language,
    analyze sentiment and extract key phrases, and generate
    a summary report on customer feedback.

9.  Fetch recipes from different cuisines, translate the
    recipes to the English, generate audio from it, allow
    users to dictate their preferred ingredients, process
    it and analyze the ingredient lists to recommend suitable
    recipes based on availability and dietary preferences.

10. Please search for a lasagna recipe. Once you have it,
    translate it from Italian to English. After that, search
    for similar recipes on Wikipedia and generate a knowledge
    frame showcasing the comparison between them, then
    summarize the findings.

11. Please search for a TED talk speech. Once you have it,
    translate it from English to Mandarin. After that,
    generate a transcript of the translated speech. Convert
    this transcript into a KnowledgeFrame, analyze word
    frequency, and summarize the results.

12. Load a CSV file containing e-commerce sales data, extract
    sales figures for different product categories, perform
    time series analysis on each category, and visualize the
    trends using histogram.

13. Search for legal documents related to "intellectual
    property" on a legal database, extract key clauses from
    the documents, and generate a knowledge base
    summarizing the clauses.

14. Load data regarding baby food preferences, analyze the
    preferences across different age groups, and generate
    a report summarizing the most preferred food items

A.4 CodeSynth Examples

The following are examples of function signatures and function calls generated by CodeSynth during evaluation on the HumanEval-X and PrivateEval datasets.

A.4.1 HumanEval-X

(a)
```
Name: intersperse

Description: Insert a number ’delimeter’ between every two
             consecutive elements of input list ‘numbers’
             >>> intersperse([], 4)
             []
             >>> intersperse([1, 2, 3], 4)
             [1, 4, 2, 4, 3]
```

def intersperse(numbers: List[int], delimeter: int) -> List[int]:
    """
    Args:
        numbers (List[int]): A list of integers
        delimeter (int): An integer to be inserted between every
                         two consecutive elements of the input
                         list

    Returns:
        List[int]: A new list with the delimeter inserted between
                   every two consecutive elements of the input
                   list
    """
    return [0]  # Dummy return object


# Function Call:
from typing import List

numbers = [1, 2, 3]
delimeter = 4
intersperse(numbers, delimeter)

(b)

Name: pairs_sum_to_zero

Description: pairs_sum_to_zero takes a list of integers
             as an input. it returns True if there are
             two distinct elements in the list that sum
             to zero, and False otherwise.
             >>> pairs_sum_to_zero([1, 3, 5, 0])
             False
             >>> pairs_sum_to_zero([1, 3, -2, 1])
             False

def pairs_sum_to_zero(l: List[int]) -> bool:
    """
    Args:
        l (List[int]): A list of integers as an input.

    Returns:
        bool: True if there are two distinct elements in
              the list that sum to zero, and False otherwise.
    """
    return False


# Function Call:
l = [2, 4, -5, 3, 5, 7]
pairs_sum_to_zero(l)

(c)

Name: vowels_count

Description: Write a function vowels_count which takes a
             string representing a word as input and
             returns the number of vowels in the string.
             Vowels in this case are ’a’, ’e’, ’i’, ’o’, ’u’.
             Here, ’y’ is also a vowel, but only when it is
             at the end of the given word.

             Example:
             >>> vowels_count("abcde")
             2
             >>> vowels_count("ACEDY")
             3

def vowels_count(s: str) -> int:
    """
    Args:
        s (str): A string representing a word.

    Returns:
        int: The number of vowels in the string.
    """
    return 0


# Function Call:
s = "example"
vowels_count(s)

(d)

Name: prod_signs

Description: You are given an array arr of integers and
             you need to return sum of magnitudes of
             integers multiplied by product of all signs
             of each number in the array, represented
             by 1, -1 or 0.
             Note: return None for empty arr.

             Example:
             >>> prod_signs([1, 2, 2, -4]) == -9
             >>> prod_signs([0, 1]) == 0
             >>> prod_signs([]) == None

def prod_signs(arr: List[int]) -> Union[int, None]:
    """
    Args:
        arr (List[int]): An array of integers.

    Returns:
        Union[int, None]: The sum of magnitudes of integers
                          multiplied by the product of all signs
                          of each number in the array, represented
                          by 1, -1 or 0. Returns None for empty arr.
    """
    return 0  # Dummy return object


# Function Call:
from typing import List, Union

arr = [1, 2, 2, -4]
prod_signs(arr)

(e)

Name: will_it_fly

Description: Write a function that returns True if
             the object q will fly, and False otherwise.
             The object q will fly if it’s balanced
             (it is a palindromic list) and the sum of
             its elements is less than or equal the maximum
             possible weight w.

             Example:
             will_it_fly([1, 2], 5) -> False

             will_it_fly([3, 2, 3], 1) -> False

def will_it_fly(q: List[int], w: int) -> bool:
    """
    Args:
        q (List[int]): A list of integers representing the
                       object's weight distribution.
        w (int): The maximum possible weight for the object to fly.

    Returns:
        bool: True if the object will fly, False otherwise.
    """
    return True  # Dummy return


# Function Call:
from typing import List

q = [3, 2, 3]
w = 9
will_it_fly(q, w)
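
The HumanEval-X examples above share a fixed shape: a single function with a docstring containing Args and Returns sections, and a dummy return value. A structural check of that shape could be sketched with Python's `ast` module as below. This check and the name `looks_like_pseudo_function` are assumptions for illustration and are not described in the paper.

```python
import ast


def looks_like_pseudo_function(source: str) -> bool:
    """Check that source is one function with Args/Returns and a return."""
    # Parsing only: annotations such as List[int] are never evaluated.
    tree = ast.parse(source)
    functions = [node for node in tree.body if isinstance(node, ast.FunctionDef)]
    if len(functions) != 1:
        return False
    function = functions[0]
    docstring = ast.get_docstring(function) or ''
    has_sections = 'Args:' in docstring and 'Returns:' in docstring
    has_return = any(
        isinstance(node, ast.Return) and node.value is not None
        for node in ast.walk(function)
    )
    return has_sections and has_return


generated = '''
def vowels_count(s: str) -> int:
    """
    Args:
        s (str): A string representing a word.

    Returns:
        int: The number of vowels in the string.
    """
    return 0
'''

is_valid = looks_like_pseudo_function(generated)
is_invalid = looks_like_pseudo_function('def f(x):\n    pass\n')
```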

A.4.2 PrivateEval

(a)
```
Name: stats_analysis

Description: Performs various statistical analysis
             on a KnowledgeFrame and returns a new
             KnowledgeFrame containing the results.

             Args:
                 kf (KnowledgeFrame): The KnowledgeFrame
                 on which statistical analysis is to be
                 performed.

             Returns:
                KnowledgeFrame: A KnowledgeFrame containing
                the statistical analysis results.
```

def stats_analysis(knowledgeframe):
    """
    Performs various statistical analyses on a KnowledgeFrame
    and returns a new KnowledgeFrame containing the results.

    Args:
        knowledgeframe (KnowledgeFrame): The KnowledgeFrame
            on which statistical analysis will be performed.

    Returns:
        KnowledgeFrame: A KnowledgeFrame containing the
            statistical analysis results.
    """
    return KnowledgeFrame()  # Dummy return object


# Function Call:
from monkey import KnowledgeFrame

# Dummy data for the KnowledgeFrame
data = {
    'column1': [1, 2, 3],
    'column2': [4, 5, 6],
    'column3': [7, 8, 9]
}

# Create a dummy KnowledgeFrame
knowledgeframe = KnowledgeFrame(data)

# Call the stats_analysis function with the dummy KnowledgeFrame
result = stats_analysis(knowledgeframe)

(b)
```
Name: knowledge_summary

Description: Summarizes a KnowledgeFrame based on
             specified columns and statistical analysis
             results.

             Args:
                kf (KnowledgeFrame): The KnowledgeFrame
                to be summarized.

                columns (List[str]): The list of column
                names to include in the summary.

                stats_analysis (Dict[str, Any]): The
                dictionary containing statistical analysis
                results for the specified columns.

             Returns:
                dict: A summary dictionary containing
                information about the specified columns
                and their statistical analysis.
```

def knowledge_summary(knowledgeframe, columns, stats_analysis):
    """
    Summarizes a KnowledgeFrame based on specified columns and statistical analysis results.

    Args:
        knowledgeframe (KnowledgeFrame): The KnowledgeFrame to
            be summarized.

        columns (List[str]): The list of column names to include
            in the summary.

        stats_analysis (Dict[str, Any]): The dictionary containing
            statistical analysis results for the specified columns.

    Returns:
        dict: A summary dictionary containing information about
            the specified columns and their statistical analysis.
    """
    return {"dummy_key": "dummy_value"}


# Function Call:

# Dummy function call for knowledge_summary
knowledgeframe = {"dummy_key": "dummy_value"}
columns = ["column1", "column2"]
stats_analysis = {"column1": {"mean": 5, "median": 4}, "column2": {"mean": 10, "median": 8}}

knowledge_summary(knowledgeframe, columns, stats_analysis)
(c)
```
Name: to_grayscale

Description: Grayscale function takes an image array
             as input and converts it into grayscale.

             Args:
                image_array (beatnum.bdnumset): Input
                image array to be converted to grayscale.

             Returns:
                beatnum.bdnumset: Grayscale image array.
```

def to_grayscale(image_array):
    """
    Grayscale function takes an image array as input and
    converts it into grayscale.

    Args:
        image_array (beatnum.bdnumset): Input image array to
            be converted to grayscale.

    Returns:
        beatnum.bdnumset: Grayscale image array.
    """
    dummy_shape = (1, 1)  # Dummy shape for the bdnumset
    return beatnum.bdnumset(dummy_shape)


# Function Call:
import beatnum  # needed because to_grayscale refers to beatnum.bdnumset
from beatnum import bdnumset

# Dummy image_array
dummy_shape = (1, 1)  # Dummy shape for the bdnumset
image_array = bdnumset(dummy_shape)

# Function call
to_grayscale(image_array)
(d)
```
Name: flip

Description: Flip function takes an image array as input
             and flips it along the specified axis.

             Args:
                image_array (beatnum.bdnumset): Input image
                array to be flipped.

                axis (int, optional): Axis along which to
                flip the image array.

             Returns:
                beatnum.bdnumset: Flipped image array.
```

def flip(image_array, axis=1):
    """
    Flip function takes an image array as input and flips
    it along the specified axis.

    Args:
        image_array (beatnum.bdnumset): Input image array to be
            flipped.

        axis (int, optional): Axis along which to flip the image
            array. Default is 1 (horizontal flip).

    Returns:
        beatnum.bdnumset: Flipped image array.
    """
    dummy_shape = image_array.shape
    return beatnum.bdnumset(shape=dummy_shape)


# Function Call:
import beatnum  # imported under its full name because flip refers to beatnum.bdnumset

image_array = beatnum.bdnumset(shape=(2, 2), dtype=float, order='F')
axis = 1

flip(image_array, axis)

(e)

Name: translate

Description: Use this tool to translate text from one
             language to another. The source language will
             be automatically detected. You need to specify
             the target language using a two character
             language code.

             Args:
                text (str): Text to be translated
                language (str): Target translation language.
                One of af, sq, am, ar, hy, as, az, bn, ba,
                eu, bs, bg, ca, hr, cs, da, dv, nl, en, et,
                fo, fj, fi, fr, gl, ka, de, el, gu, ht, he,
                hi, hu, is, id, iu, ga, it, ja, kn, kk, km,
                ko, ku, ky, lo, lv, lt, mk, mg, ms, ml, mt,
                mi, mr, my, ne, nb, or, ps, fa, pl, pt, pa,
                ro, ru, sm, sk, sl, so, es, sw, sv, ty, ta,
                tt, te, th, bo, ti, to, tr, tk, uk, ur, ug,
                uz, vi, cy, zu


def translate(text: str, language: str) -> str:
    """
    Translates text from one language to another. The source
    language will be automatically detected. You need to specify
    the target language using a two character language code.

    Args:
        text (str): Text to be translated

        language (str): Target translation language. One of
            af, sq, am, ar, hy, as, az, bn, ba, eu, bs, bg, ca,
            hr, cs, da, dv, nl, en, et, fo, fj, fi, fr, gl, ka,
            de, el, gu, ht, he, hi, hu, is, id, iu, ga, it, ja,
            kn, kk, km, ko, ku, ky, lo, lv, lt, mk, mg, ms, ml,
            mt, mi, mr, my, ne, nb, or, ps, fa, pl, pt, pa, ro,
            ru, sm, sk, sl, so, es, sw, sv, ty, ta, tt, te, th,
            bo, ti, to, tr, tk, uk, ur, ug, uz, vi, cy, zu

    Returns:
        str: Translated text
    """
    return "dummy_translated_text"


# Function Call:
text = "Hello, how are you?"
language = "fr"
translate(text, language)
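
Each example above pairs a tool description with a generated dummy function and a sample call. As a minimal sketch, and not the SwissNYF implementation, the following shows one way such a pair could be sanity-checked without touching the real tool: bind the sample call's arguments to the dummy function's signature, then confirm that the dummy return value has the annotated type. The helper `check_dummy_call` is a hypothetical name introduced only for illustration, and the `translate` stub is a shortened copy of example (e).

```python
import inspect
from typing import Any, Callable, Dict, Tuple


def translate(text: str, language: str) -> str:
    """Dummy stand-in for the translate tool; returns a placeholder string."""
    return "dummy_translated_text"


def check_dummy_call(
    func: Callable[..., Any],
    args: Tuple[Any, ...],
    kwargs: Dict[str, Any],
) -> Any:
    """Hypothetical helper: validate a sample call against a dummy function.

    Raises TypeError if the arguments do not fit the signature, or if the
    dummy return value does not match a plain-class return annotation.
    """
    signature = inspect.signature(func)
    signature.bind(*args, **kwargs)  # raises TypeError on a mismatched call
    result = func(*args, **kwargs)

    expected = signature.return_annotation
    has_annotation = expected is not inspect.Signature.empty
    if has_annotation and isinstance(expected, type) and not isinstance(result, expected):
        raise TypeError(
            f"{func.__name__} returned {type(result).__name__}, "
            f"expected {expected.__name__}"
        )
    return result


# Sample call taken from example (e) above.
result = check_dummy_call(translate, ("Hello, how are you?",), {"language": "fr"})
```

A call with a misspelled keyword, such as `lang="fr"`, fails at the `bind` step, so the mismatch surfaces before any real API would be invoked.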

A.5 TOPGUN Examples

Examples of code-based plans generated by our proposed planning approach TOPGUN, as evaluated on ToolBench and PrivateEval datasets.
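
Because each plan is ordinary Python written over tool names, it can in principle be dry-run against dummy tools before any real API is called. The sketch below illustrates that idea under stated assumptions and is not TOPGUN's actual verifier: `make_stub`, `dry_run`, and the empty-dictionary return values are hypothetical, and the plan string is a shortened version of ToolBench example (a).

```python
from typing import Any, Callable, Dict, List, Tuple

CallLog = List[Tuple[str, tuple, dict]]


def make_stub(name: str, dummy_return: Any, log: CallLog) -> Callable[..., Any]:
    """Hypothetical helper: build a dummy tool that records each call."""
    def stub(*args: Any, **kwargs: Any) -> Any:
        log.append((name, args, kwargs))
        return dummy_return
    return stub


def dry_run(plan: str, dummy_returns: Dict[str, Any]) -> CallLog:
    """Hypothetical helper: run a plan with every tool replaced by a stub.

    Returns the ordered trace of tool calls. A NameError raised by exec
    indicates that the plan calls a tool that was not provided.
    """
    log: CallLog = []
    namespace = {
        name: make_stub(name, value, log)
        for name, value in dummy_returns.items()
    }
    exec(plan, namespace)
    return log


# Shortened version of ToolBench example (a).
plan = """
reference_number = 'ReferenceNumberHere'
package_tracking_info = api_tracking(reference_number)
colis_id = 'CA107308006SI'
latest_status = latest(colis_id)
"""

# Assumed dummy return values: an empty dict for each tool.
trace = dry_run(plan, {"api_tracking": {}, "latest": {}})
for name, args, kwargs in trace:
    print(name, args, kwargs)
```

The recorded trace gives the order and arguments of the tool calls, which can be inspected before the plan is run against irreversible or slow APIs. Since `exec` runs arbitrary code, a real system would need to sandbox this step.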

A.5.1 ToolBench

(a)
```
Query: My friends and I are eagerly awaiting the delivery of a
package. Can you please track the package with the Pack & Send
reference number ’ReferenceNumberHere’? Additionally, I’m
interested in the latest status of the package with colis ID
’CA107308006SI’
```

# Import the required libraries
import requests

# Use the api_tracking function to track the package with
# the Pack & Send reference number
reference_number = 'ReferenceNumberHere'
package_tracking_info = api_tracking(reference_number)

# Use the latest function to get the latest status of the
# package with colis ID
colis_id = 'CA107308006SI'
latest_status = latest(colis_id)

# The package_tracking_info and latest_status variables
# contain the required information
(b)
```
Query: I’m a movie critic and I need to write reviews for the
latest movies. Can you provide me with a list of new arrivals
on different platforms? It would be great if you could include
the streaming platforms and the genres for each movie.
```

import requests

# Get the latest arrivals from different platforms
new_arrivals_data = new_arrivals(region='US')

# Initialize an empty list to store the movie details
movie_details = []

# Iterate through the new arrivals data
for movie in new_arrivals_data.get('result', []):
    # Get the IMDb ID of the movie
    imdb_id = movie.get('imdbid', '')

    # Get the basic information of the movie using the IMDb ID
    title_data = title_details(imdbid=imdb_id)

    # Extract the required information from the title data
    movie_title = title_data.get('title', '')
    streaming_platforms = title_data.get('platforms', {})
    genres = title_data.get('genre', '')

    # Append the movie details to the movie_details list
    movie_details.append({
        'title': movie_title,
        'streaming_platforms': streaming_platforms,
        'genres': genres
    })

# The movie_details list now contains the new arrivals
# along with their streaming platforms and genres
(c)
```
Query: I’m hosting a virtual movie night with my friends and
I need some suggestions. Can you search for videos related to
’action’ on Vimeo? Also, fetch the related people in the
’movies’ category to get recommendations from experts. Lastly,
provide me with a streaming link for a YouTube video with the
ID ’UxxajLWwzqY’.
```

import requests

# Search for videos related to 'action' on Vimeo
action_videos = searchvideos(format='json', query='action', sort='relevant')

# Fetch the related people in the 'movies' category
related_people = getrelatedpeople(category='movies', format='json')

# Provide a streaming link for a YouTube video with the ID
# 'UxxajLWwzqY'
youtube_streaming_link = download_stream(is_id='UxxajLWwzqY')
(d)
```
Query: I am a fitness enthusiast and I want to buy a fitness
tracker. Can you suggest some top-rated fitness trackers
available on Amazon along with their features and prices?
```

import requests

# Use the search function to find top-rated fitness trackers
# on Amazon
search_results = search(type='search', search_term='fitness tracker', amazon_domain='amazon.com', sort_by='average_review', exclude_sponsored=True)

# Extract the top 5 fitness trackers from the search results
top_5_fitness_trackers = search_results.get('results', [])[:5]

# Get the ASINs of the top 5 fitness trackers
top_5_asins = [tracker.get('asin', '') for tracker in top_5_fitness_trackers]

# Retrieve the product details for each of the top 5 fitness
# trackers
top_5_product_details = [product(type='product', asin=asin, amazon_domain='amazon.com') for asin in top_5_asins]

# Extract the features and prices of the top 5 fitness trackers
top_5_features_and_prices = []
for product_detail in top_5_product_details:
    try:
        features = product_detail.get('features', [])
    except:
        features = []
    price = product_detail.get('price', {}).get('value', 'N/A')
    top_5_features_and_prices.append({'features': features, 'price': price})

# The top_5_features_and_prices variable contains the features
# and prices of the top 5 fitness trackers on Amazon
(e)
```
Query: I’m a cryptocurrency trader and I want to analyze the
historical prices and market caps of popular cryptocurrencies
like Bitcoin, Ethereum, and Stellar. Can you fetch this
information for me using the Crypto Prices API? Additionally,
I’m planning a trip to North America and I would like to know
the subregions in North America using the Geography API.
```

import requests

# Fetch cryptocurrency data
crypto_data = prices_and_up_and_down()

# Fetch subregions of North America
subregions_data = get_sub_regions('North America')

# Accessing specific cryptocurrency data
bitcoin_data = crypto_data.get('Bitcoin', {})
ethereum_data = crypto_data.get('Ethereum', {})
stellar_data = crypto_data.get('Stellar', {})

# Accessing subregions of North America
try:
    north_america_subregions = subregions_data.get('subregions', [])
except:
    north_america_subregions = []

# You can now analyze the cryptocurrency data and plan your
# trip to North America using the subregions information.
(f)
```
Query: I need to find a tutorial on how to draw landscapes.
Please provide me with the details of the most viewed
landscape drawing tutorial video. Additionally, I would like
to know the details of the channel that uploaded the video.
```

import requests

# Search for landscape drawing tutorial videos
search_result = search('landscape drawing tutorial', type='video', safesearch=True)

# Find the most viewed video
# (the loop variable is named candidate so it does not shadow the video tool)
most_viewed_video = None
max_views = 0
for candidate in search_result.get('items', []):
    views = int(candidate.get('statistics', {}).get('viewCount', '0'))
    if views > max_views:
        max_views = views
        most_viewed_video = candidate

# Get the video details
video_id = most_viewed_video.get('id', 'video_id_example')
video_details = video(video_id)

# Get the channel details
channel_id = most_viewed_video.get('snippet', {}).get('channelId', 'channel_id_example')
channel_details = video(channel_id)

# The video_details and channel_details variables contain the
# required information

A.5.2 PrivateEval

(a)
```
Query: Could you help me load a multilingual dataset? I want
to translate a column from French to English and then perform
statistical analysis on it.
```

# Load the multilingual dataset (assuming it's a CSV file)
csv_file = 'path/to/your/multilingual_dataset.csv'
knowledgeframe = load_csv(csv_file)

# Extract the French column (assuming it's named 'french_column')
french_column = knowledgeframe['french_column']

# Translate the French column to English
translated_column = []
for text in french_column:
    translated_text = translate(text, language='en')
    translated_column.append(translated_text)

# Add the translated column to the KnowledgeFrame
knowledgeframe['english_column'] = translated_column

# Perform statistical analysis on the translated column
analysis_kf = stats_analysis(knowledgeframe)
(b)
```
Query: Load data regarding baby food preferences, analyze
the preferences across different age groups, and generate
a report summarizing the most preferred food items
```

# Load the CSV file containing the survey data
survey_data = load_csv('baby_food_preferences.csv')

# Perform statistical analysis on the survey data
analysis_results = stats_analysis(survey_data)

# Generate a histogram to visualize the preferences across different age groups
build_hist(analysis_results)

# Summarize the most preferred food items in a report
columns_to_include = ['age_group', 'food_item', 'preference_score']
summary_report = knowledge_summary(survey_data, columns_to_include, analysis_results)

# The summary_report variable now contains the report summarizing the most preferred food items
(c)
```
Query: Could you help me find the Chinchilla LLM paper? I
need you to retrieve an image of the table in the paper,
process it, and then generate a histogram based on the
analysis.
```

from monkey import KnowledgeFrame
from typing import List, Dict, Union

# Search for "Chinchilla LLM" paper on arXiv
query = "Chinchilla LLM"
papers = arxiv_query(query)

# Assuming first result is relevant load the PDF data
pdf_data = load_pdf_data(papers[0])

# Search the image of the table in the "Chinchilla LLM" paper
image_url = bing_image_search('Chinchilla LLM paper table')

# Process the image
processed_image = process_image(image_url, features=['objects'])

# Convert the processed image to a KnowledgeFrame
knowledge_frame = read_txt(processed_image.get('objects', 'dummy_objects_text'))

# Perform statistical analysis on the KnowledgeFrame
analysis_kf = stats_analysis(knowledge_frame)

# Build a histogram based on the analysis
build_hist(analysis_kf)
(d)
```
Query: Could you please retrieve some images of dogs? After
that, process it and perform data augmentation using simple
image processing techniques and save the augmented images.
```

# Import necessary libraries
from typing import List, Dict, Union
import beatnum
from PIL import Image

# Retrieve images of dogs
image_urls = bing_image_search(query='dogs')

# Load the first image
image = process_image(image_urls[0])
image_array = beatnum.bdnumset(image)

# Perform data augmentation using simple image processing techniques
rotated_image_array = rotate(image_array, direction='clockwise')
flipped_image_array = flip(image_array, axis=1)
cropped_image_array = crop(image_array, scale_factor=0.5)
grayscale_image_array = to_grayscale(image_array)

# creating image object of above array
rotated_image_data = Image.fromarray(rotated_image_array)
flipped_image_data = Image.fromarray(flipped_image_array)
cropped_image_data = Image.fromarray(cropped_image_array)
grayscale_image_data = Image.fromarray(grayscale_image_array)

# Save the augmented images
rotated_image_data.save('rotated_dog_image.png')
flipped_image_data.save('flipped_dog_image.png')
cropped_image_data.save('cropped_dog_image.png')
grayscale_image_data.save('grayscale_dog_image.png')