Modelscope-Agent: Building Your Customizable Agent System With Open-Source Large Language Models

∗ Corresponding author: <ym119608@alibaba-inc.com>
Abstract

Large language models (LLMs) have recently demonstrated remarkable capabilities to comprehend human intentions, engage in reasoning, and design planning-like behavior. To further unleash the power of LLMs to accomplish complex tasks, there is a growing trend to build agent frameworks that equip LLMs, such as ChatGPT, with tool-use abilities to connect with massive external APIs.

In this work, we introduce ModelScope-Agent, a general and customizable agent framework for real-world applications, based on open-source LLMs as controllers. It provides a user-friendly system library, with a customizable engine design to support model training on multiple open-source LLMs, while also enabling seamless integration with both model APIs and common APIs in a unified way. To equip the LLMs with tool-use abilities, a comprehensive framework has been proposed spanning tool-use data collection, tool retrieval, tool registration, memory control, customized model training, and evaluation for practical real-world applications. Finally, we showcase ModelScopeGPT, a real-world intelligent assistant of the ModelScope Community based on the ModelScope-Agent framework, which is able to connect open-source LLMs with more than 1000 public AI models and localized community knowledge in ModelScope. The ModelScope-Agent library (https://github.com/modelscope/modelscope-agent) and online demo (https://modelscope.cn/studios/damo/ModelScopeGPT/summary) are now publicly available.

1 Introduction

Large language models (OpenAI, 2022, 2023; Touvron et al., 2023; Chowdhery et al., 2022) have gradually become common AI assistants that demonstrate great potential in comprehending human intentions, performing complex reasoning tasks, and enabling content creation. Despite this rapid progress, they still remain limited in complex tasks such as following user instructions to use external tools and capturing up-to-date information.

To further unleash the power of LLMs for real-world practical applications, a rising trend of current research (Schick et al., 2023; Shen et al., 2023; Yang et al., 2023; Qin et al., 2023; Patil et al., 2023) begins to enable LLMs with tool-use abilities towards building an AI Agent. These include HuggingGPT (Shen et al., 2023), Visual-ChatGPT (Wu et al., 2023) and Gorilla (Patil et al., 2023) for connecting with HuggingFace models, and ToolAlpaca (Tang et al., 2023) and ToolLLaMA (Qin et al., 2023) for using massive common APIs such as weather forecast and search engines. These methods either directly rely on closed-source counterparts like ChatGPT or focus on certain types of API tools. Recently, there have also been public releases of AI agents, such as Auto-GPT (https://github.com/Significant-Gravitas/Auto-GPT), LangChain (https://github.com/langchain-ai/langchain) and Transformers Agent (Huggingface, 2023), which enable LLMs, such as ChatGPT or GPT-4, to use tools and solve complex AI tasks. However, these agents are mainly built with closed-source LLMs, and how to build a customizable agent system with open-source LLMs remains largely unexplored.

In this work, we present ModelScope-Agent, a general and customizable agent system for real-world applications, based on open-source LLMs as controllers. ModelScope (https://modelscope.cn/models) is a public ML community, which seeks to bring together the most advanced machine learning models from the AI community, and streamlines the process of leveraging AI models in real-world applications. ModelScope-Agent provides a flexible and user-friendly system library, with a customizable engine design to
support model training on multiple open-source LLMs, while also enabling seamless integration with both model APIs and common APIs in a unified way. It features an LLM-centric system design, which includes open-source LLMs as the core controller that further interacts with a tool-use module and a memory module to accomplish complex tasks. At the core of ModelScope-Agent, the library supports flexible selection and training of various open-source LLMs, such as LLaMA (Touvron et al., 2023), ChatGLM (THUDM, 2023), ChatPLUG (Tian et al., 2023) and other customized LLMs in ModelScope. For tool use, ModelScope-Agent provides a default tool library, which supports diverse AI model APIs across the NLP, CV, Audio and Multi-modal fields, as well as massive common APIs such as search engines. It also supports registering new self-defined API plugins and automatic API retrieval from the large tool library. It is easy for users to customize their most appropriate LLMs, local API tools and functions to develop real-world applications. Moreover, a memory module is also introduced to better store and manage the system message, user history, in-context examples, tool messages and localized knowledge.

To enable the open-source LLMs to better control the whole agent system, we further propose a comprehensive framework spanning tool-use data collection, customized model training, evaluation and deployment. Notably, we release a comprehensive tool-enhanced dataset, MSAgent-Bench, which consists of 598k dialogues covering various API categories, multi-turn API calls, API-oriented QA, and API-agnostic instructions in both English and Chinese. A simple training strategy, Weighted LM, which enhances the training of the generation of API names and parameters, is used to better ensure the correctness of API calls. Besides, an evaluation framework is also supported in our library to examine the tool-use abilities of the trained models in different aspects. Furthermore, we applied ModelScope-Agent in a real-world application of the ModelScope Community, namely ModelScopeGPT, which is able to connect open-source LLMs with more than 1000 public AI models and access localized community knowledge in ModelScope for community QA.

To summarize, ModelScope-Agent is a general and customizable agent system designed for developers to harness the power of open-source LLMs. The library targets the following goals:

• Agent based on Open-Source LLMs: the controller of ModelScope-Agent can be flexibly selected from open-source LLMs that are optimized through our agent training framework.

• Support and Customization of Diverse Tools: dozens of diverse model APIs and common APIs are given by default. The library supports registering new self-defined APIs and automatic API retrieval from the toolset.

• Customizable Applications: ModelScope-Agent can be flexibly applied in various industry applications. The agent and training framework are documented, describing their usage, construction and optimization.

ModelScope-Agent is in continual development by the engineers at ModelScope and is released under an Apache 2.0 license. Full documentation is available through the project website.

2 The ModelScope Agent

ModelScope-Agent is designed to facilitate developers in building customizable agent systems based on open-source LLMs. The overall system architecture is shown in Figure 1. It includes open-source LLMs as the controller, plus a tool-use module and a memory module to interact with. Given a human instruction, the Agent, which adopts the selected LLM as the controller, will automatically plan tasks, selectively use tools, leverage knowledge in memory, and finally provide a helpful response to the user.

Figure 1: The overall system architecture of ModelScope-Agent.

2.1 LLMs as Brain

LLMs serve as the brain of the agent, responsible for planning and decomposing user requests, selectively calling tools, performing retrieval, and integrating all the information from previous steps to generate the final response. In order to make it easier for users to customize the agent with their own LLMs, we have added support for various open-source LLMs by default, such as LLaMA, ChatGLM and ChatPLUG, which have been optimized through our tool learning pipeline. The details of the training strategy and tool-use datasets are described in Section 3. ModelScope-Agent has integrated the LLM inference pipeline of the ModelScope community, and replacing the LLM can be done by simply setting model_name and model_config. In model_config, the model_id, model_revision, and model parameter settings such as the maximum sequence length should be configured.
# Load the LLM config from "cfg_file" and build a local LLM controller
from modelscope.utils.config import Config

model_cfg = Config.from_file(cfg_file)
llm = LocalLLM(model_name, model_cfg)
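For illustration, a minimal sketch of what such a configuration might contain is shown below; the concrete keys and values are hypothetical and only follow the fields named above (model_id, model_revision, and generation settings), not necessarily the library's exact schema:

# Hypothetical configuration values, for illustration only.
model_name = 'modelscope-agent-7b'           # assumed model alias
model_config = {
    'model_id': 'damo/ModelScope-Agent-7B',  # assumed ModelScope model id
    'model_revision': 'v1.0.0',              # assumed revision tag
    'max_length': 2048,                      # maximum sequence length
}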
Furthermore, ModelScope-Agent also provides a standard way to integrate new LLMs. Users can add their own LLMs by integrating the LLM pipeline into ModelScope. After that, the agent can select the new LLMs for training and inference.

2.2 Tool Use

Tool Library The tool library is used to configure and manage the various collections of APIs used in the agent. ModelScope-Agent can support a wide range of both common APIs, such as search APIs, and AI model APIs across NLP, CV, Audio and Multi-modal models in ModelScope and HuggingFace. Each tool API consists of the API name, description, parameters and request functions. Users can easily choose and configure proper APIs in the library to build their own agent. The default APIs supported in the library are listed in Appendix A.1.

# Load the default tool config file "default_file"
tool_cfg = Config.from_file(default_file)

Register and Customize New Tool The agent allows users to register and customize new tools, while also supporting quick integration of newly registered tools into the agent, enabling LLMs to selectively use the additional self-defined tools for specific applications. This can be simply done by inheriting from a base class, namely Tool, and defining a new CustomTool with the API-related schema of API name, description, parameters, and request functions. More details about CustomTool can be found in Appendix A.2.

from modelscope_agent.tools import Tool

class CustomTool(Tool):
    # logic added here
    # refer to the example in Appendix A.2
    ...

tool_list = {'custom-tool': CustomTool()}
Tool Retrieval and Execution Due to the large number of tool APIs in the tool library, a tool retrieval module is further introduced to recommend appropriate APIs for each instruction prompt. Specifically, we use a dense vector retrieval method based on the unified multilingual text-embedding API (https://help.aliyun.com/zh/dashscope/getting-started-1). We vectorize both the text descriptions of the APIs and the instruction prompt using the text-embedding API. The top-3 most relevant APIs with the highest vector product scores are selected for tool use. As a result, the schema information of the retrieved APIs will be concatenated with other system prompts in the subsequent memory module and sent to the LLMs as input. With the concatenated instruction prompt, the LLMs will plan and generate the API request, which will be executed by the agent. The agent will then return the results to the LLMs for continuous generation.
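A minimal sketch of this retrieval step is given below; the embed() callable stands in for the text-embedding API, and the helper name is an illustrative assumption rather than the library's actual interface:

import numpy as np

def retrieve_top_tools(query, tool_descriptions, embed, k=3):
    # Score each API description against the instruction prompt by the
    # vector product of their embeddings, and keep the top-k (top-3 here).
    query_vec = np.asarray(embed(query))
    scores = {
        name: float(np.dot(query_vec, np.asarray(embed(desc))))
        for name, desc in tool_descriptions.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]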
2.3 Memory Control

The memory module is used to retrieve and assemble a series of contextual information as input to the LLMs. It consists of a knowledge retrieval submodule and a prompt generator submodule, which are responsible for external knowledge retrieval and instruction prompt generation, respectively.

Knowledge Retrieval This enables the agent to get access to up-to-date and localized information related to the query prompt, thereby augmenting the LLMs with dynamic and domain-specific knowledge. We follow the same dense vector retrieval method as the previous tool retrieval module, and support large-scale knowledge retrieval from a localized document corpus. Similarly, it allows users to customize by switching to other open-source retrieval frameworks.

Prompt Generator The prompt generator is used to assemble all available contextual information, such as the system prompt, API schema, retrieved knowledge, conversation history, and few-shot examples. According to the type of user query and the maximum length of the LLM, users can selectively choose proper contextual information and assemble the required input to the LLM. In our agent, the prompt generator needs to be defined before the agent is constructed.
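As a rough sketch of this assembly step (the function and its truncation policy are illustrative assumptions, not the library's implementation):

def generate_prompt(system_prompt, api_schemas, knowledge, history,
                    few_shot_examples, max_length):
    # Assemble the fixed context first, then append conversation history,
    # dropping the oldest turns when the LLM's maximum length is exceeded.
    fixed = [system_prompt, *few_shot_examples, *api_schemas, *knowledge]
    history = list(history)
    prompt = '\n'.join(fixed + history)
    while history and len(prompt) > max_length:
        history.pop(0)
        prompt = '\n'.join(fixed + history)
    return prompt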
2.4 Agent Pipeline

In summary, we build the agent by combining all the modules: the LLM controller, the tool-use module, and the memory module. With agent.run, the agent can efficiently execute and complete an instruction in a one-step generation. First, the agent retrieves query-related tools through tool retrieval and combines the retrieved API schemas with other contextual prompts in the memory module to construct a new instruction prompt. Then, the agent sends this new prompt to the LLM, which plans whether and which API to call and generates an API request. Next, the agent executes the selected API with the extracted API parameters and returns the API results to the LLM, which continues to plan whether to call other APIs. If another API call is needed, the process is repeated; otherwise, the LLM generates the final response and the agent returns the final result to the user.

agent = AgentExecutor(llm, tool_cfg,
                      additional_tool_list=tool_list)
agent.run("Draw a logo image of agent")
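The loop described above can be summarized by the following sketch. AgentExecutor's internals are paraphrased here, and the helper names (tools.retrieve, tools.execute, memory.build_prompt, and the toy parse_api_request with its <api> tag format) are illustrative assumptions, not the library's API:

import re

def parse_api_request(text):
    # Illustrative parser: extract an API request the LLM emitted, e.g.
    # a span of the form "<api>name(arg=value)</api>"; None if absent.
    match = re.search(r'<api>(.*?)</api>', text, re.S)
    return match.group(1) if match else None

def run(instruction, llm, memory, tools, max_calls=5):
    # Retrieve candidate APIs, then alternate between LLM planning and
    # tool execution until the LLM emits a final answer.
    api_schemas = tools.retrieve(instruction, k=3)
    prompt = memory.build_prompt(instruction, api_schemas)
    output = llm.generate(prompt)
    for _ in range(max_calls):
        request = parse_api_request(output)   # None -> no tool needed
        if request is None:
            return output                     # final response to the user
        result = tools.execute(request)
        prompt = prompt + output + str(result)
        output = llm.generate(prompt)         # continue generation
    return output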
3 Training

3.1 Dataset

To facilitate building an agent with the ability to use tools while upholding an optimal level of user engagement, we release a comprehensive tool-enhanced dataset, MSAgent-Bench (https://modelscope.cn/datasets/damo/MSAgent-Bench/summary), utilizing ChatGPT synthetic data and existing instruction-following datasets. Our released dataset encompasses 598k dialogues. Table 1 outlines the key differences between the released dataset and other publicly available tool learning datasets, while the data distribution of our dataset is illustrated in Figure 2. As demonstrated in the table and figure, we have made certain efforts to construct a comprehensive dataset which enables the effective training of an agent:

Multilingual: We collect instances in both Chinese and English, ensuring that the trained agent is capable of functioning in both languages.

Various API Categories: Our dataset supports common APIs that have been registered by users or applied through online API platforms, as well as model APIs that can call neural models.

Multi-Turn Dialog: In real-life scenarios, agents may need to request more specific clarification from users to complete a task, or receive additional instructions after completing a previous task. Our dataset accounts for these scenarios and supports multi-turn user-agent interactions when using tools.

API-Oriented QA: An effective agent should possess knowledge of APIs. Our dataset incorporates API document QA tasks and task planning tasks which require agents to offer appropriate suggestions to users on how to use various APIs to solve complex tasks.

API-Agnostic Instructions: To enhance the agent's ability to follow common instructions and increase user engagement, we have incorporated both Chinese and English API-agnostic instructions within our dataset. These instructions place greater emphasis on the agent's inherent capabilities rather than reliance on API invocation.

The data was collected by prompting ChatGPT (gpt-3.5-turbo) to generate instructions, API requests, and answers based on the API calling results; more details can be found in Appendix D.
Dataset | Language | Instance Type | # Instances | API Type | Avg. Turn | Avg. Step
API-Bank (Li et al., 2023) | English | Tool Use | 264 | Common API | 3.27 | 1.92
ToolAlpaca (Tang et al., 2023) | English | Tool Use | 3.9K | Common API | 1 | 1.66
Gorilla (Patil et al., 2023) | English | Tool Use | 16.4K | Model API | 1 | 1
GPT4Tools (Yang et al., 2023) | English | Tool Use | 71.4K | Model API | 1 | 1
ToolBench (Qin et al., 2023) | English | Tool Use | 26.9K | Common API | 1 | 4.1
MSAgent-Bench (ours) | English + Chinese | Tool Use + Common Chat | 598K | Common API + Model API | 1.52 | 1.31

Table 1: The statistics of MSAgent-Bench and other existing tool learning datasets.
3.2 Model Training

We use MSAgent-Bench to fine-tune multiple open-source LLMs, including LLaMA (Touvron et al., 2023), Qwen (QwenLM, 2023), ChatPLUG (Tian et al., 2023), etc. We train all the open-source LLMs in a multi-round conversation mode and concatenate all the prompts and answers. Compared to common instruction tuning data, the tool learning samples focus more heavily on the accuracy of tool selection and API parameter prediction. Therefore, we propose a simple training strategy, Weighted LM, which enhances the training of the generation of API names and parameters, while zeroing out the loss of tokens from the user prompt and the tool execution. More details can be found in Appendix B.3.
kwargs = dict(model=model, ...)
trainer: EpochBasedTrainer = build_trainer(
    name=args.trainer, default_args=kwargs)
trainer.train()
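A minimal sketch of the Weighted LM loss in PyTorch is given below, assuming per-token role labels. The text above specifies only that API name/parameter tokens are emphasized and that user-prompt and tool-execution tokens are zeroed out, so the concrete up-weighting factor here is an assumed value:

import torch
import torch.nn.functional as F

def weighted_lm_loss(logits, labels, token_roles, api_weight=2.0):
    # token_roles: 0 = user prompt / tool execution (loss zeroed out),
    # 1 = ordinary response tokens, 2 = API name and parameter tokens.
    weights = torch.where(
        token_roles == 2,
        torch.full_like(labels, api_weight, dtype=torch.float),
        (token_roles == 1).float())
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels, reduction='none')
    return (per_token * weights).sum() / weights.sum().clamp(min=1.0)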
4 Evaluation

4.1 Automatic Evaluation

In automatic evaluation, we mainly focus on evaluating the agent's ability to generate accurate API requests and proper answers according to the API calling results. Specifically, we use the action exactly match score (Action EM), which measures whether the agent uses the correct API as the reference gold API, and the ROUGE-L score, which measures the similarity between the generated response and the gold answer. Additionally, we introduce a novel metric called Argument F1 for fully evaluating the quality of API requests. To compute Argument F1, we categorize the arguments in the agent's API request into two cases, namely Half match (HM) and Full match (FM), representing a correct argument with a wrong value and a correct argument with a correct value, respectively. Suppose the gold argument number in the API is |A|, and the number of arguments in the agent's API request is |A∗|; we compute the new Recall and Precision as follows:

    R = (0.5 × #HM + #FM) / |A|,    P = (0.5 × #HM + #FM) / |A∗|,

and the final Argument F1 is computed as F1 = 2RP / (R + P).
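Concretely, the metric can be computed as in the following sketch, which is a direct transcription of the definitions above:

def argument_f1(num_half_match, num_full_match, num_gold, num_predicted):
    # HM contributes half credit (correct argument name, wrong value);
    # FM contributes full credit (correct name and value).
    credit = 0.5 * num_half_match + num_full_match
    recall = credit / num_gold if num_gold else 0.0
    precision = credit / num_predicted if num_predicted else 0.0
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)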
Expert annotators were engaged to annotate the evaluation instances, with the task of providing diverse instructions, manually documenting correct API calling requests, and writing appropriate responses. The statistics of our currently assembled test data are in Appendix B.1, and the automatic evaluation scores of our trained agents can be found in Appendix B.2. We also allow users to upload their own annotated test examples to accurately evaluate the performance of agents in customized scenarios.
4.2 Human Evaluation with Agent Arena

Inspired by the Arena for ChatBots (Zheng et al., 2023), we have built an accessible Agent Arena (https://modelscope.cn/studios/LLMZOO/Chinese-Arena/summary) that allows users to furnish instructions to two anonymous agents, based on the provided APIs. Subsequently, users have the opportunity to vote on which agent performs better in tackling the instruction with the given APIs. In accordance with the framework presented by Zheng et al. (2023), we adopt a system of Elo ratings and leaderboard maintenance for the participating agents.
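For reference, one standard Elo update, as commonly used for chatbot arenas, looks like the sketch below; the K-factor of 32 is an assumed value, not one specified above:

def elo_update(rating_a, rating_b, score_a, k=32):
    # score_a is 1.0 if agent A wins the user vote, 0.0 if it loses,
    # and 0.5 for a tie; ratings move by at most k points per vote.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta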
5 Usage Example of ModelScopeGPT

In this section, we showcase a successful application of the ModelScope Community, ModelScopeGPT (https://modelscope.cn/studios/damo/ModelScopeGPT/summary), based on our ModelScope-Agent.

Figure 3: (a) ModelScope Intelligent Assistant. (b) Register and Use New Tools on Alibaba Cloud.

ModelScope Intelligent Assistant Based on ModelScope-Agent, we have developed an intelligent assistant for the ModelScope Community, namely ModelScopeGPT. It uses LLMs as a controller to connect dozens of domain-specific AI models in the ModelScope open-source community, covering the NLP, CV, Audio, and Multi-Modal fields. To make the pipeline more practical, we have included API retrieval and knowledge retrieval tools to automatically select proper APIs and get access to the local ModelScope knowledge. As shown in Figure 3a, ModelScopeGPT can support API calls in multi-turn conversations and generate correct API call parameters using information from previous conversations. More cases can be found in Appendix C. As a result, ModelScopeGPT has achieved a total request number of over 170k from 40k user visits within one month after its release.

Register and Use New Tools Another key feature of an agent is its generalization capability to unseen APIs. This allows users to quickly register their own APIs and customize their specific applications. Therefore, we test the generalization ability of ModelScopeGPT by applying it to an Alibaba Cloud application scenario. As shown in Figure 3b, we first found an API for renewing an ECS instance on Alibaba Cloud. Then, we registered the API schema defined in the tool library to the agent. Finally, we entered the prompt "Please help me renew an ECS..." in the demo. The agent generated a request through planning, selected the appropriate API, called the API to renew the instance successfully, and provided a reply to inform the user that the renewal was completed. This test demonstrates that the open-source LLM optimized on the released API dataset has a strong generalization ability towards unseen APIs.

6 Conclusion

ModelScope-Agent aims to facilitate building AI Agent applications and research based on open-source LLMs by providing a general and customizable agent framework covering flexible system design, data collection, model training, evaluation and a usage example in a real-world application. It provides an open-source, community-driven library towards AI Agent learning and best practices for building an agent system with open-source LLMs. We hope ModelScope-Agent can help pave the way towards a new era of AI Agents.
Ethics Statement

Intended Use. ModelScope-Agent is designed to facilitate building AI Agent applications and research based on open-source LLMs, by providing a general and customizable agent system.

Potential Misuse. Although we have only trained with the tool-use datasets and gone through certain data filtering rules, it is still possible that the customized model may generate some biased, fake, or unsafe information. Our agent framework also provides users with the freedom to select proper LLMs and upload their own clean data for training. It is also important to design specific methods to improve the safety of the agent framework in the future.
References
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, and Andy Zeng. 2022. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, et al. 2023. Falcon-40B: An open large language model with state-of-the-art performance.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. PaLM: Scaling language modeling with pathways. CoRR, abs/2204.02311.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Tomas Jackson, Noah Brown, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. 2023. Inner monologue: Embodied reasoning through planning with language models. In Proceedings of The 6th Conference on Robot Learning, volume 205 of Proceedings of Machine Learning Research, pages 1769–1782. PMLR.

Huggingface. 2023. Transformers agent. Website. https://huggingface.co/docs/transformers/transformers_agents.

Minghao Li, Feifan Song, Bowen Yu, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. 2023. API-Bank: A benchmark for tool-augmented LLMs. arXiv preprint arXiv:2304.08244.

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.

OpenAI. 2022. ChatGPT: Optimizing language models for dialogue.

OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2023. Gorilla: Large language model connected with massive APIs. arXiv preprint arXiv:2305.15334.

Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, Yi Ren Fung, Yusheng Su, Huadong Wang, Cheng Qian, Runchu Tian, Kunlun Zhu, Shihao Liang, Xingyu Shen, Bokai Xu, Zhen Zhang, Yining Ye, Bowen Li, Ziwei Tang, Jing Yi, Yuzhang Zhu, Zhenning Dai, Lan Yan, Xin Cong, Yaxi Lu, Weilin Zhao, Yuxiang Huang, Junxi Yan, Xu Han, Xian Sun, Dahai Li, Jason Phang, Cheng Yang, Tongshuang Wu, Heng Ji, Zhiyuan Liu, and Maosong Sun. 2023. Tool learning with foundation models. arXiv preprint arXiv:2304.08354.

QwenLM. 2023. Qwen-7B.

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761.

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. arXiv preprint arXiv:2303.17580.

Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, and Le Sun. 2023. ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv preprint arXiv:2306.05301.

THUDM. 2023. ChatGLM. https://github.com/THUDM/ChatGLM-6B.

Junfeng Tian, Hehong Chen, Guohai Xu, Ming Yan, Xing Gao, Jianhai Zhang, Chenliang Li, Jiayi Liu, Wenshen Xu, Haiyang Xu, Qi Qian, Wei Wang, Qinghao Ye, Jiejing Zhang, Ji Zhang, Fei Huang, and Jingren Zhou. 2023. ChatPLUG: Open-domain generative dialogue system with internet-augmented instruction tuning for digital human. arXiv preprint arXiv:2304.07849.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. 2023. Visual ChatGPT: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671.

Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. 2023. GPT4Tools: Teaching large language model to use tools via self-instruction. arXiv preprint arXiv:2305.18752.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.
A Library

A.1 Tool List

API Name (language) | Description | Type
Text-to-Image (en) | Converts text to an image. | Model API
Text-to-Image (zh) | Converts text to an image. | Model API
Text-to-Video (en) | Converts text to a video. | Model API
Text-to-Audio (en) | Converts text to audio. | Model API
Text-to-Audio (zh) | Converts text to audio. | Model API
Image-Chat (en) | Image chat. | Model API
Translation-zh2en | Translates Chinese text to English. | Model API
Translation-en2zh | Translates English text to Chinese. | Model API
Universal-IE (zh) | Extracts structured information. | Model API
Text-to-Geographic (zh) | Extracts geographic information. | Model API
NER (zh) | Recognizes named entities in text. | Model API
API-Retrieval | Retrieves relevant APIs. | Common API
ModelScope-Retrieval | Retrieves ModelScope docs. | Common API
Table 2: The statistics of the default tool list. Supported input languages for the APIs are listed in parentheses.

A.2 CustomTool
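A minimal sketch of a self-defined CustomTool is shown below; the schema fields follow Section 2.2 (API name, description, parameters, and a request function), while the concrete class name, attribute names, and the ECS-renewal example are illustrative assumptions rather than the library's exact interface:

from modelscope_agent.tools import Tool

class RenewInstanceTool(Tool):
    # Hypothetical schema; attribute names are illustrative.
    name = 'renew-ecs-instance'
    description = 'Renew an ECS instance for a given period.'
    parameters = [
        {'name': 'instance_id', 'type': 'string', 'required': True},
        {'name': 'period', 'type': 'integer', 'required': True},
    ]

    def call(self, instance_id, period):
        # Hypothetical request function: invoke the backing API here.
        return {'result': f'Renewed {instance_id} for {period} month(s).'}

tool_list = {'renew-ecs-instance': RenewInstanceTool()}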
B.1 Evaluation Dataset

The evaluation dataset consisted of 360 conversations with 2059 text snippets as the references to be compared with the agent prediction, which comprise 798 API requests and 1261 plain text answers according to the previous calling results.

B.2 Evaluation Results

Model | ROUGE-L | Action EM | Argument F1
ChatGPT (2-shot)∗ | 36.70 | 34.82 | 25.51
LLaMA | 39.16 | 58.60 | 44.98
ChatPLUG | 46.45 | 68.29 | 55.12
MSAgent-Qwen | 51.35 | 87.23 | 68.09

Table 3: Automatic evaluation results. ∗ represents that we do not fine-tune ChatGPT but use in-context learning with 2 demonstrations.

C Cases

In this section, we show qualitative results of the ModelScopeGPT implementation based on ModelScope-Agent.

Single-step Tool Use As shown in Figures 5 and 6, the instructions expect the model to generate a video and chat about an image, respectively. These instructions can be completed with a single step of tool use.

Figure 5: Single-step tool-use instructions, text-to-video cases. We have captured a few frames of the video to display. Testing the model using the same semantic instruction in both English (left) and Chinese (right).

Figure 6: Single-step tool-use instructions, image-chat cases. Testing the model using the same semantic instruction in both English (left) and Chinese (right).

Multi-step Tool Use As shown in Figure 7, the instruction expects the model to write the promotional copy first, then read it aloud, and finally generate a video. These instructions require the model to have the ability of multi-step tool use. In the Chinese case, our model accurately completed the three-step tool use.

Figure 7: Multi-step tool-use instructions. We have captured a few frames of the video to display. Testing the model using the same semantic instruction in both English (left) and Chinese (right).

Multi-turn Tool Use As shown in Figure 8, the instruction requires the model to have the ability to hold a multi-turn conversation and use the conversation history. Our model can accurately call the API and capture the content of the previous conversation to generate the API parameters.

Figure 8: Multi-turn tool-use instructions, text-to-speech and text-to-image cases. Testing the model using the same semantic instruction in both English (left) and Chinese (right).
In-domain Knowledge QA As shown in Figure 9, the instruction requires the model to retrieve in-domain knowledge and use the retrieved knowledge to answer questions.

Figure 9: In-domain knowledge QA cases. Testing the model using the same semantic instruction in both English (left) and Chinese (right).

D Data Collection

As illustrated in Figure 10, the data was collected with ChatGPT alternately playing the roles of the user and the agent: the user side issues an instruction, follow-up, or clarification, and the agent side answers with API requests against an API gallery, reads the returned results, and eventually produces a final answer. The loop would continue until the agent determined that it was appropriate to terminate the conversation with the final answer. After acquiring the raw dataset, we applied filtering mechanisms to eliminate instances in which ChatGPT generated API requests containing hallucinated API names and parameters that were absent from the retrieved APIs. Additionally, we excluded instances in which ChatGPT generated illegal API requests, thus resulting in a refined and finalized dataset.

As introduced in Section 3.1, we collect instances across different languages and topics; the detailed statistics of our collected data are shown in Table 4.

Figure 10: The data collection procedure of MSAgent-Bench.

Instance Type | # Instances
Chinese | 532,436
English | 66,444
Common API | 211,026
Model API | 58,338
API-Oriented QA | 5,000
API-Agnostic Instruction | 329,776

Table 4: The statistics of our collected dataset.