Modelscope-Agent: Building Your Customizable Agent System With Open-Source Large Language Models

∗ Corresponding author: <ym119608@alibaba-inc.com>
Abstract

Large language models (LLMs) have recently demonstrated remarkable capabilities to comprehend human intentions, engage in reasoning, and design planning-like behavior. To further unleash the power of LLMs to accomplish complex tasks, there is a growing trend to build agent frameworks that equip LLMs, such as ChatGPT, with tool-use abilities to connect with massive external APIs.

In this work, we introduce ModelScope-Agent, a general and customizable agent framework for real-world applications, based on open-source LLMs as controllers. It provides a user-friendly system library, with a customizable engine design to support model training on multiple open-source LLMs, while also enabling seamless integration with both model APIs and common APIs in a unified way. To equip the LLMs with tool-use abilities, a comprehensive framework has been proposed spanning tool-use data collection, tool retrieval, tool registration, memory control, customized model training, and evaluation for practical real-world applications. Finally, we showcase ModelScopeGPT, a real-world intelligent assistant of the ModelScope Community based on the ModelScope-Agent framework, which is able to connect open-source LLMs with more than 1000 public AI models and localized community knowledge in ModelScope. The ModelScope-Agent library (https://github.com/modelscope/modelscope-agent) and online demo (https://modelscope.cn/studios/damo/ModelScopeGPT/summary) are now publicly available.

1 Introduction

Large language models (OpenAI, 2022, 2023; Touvron et al., 2023; Chowdhery et al., 2022) have gradually become common AI assistants that demonstrate great potential in comprehending human intentions, performing complex reasoning tasks, and enabling content creation. Despite this rapid progress, they still remain limited in complex tasks such as following user instructions to use external tools and capturing up-to-date information.

To further unleash the power of LLMs for real-world practical applications, a rising trend of current research (Schick et al., 2023; Shen et al., 2023; Yang et al., 2023; Qin et al., 2023; Patil et al., 2023) begins to enable LLMs with tool-use abilities towards building an AI Agent. These include HuggingGPT (Shen et al., 2023), Visual-ChatGPT (Wu et al., 2023) and Gorilla (Patil et al., 2023) for connecting with HuggingFace models, and ToolAlpaca (Tang et al., 2023) and ToolLLaMA (Qin et al., 2023) for using massive common APIs such as weather forecast and search engines. These methods either directly rely on closed-source counterparts like ChatGPT or focus on certain types of API tools. Recently, there have also been public releases of AI agents, such as Auto-GPT (https://github.com/Significant-Gravitas/Auto-GPT), LangChain (https://github.com/langchain-ai/langchain) and Transformers Agent (Huggingface, 2023), which enable LLMs, such as ChatGPT or GPT-4, to use tools and solve complex AI tasks. However, these agents are mainly built with closed-source LLMs, and how to build a customizable agent system with open-source LLMs remains largely unexplored.

In this work, we present ModelScope-Agent, a general and customizable agent system for real-world applications, based on open-source LLMs as controllers. ModelScope (https://modelscope.cn/models) is a public ML community, which seeks to bring together the most advanced machine learning models from the AI community, and streamlines the process of leveraging AI models in real-world applications. ModelScope-Agent provides a flexible and user-friendly system library, with a customizable engine design to
support model training on multiple open-source LLMs, while also enabling seamless integration with both model APIs and common APIs in a unified way. It features an LLM-centric system design, which includes open-source LLMs as the core controller that further interacts with a tool-use module and a memory module to accomplish complex tasks. At the core of ModelScope-Agent, the library supports flexible selection and training of various open-source LLMs, such as LLaMA (Touvron et al., 2023), ChatGLM (THUDM, 2023), ChatPLUG (Tian et al., 2023) and other customized LLMs in ModelScope. For tool use, ModelScope-Agent provides a default tool library, which supports diverse AI model APIs across the NLP, CV, Audio and Multi-modal fields, as well as massive common APIs such as search engines. It also supports registering new self-defined API plugins and automatic API retrieval from the large tool library. It is easy for users to customize their most appropriate LLMs, local API tools and functions to develop real-world applications. Moreover, a memory module is also introduced to better store and manage the system message, user history, in-context examples, tool messages and localized knowledge.

To enable the open-source LLMs to better control the whole agent system, we further propose a comprehensive framework spanning tool-use data collection, customized model training, evaluation and deployment. Notably, we release a comprehensive tool-enhanced dataset, MSAgent-Bench, which consists of 598k dialogues covering various API categories, multi-turn API calls, API-oriented QA, and API-agnostic instructions in both English and Chinese. A simple training strategy, Weighted LM, which enhances the training of the generation of API names and parameters, is used to better ensure the correctness of API calls. Besides, an evaluation framework is also supported in our library to examine the tool-use abilities of the trained models in different aspects. Furthermore, we applied ModelScope-Agent in a real-world application of the ModelScope Community, namely ModelScopeGPT, which is able to connect open-source LLMs with more than 1000 public AI models and access localized community knowledge in ModelScope for community QA.

To summarize, ModelScope-Agent is a general and customizable agent system designed for developers to harness the power of open-source LLMs. The library targets the following goals:

• Agent based on Open-Source LLMs: the controller of ModelScope-Agent can be flexibly selected from open-source LLMs that are optimized through our agent training framework.

• Support and Customization of Diverse Tools: dozens of diverse model APIs and common APIs are given by default. The library supports registering new self-defined APIs and automatic API retrieval from the toolset.

• Customizable Applications: ModelScope-Agent can be flexibly applied in various industry applications. The agent and training framework are documented, describing their usage, construction and optimization.

ModelScope-Agent is in continual development by the engineers at ModelScope and is released under an Apache 2.0 license. Full documentation is available through the project website.

2 The ModelScope Agent

ModelScope-Agent is designed to facilitate developers in building customizable agent systems based on open-source LLMs. The overall system architecture is shown in Figure 1. It includes open-source LLMs as the controller, plus a tool-use module and a memory module to interact with. Given a human instruction, the Agent, which adopts the selected LLM as the controller, will automatically plan tasks, selectively use tools, leverage knowledge in memory, and finally provide a helpful response to the user.

Figure 1: The overall system architecture of ModelScope-Agent.

2.1 LLMs as Brain

LLMs serve as the brain of the agent, responsible for planning and decomposing user requests, selectively calling tools, performing retrieval, and integrating all the information from previous steps to generate the final response. In order to make it easier for users to customize the agent with their own LLMs, we have added support for various open-source LLMs by default, such as LLaMA, ChatGLM and ChatPLUG, which have been optimized through our tool learning pipeline. The details of the training strategy and tool-use datasets are described in Section 3. ModelScope-Agent has integrated the LLM inference pipeline of the ModelScope community, and replacing the LLM can be done by simply setting model_name and model_config. In model_config, the model_id, model_revision, and model parameter settings such as the maximum sequence length should be configured.
# Load the LLM config from "cfg_file" and build a local LLM controller
from modelscope.utils.config import Config

model_cfg = Config.from_file(cfg_file)
llm = LocalLLM(model_name, model_cfg)
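For illustration, a minimal sketch of what such a configuration might contain is shown below; the concrete keys and values are hypothetical and only follow the fields named above (model_id, model_revision, and generation settings), not necessarily the library's exact schema:

# Hypothetical configuration values, for illustration only.
model_name = 'modelscope-agent-7b'           # assumed model alias
model_config = {
    'model_id': 'damo/ModelScope-Agent-7B',  # assumed ModelScope model id
    'model_revision': 'v1.0.0',              # assumed revision tag
    'max_length': 2048,                      # maximum sequence length
}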
Furthermore, ModelScope-Agent also provides a standard way to integrate new LLMs. Users can add their own LLMs by integrating the LLM pipeline into ModelScope. After that, the agent can select the new LLMs for training and inference.

2.2 Tool Use

Tool Library The tool library is used to configure and manage the various collections of APIs used in the agent. ModelScope-Agent can support a wide range of both common APIs, such as search APIs, and AI model APIs across NLP, CV, Audio and Multi-modal models in ModelScope and HuggingFace. Each tool API consists of the API name, description, parameters and request functions. Users can easily choose and configure proper APIs in the library to build their own agent. The default APIs supported in the library are listed in Appendix A.1.

# Load the default tool config file "default_file"
tool_cfg = Config.from_file(default_file)

Register and Customize New Tool The agent allows users to register and customize new tools, while also supporting quick integration of newly registered tools into the agent, enabling LLMs to selectively use the additional self-defined tools for specific applications. This can be simply done by inheriting from a base class, namely Tool, and defining a new CustomTool with the API-related schema of API name, description, parameters, and request functions. More details about CustomTool can be found in Appendix A.2.

from modelscope_agent.tools import Tool

class CustomTool(Tool):
    # logic added here
    # refer to the example in Appendix A.2
    ...

tool_list = {'custom-tool': CustomTool()}
Tool Retrieval and Execution Due to the large number of tool APIs in the tool library, a tool retrieval module is further introduced to recommend appropriate APIs for each instruction prompt. Specifically, we use a dense vector retrieval method based on the unified multilingual text-embedding API (https://help.aliyun.com/zh/dashscope/getting-started-1). We vectorize both the text descriptions of the APIs and the instruction prompt using the text-embedding API. The top-3 most relevant APIs with the highest vector product scores are selected for tool use. As a result, the schema information of the retrieved APIs will be concatenated with other system prompts in the subsequent memory module and sent to the LLMs as input. With the concatenated instruction prompt, the LLMs will plan and generate the API request, which will be executed by the agent. The agent will then return the results to the LLMs for continuous generation.
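A minimal sketch of this retrieval step is given below; the embed() callable stands in for the text-embedding API, and the helper name is an illustrative assumption rather than the library's actual interface:

import numpy as np

def retrieve_top_tools(query, tool_descriptions, embed, k=3):
    # Score each API description against the instruction prompt by the
    # vector product of their embeddings, and keep the top-k (top-3 here).
    query_vec = np.asarray(embed(query))
    scores = {
        name: float(np.dot(query_vec, np.asarray(embed(desc))))
        for name, desc in tool_descriptions.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]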
2.3 Memory Control

The memory module is used to retrieve and assemble a series of contextual information as input to the LLMs. It consists of a knowledge retrieval submodule and a prompt generator submodule, which are responsible for external knowledge retrieval and instruction prompt generation, respectively.

Knowledge Retrieval This enables the agent to get access to up-to-date and localized information related to the query prompt, thereby augmenting the LLMs with dynamic and domain-specific knowledge. We follow the same dense vector retrieval method as the previous tool retrieval module, and support large-scale knowledge retrieval from a localized document corpus. Similarly, it allows users to customize by switching to other open-source retrieval frameworks.

Prompt Generator The prompt generator is used to assemble all available contextual information, such as the system prompt, API schema, retrieved knowledge, conversation history, and few-shot examples. According to the type of user query and the maximum length of the LLM, users can selectively choose proper contextual information and assemble the required input to the LLM. In our agent, the prompt generator needs to be defined before the agent is constructed.
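As a rough sketch of this assembly step (the function and its truncation policy are illustrative assumptions, not the library's implementation):

def generate_prompt(system_prompt, api_schemas, knowledge, history,
                    few_shot_examples, max_length):
    # Assemble the fixed context first, then append conversation history,
    # dropping the oldest turns when the LLM's maximum length is exceeded.
    fixed = [system_prompt, *few_shot_examples, *api_schemas, *knowledge]
    history = list(history)
    prompt = '\n'.join(fixed + history)
    while history and len(prompt) > max_length:
        history.pop(0)
        prompt = '\n'.join(fixed + history)
    return prompt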
2.4 Agent Pipeline

In summary, we build the agent by combining all the modules: the LLM controller, the tool-use module, and the memory module. With agent.run, the agent can efficiently execute and complete an instruction in a one-step generation. First, the agent retrieves query-related tools through tool retrieval and combines the retrieved API schemas with other contextual prompts in the memory module to construct a new instruction prompt. Then, the agent sends this new prompt to the LLM, which plans whether and which API to call and generates an API request. Next, the agent executes the selected API with the extracted API parameters and returns the API results to the LLM, which continues to plan whether to call other APIs. If another API call is needed, the process is repeated; otherwise, the LLM generates the final response and the agent returns the final result to the user.

agent = AgentExecutor(llm, tool_cfg,
                      additional_tool_list=tool_list)
agent.run("Draw a logo image of agent")
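The loop described above can be summarized by the following sketch. AgentExecutor's internals are paraphrased here, and the helper names (tools.retrieve, tools.execute, memory.build_prompt, and the toy parse_api_request with its <api> tag format) are illustrative assumptions, not the library's API:

import re

def parse_api_request(text):
    # Illustrative parser: extract an API request the LLM emitted, e.g.
    # a span of the form "<api>name(arg=value)</api>"; None if absent.
    match = re.search(r'<api>(.*?)</api>', text, re.S)
    return match.group(1) if match else None

def run(instruction, llm, memory, tools, max_calls=5):
    # Retrieve candidate APIs, then alternate between LLM planning and
    # tool execution until the LLM emits a final answer.
    api_schemas = tools.retrieve(instruction, k=3)
    prompt = memory.build_prompt(instruction, api_schemas)
    output = llm.generate(prompt)
    for _ in range(max_calls):
        request = parse_api_request(output)   # None -> no tool needed
        if request is None:
            return output                     # final response to the user
        result = tools.execute(request)
        prompt = prompt + output + str(result)
        output = llm.generate(prompt)         # continue generation
    return output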
3 Training

3.1 Dataset

To facilitate building an agent with the ability to use tools while upholding an optimal level of user engagement, we release a comprehensive tool-enhanced dataset, MSAgent-Bench (https://modelscope.cn/datasets/damo/MSAgent-Bench/summary), utilizing ChatGPT synthetic data and existing instruction-following datasets. Our released dataset encompasses 598k dialogues. Table 1 outlines the key differences between the released dataset and other publicly available tool learning datasets, while the data distribution of our dataset is illustrated in Figure 2. As demonstrated in the table and figure, we have made certain efforts to construct a comprehensive dataset which enables the effective training of an agent:

Multilingual: We collect instances in both Chinese and English, ensuring that the trained agent is capable of functioning in both languages.

Various API Categories: Our dataset supports common APIs that have been registered by users or applied through online API platforms, as well as model APIs that can call neural models.

Multi-Turn Dialog: In real-life scenarios, agents may need to request more specific clarification from users to complete a task, or receive additional instructions after completing a previous task. Our dataset accounts for these scenarios and supports multi-turn user-agent interactions when using tools.

API-Oriented QA: An effective agent should possess knowledge of APIs. Our dataset incorporates API document QA tasks and task planning tasks which require agents to offer appropriate suggestions to users on how to use various APIs to solve complex tasks.

API-Agnostic Instructions: To enhance the agent's ability to follow common instructions and increase user engagement, we have incorporated both Chinese and English API-agnostic instructions within our dataset. These instructions place greater emphasis on the agent's inherent capabilities rather than reliance on API invocation.

The data was collected by prompting ChatGPT (gpt-3.5-turbo) to generate instructions, API requests, and answers based on the API calling results; more details can be found in Appendix D.
Dataset | Language | Instance Type | # Instances | API Type | Avg. Turn | Avg. Step
API-Bank (Li et al., 2023) | English | Tool Use | 264 | Common API | 3.27 | 1.92
ToolAlpaca (Tang et al., 2023) | English | Tool Use | 3.9K | Common API | 1 | 1.66
Gorilla (Patil et al., 2023) | English | Tool Use | 16.4K | Model API | 1 | 1
GPT4Tools (Yang et al., 2023) | English | Tool Use | 71.4K | Model API | 1 | 1
ToolBench (Qin et al., 2023) | English | Tool Use | 26.9K | Common API | 1 | 4.1
MSAgent-Bench (ours) | English + Chinese | Tool Use + Common Chat | 598K | Common API + Model API | 1.52 | 1.31

Table 1: The statistics of MSAgent-Bench and other existing tool learning datasets.
3.2 Model Training

We use MSAgent-Bench to fine-tune multiple open-source LLMs, including LLaMA (Touvron et al., 2023), Qwen (QwenLM, 2023), ChatPLUG (Tian et al., 2023), etc. We train all the open-source LLMs in a multi-round conversation mode and concatenate all the prompts and answers. Compared to common instruction tuning data, the tool learning samples focus more heavily on the accuracy of tool selection and API parameter prediction. Therefore, we propose a simple training strategy, Weighted LM, which enhances the training of the generation of API names and parameters, while zeroing out the loss of tokens from the user prompt and the tool execution. More details can be found in Appendix B.3.
kwargs = dict(model=model, ...)
trainer: EpochBasedTrainer = build_trainer(
    name=args.trainer, default_args=kwargs)
trainer.train()
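A minimal sketch of the Weighted LM loss in PyTorch is given below, assuming per-token role labels. The text above specifies only that API name/parameter tokens are emphasized and that user-prompt and tool-execution tokens are zeroed out, so the concrete up-weighting factor here is an assumed value:

import torch
import torch.nn.functional as F

def weighted_lm_loss(logits, labels, token_roles, api_weight=2.0):
    # token_roles: 0 = user prompt / tool execution (loss zeroed out),
    # 1 = ordinary response tokens, 2 = API name and parameter tokens.
    weights = torch.where(
        token_roles == 2,
        torch.full_like(labels, api_weight, dtype=torch.float),
        (token_roles == 1).float())
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels, reduction='none')
    return (per_token * weights).sum() / weights.sum().clamp(min=1.0)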
4 Evaluation

4.1 Automatic Evaluation

In automatic evaluation, we mainly focus on evaluating the agent's ability to generate accurate API requests and proper answers according to the API calling results. Specifically, we use the action exactly match score (Action EM), which measures whether the agent uses the correct API as the reference gold API, and the ROUGE-L score, which measures the similarity between the generated response and the gold answer. Additionally, we introduce a novel metric called Argument F1 for fully evaluating the quality of API requests. To compute Argument F1, we categorize the arguments in the agent's API request into two cases, namely Half match (HM) and Full match (FM), representing a correct argument with a wrong value and a correct argument with a correct value, respectively. Suppose the gold argument number in the API is |A|, and the number of arguments in the agent's API request is |A∗|; we compute the new Recall and Precision as follows:

    R = (0.5 × #HM + #FM) / |A|,    P = (0.5 × #HM + #FM) / |A∗|,

and the final Argument F1 is computed as F1 = 2RP / (R + P).
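Concretely, the metric can be computed as in the following sketch, which is a direct transcription of the definitions above:

def argument_f1(num_half_match, num_full_match, num_gold, num_predicted):
    # HM contributes half credit (correct argument name, wrong value);
    # FM contributes full credit (correct name and value).
    credit = 0.5 * num_half_match + num_full_match
    recall = credit / num_gold if num_gold else 0.0
    precision = credit / num_predicted if num_predicted else 0.0
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)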
Expert annotators were engaged to annotate the evaluation instances, with the task of providing diverse instructions, manually documenting correct API calling requests, and writing appropriate responses. The statistics of our currently assembled test data are in Appendix B.1, and the automatic evaluation scores of our trained agents can be found in Appendix B.2. We also allow users to upload their own annotated test examples to accurately evaluate the performance of agents in customized scenarios.
4.2 Human Evaluation with Agent Arena

Inspired by the Arena for ChatBots (Zheng et al., 2023), we have built an accessible Agent Arena (https://modelscope.cn/studios/LLMZOO/Chinese-Arena/summary) that allows users to furnish instructions to two anonymous agents, based on the provided APIs. Subsequently, users have the opportunity to vote on which agent performs better in tackling the instruction with the given APIs. In accordance with the framework presented by Zheng et al. (2023), we adopt a system of Elo ratings and leaderboard maintenance for the participating agents.
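For reference, one standard Elo update, as commonly used for chatbot arenas, looks like the sketch below; the K-factor of 32 is an assumed value, not one specified above:

def elo_update(rating_a, rating_b, score_a, k=32):
    # score_a is 1.0 if agent A wins the user vote, 0.0 if it loses,
    # and 0.5 for a tie; ratings move by at most k points per vote.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta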
5 Usage Example of ModelScopeGPT

In this section, we showcase a successful application of the ModelScope Community, ModelScopeGPT (https://modelscope.cn/studios/damo/ModelScopeGPT/summary), based on our ModelScope-Agent.

Figure 3: (a) ModelScope Intelligent Assistant. (b) Register and Use New Tools on Alibaba Cloud.

ModelScope Intelligent Assistant Based on ModelScope-Agent, we have developed an intelligent assistant for the ModelScope Community, namely ModelScopeGPT. It uses LLMs as a controller to connect dozens of domain-specific AI models in the ModelScope open-source community, covering the NLP, CV, Audio, and Multi-Modal fields. To make the pipeline more practical, we have included API retrieval and knowledge retrieval tools to automatically select proper APIs and get access to the local ModelScope knowledge. As shown in Figure 3a, ModelScopeGPT can support API calls in multi-turn conversations and generate correct API call parameters using information from previous conversations. More cases can be found in Appendix C. As a result, ModelScopeGPT has achieved a total request number of over 170k from 40k user visits within one month after its release.

Register and Use New Tools Another key feature of an agent is its generalization capability to unseen APIs. This allows users to quickly register their own APIs and customize their specific applications. Therefore, we test the generalization ability of ModelScopeGPT by applying it to an Alibaba Cloud application scenario. As shown in Figure 3b, we first found an API for renewing an ECS instance on Alibaba Cloud. Then, we registered the API schema defined in the tool library to the agent. Finally, we entered the prompt "Please help me renew an ECS..." in the demo. The agent generated a request through planning, selected the appropriate API, called the API to renew the instance successfully, and provided a reply to inform the user that the renewal was completed. This test demonstrates that the open-source LLM optimized on the released API dataset has a strong generalization ability towards unseen APIs.

6 Conclusion

ModelScope-Agent aims to facilitate building AI Agent applications and research based on open-source LLMs by providing a general and customizable agent framework covering flexible system design, data collection, model training, evaluation and a usage example in a real-world application. It provides an open-source, community-driven library towards AI Agent learning and best practices for building an agent system with open-source LLMs. We hope ModelScope-Agent can help pave the way towards a new era of AI Agents.
Ethics Statement

Intended Use. ModelScope-Agent is designed to facilitate building AI Agent applications and research based on open-source LLMs, by providing a general and customizable agent system.

Potential Misuse. Although we have only trained with the tool-use datasets and gone through certain data filtering rules, it is still possible that the customized model may generate some biased, fake, or unsafe information. Our agent framework also provides users with the freedom to select proper LLMs and upload their own clean data for training. It is also important to design specific methods to improve the safety of the agent framework in the future.
References
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, and Andy Zeng. 2022. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.

Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, et al. 2023. Falcon-40B: An open large language model with state-of-the-art performance.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. PaLM: Scaling language modeling with pathways. CoRR, abs/2204.02311.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Tomas Jackson, Noah Brown, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. 2023. Inner monologue: Embodied reasoning through planning with language models. In Proceedings of The 6th Conference on Robot Learning, volume 205 of Proceedings of Machine Learning Research, pages 1769–1782. PMLR.

Huggingface. 2023. Transformers agent. Website. https://huggingface.co/docs/transformers/transformers_agents.

Minghao Li, Feifan Song, Bowen Yu, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. 2023. API-Bank: A benchmark for tool-augmented LLMs. arXiv preprint arXiv:2304.08244.

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, et al. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.

OpenAI. 2022. ChatGPT: Optimizing language models for dialogue.

OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.

Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. 2023. Gorilla: Large language model connected with massive APIs. arXiv preprint arXiv:2305.15334.

Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, Yi Ren Fung, Yusheng Su, Huadong Wang, Cheng Qian, Runchu Tian, Kunlun Zhu, Shihao Liang, Xingyu Shen, Bokai Xu, Zhen Zhang, Yining Ye, Bowen Li, Ziwei Tang, Jing Yi, Yuzhang Zhu, Zhenning Dai, Lan Yan, Xin Cong, Yaxi Lu, Weilin Zhao, Yuxiang Huang, Junxi Yan, Xu Han, Xian Sun, Dahai Li, Jason Phang, Cheng Yang, Tongshuang Wu, Heng Ji, Zhiyuan Liu, and Maosong Sun. 2023. Tool learning with foundation models. arXiv preprint arXiv:2304.08354.

QwenLM. 2023. Qwen-7B.

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761.

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. arXiv preprint arXiv:2303.17580.

Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, and Le Sun. 2023. ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv preprint arXiv:2306.05301.

THUDM. 2023. ChatGLM. https://github.com/THUDM/ChatGLM-6B.

Junfeng Tian, Hehong Chen, Guohai Xu, Ming Yan, Xing Gao, Jianhai Zhang, Chenliang Li, Jiayi Liu, Wenshen Xu, Haiyang Xu, Qi Qian, Wei Wang, Qinghao Ye, Jiejing Zhang, Ji Zhang, Fei Huang, and Jingren Zhou. 2023. ChatPLUG: Open-domain generative dialogue system with internet-augmented instruction tuning for digital human. arXiv preprint arXiv:2304.07849.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. 2023. Visual ChatGPT: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671.

Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. 2023. GPT4Tools: Teaching large language model to use tools via self-instruction. arXiv preprint arXiv:2305.18752.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.
A Library

A.1 Tool List

API Name (language) | Description | Type
Text-to-Image (en) | Converts text to an image. | Model API
Text-to-Image (zh) | Converts text to an image. | Model API
Text-to-Video (en) | Converts text to a video. | Model API
Text-to-Audio (en) | Converts text to audio. | Model API
Text-to-Audio (zh) | Converts text to audio. | Model API
Image-Chat (en) | Image chat. | Model API
Translation-zh2en | Translates Chinese text to English. | Model API
Translation-en2zh | Translates English text to Chinese. | Model API
Universal-IE (zh) | Extracts structured information. | Model API
Text-to-Geographic (zh) | Extracts geographic information. | Model API
NER (zh) | Recognizes named entities in text. | Model API
API-Retrieval | Retrieves relevant APIs. | Common API
ModelScope-Retrieval | Retrieves ModelScope docs. | Common API
Table 2: The statistics of the default tool list. Supported input languages for the APIs are listed in parentheses.

A.2 CustomTool
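A minimal sketch of a self-defined CustomTool is shown below; the schema fields follow Section 2.2 (API name, description, parameters, and a request function), while the concrete class name, attribute names, and the ECS-renewal example are illustrative assumptions rather than the library's exact interface:

from modelscope_agent.tools import Tool

class RenewInstanceTool(Tool):
    # Hypothetical schema; attribute names are illustrative.
    name = 'renew-ecs-instance'
    description = 'Renew an ECS instance for a given period.'
    parameters = [
        {'name': 'instance_id', 'type': 'string', 'required': True},
        {'name': 'period', 'type': 'integer', 'required': True},
    ]

    def call(self, instance_id, period):
        # Hypothetical request function: invoke the backing API here.
        return {'result': f'Renewed {instance_id} for {period} month(s).'}

tool_list = {'renew-ecs-instance': RenewInstanceTool()}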
B.1 Evaluation Dataset

The evaluation dataset consisted of 360 conversations with 2059 text snippets as the references to be compared with the agent prediction, which comprise 798 API requests and 1261 plain text answers according to the previous calling results.

B.2 Evaluation Results

Model | ROUGE-L | Action EM | Argument F1
ChatGPT (2-shot)∗ | 36.70 | 34.82 | 25.51
LLaMA | 39.16 | 58.60 | 44.98
ChatPLUG | 46.45 | 68.29 | 55.12
MSAgent-Qwen | 51.35 | 87.23 | 68.09

Table 3: Automatic evaluation results. ∗ represents that we do not fine-tune ChatGPT but use in-context learning with 2 demonstrations.

C Cases

In this section, we show qualitative results of the ModelScopeGPT implementation based on ModelScope-Agent.

Single-step Tool Use As shown in Figures 5 and 6, the instructions expect the model to generate a video and chat about an image, respectively. These instructions can be completed with a single step of tool use.

Figure 5: Single-step tool-use instructions, text-to-video cases. We have captured a few frames of the video to display. Testing the model using the same semantic instruction in both English (left) and Chinese (right).

Figure 6: Single-step tool-use instructions, image-chat cases. Testing the model using the same semantic instruction in both English (left) and Chinese (right).

Multi-step Tool Use As shown in Figure 7, the instruction expects the model to write the promotional copy first, then read it aloud, and finally generate a video. These instructions require the model to have the ability of multi-step tool use. In the Chinese case, our model accurately completed the three-step tool use.

Figure 7: Multi-step tool-use instructions. We have captured a few frames of the video to display. Testing the model using the same semantic instruction in both English (left) and Chinese (right).

Multi-turn Tool Use As shown in Figure 8, the instruction requires the model to have the ability to hold a multi-turn conversation and use the conversation history. Our model can accurately call the API and capture the content of the previous conversation to generate the API parameters.

Figure 8: Multi-turn tool-use instructions, text-to-speech and text-to-image cases. Testing the model using the same semantic instruction in both English (left) and Chinese (right).
In-domain Knowledge QA As shown in Figure 9, the instruction requires the model to retrieve in-domain knowledge and use the retrieved knowledge to answer questions.

Figure 9: In-domain knowledge QA cases. Testing the model using the same semantic instruction in both English (left) and Chinese (right).

D Data Collection

As illustrated in Figure 10, the data was collected with ChatGPT alternately playing the roles of the user and the agent: the user side issues an instruction, follow-up, or clarification, and the agent side answers with API requests against an API gallery, reads the returned results, and eventually produces a final answer. The loop would continue until the agent determined that it was appropriate to terminate the conversation with the final answer. After acquiring the raw dataset, we applied filtering mechanisms to eliminate instances in which ChatGPT generated API requests containing hallucinated API names and parameters that were absent from the retrieved APIs. Additionally, we excluded instances in which ChatGPT generated illegal API requests, thus resulting in a refined and finalized dataset.

As introduced in Section 3.1, we collect instances across different languages and topics; the detailed statistics of our collected data are shown in Table 4.

Figure 10: The data collection procedure of MSAgent-Bench.

Instance Type | # Instances
Chinese | 532,436
English | 66,444
Common API | 211,026
Model API | 58,338
API-Oriented QA | 5,000
API-Agnostic Instruction | 329,776

Table 4: The statistics of our collected dataset.