Abstract
The joint development of human-robot interaction (HRI) and large language models (LLMs) has paved the way for a wide range of robotics applications, from industrial automation to service robotics. Although LLMs have demonstrated impressive capabilities, their application in robotics is hindered by a critical limitation: the absence of real-world memory and common sense. This deficiency makes it difficult for robots to interpret multi-turn instructional commands. For instance, a command such as "Remind me to take medicine tomorrow morning" can be ambiguous: the model may struggle to determine whether the utterance is in fact an instruction, and whether additional arguments are required before it can be invoked. Moreover, there is a substantial gap in the literature regarding comparative studies of prompt engineering versus supervised fine-tuning for LLM-based robot instruction invocation. Addressing this gap is essential for integrating LLMs into practical robotics systems and improving human-robot interaction.
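To make the ambiguity concrete, the sketch below shows the kind of multi-turn slot-filling loop such a system needs: decide whether an utterance maps to a robot instruction at all, and if so, ask follow-up questions until every required argument is filled. This is a minimal illustrative toy, not the paper's framework; the `Dialogue` class, the `SCHEMAS` table, and all instruction names are hypothetical.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Hypothetical instruction schemas; a real system would expose the
# robot's actual command API and required parameters here.
SCHEMAS = {
    "set_reminder": {"required": ["time", "content"]},
    "navigate": {"required": ["destination"]},
}

@dataclass
class Dialogue:
    """Tracks one instruction-invocation dialogue across turns."""
    intent: str | None = None
    args: dict = field(default_factory=dict)

    def update(self, intent: str | None = None, **slots) -> tuple:
        """Fold one turn's parsed intent/slots into the dialogue state."""
        if intent:
            self.intent = intent
        self.args.update(slots)
        return self.status()

    def status(self) -> tuple:
        """Return ('chat', None) for non-instructions, ('ask', slot)
        while a required argument is missing, else ('invoke', call)."""
        if self.intent is None:
            return ("chat", None)
        missing = [s for s in SCHEMAS[self.intent]["required"]
                   if s not in self.args]
        if missing:
            return ("ask", missing[0])   # a follow-up turn is needed
        return ("invoke", {"name": self.intent, "args": dict(self.args)})

d = Dialogue()
# Turn 1: utterance parsed as a reminder, but only the time is given.
print(d.update(intent="set_reminder", time="tomorrow morning"))
# → ('ask', 'content')
# Turn 2: the user supplies the missing reminder content.
print(d.update(content="take medicine"))
# → ('invoke', {'name': 'set_reminder', 'args': {'time': 'tomorrow morning', 'content': 'take medicine'}})
```

In a deployed system, the per-turn parsing into `intent` and `slots` would be done by the LLM (via prompting or fine-tuning); the loop above only shows the state the dialogue manager must maintain either way.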
In this study, we present a novel framework for multi-turn instruction invocation that also handles related dialogue tasks in robotics. Using a real-world robot dataset, we conduct a comprehensive evaluation of several large-scale models on instruction invocation. This systematic comparison allows us to identify the strengths and limitations of existing approaches and offers insights for developing more effective robotics systems.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Cheng, B. et al. (2025). Multi-turn Instruction Invocation on Human-Robot Interaction by Large Language Models. In: Lan, X., Mei, X., Jiang, C., Zhao, F., Tian, Z. (eds) Intelligent Robotics and Applications. ICIRA 2024. Lecture Notes in Computer Science(), vol 15207. Springer, Singapore. https://doi.org/10.1007/978-981-96-0780-8_15
Print ISBN: 978-981-96-0779-2
Online ISBN: 978-981-96-0780-8