Public Access

Text2Motion: from natural language instructions to feasible plans

Published: 14 November 2023

Abstract

We propose Text2Motion, a language-based planning framework enabling robots to solve sequential manipulation tasks that require long-horizon reasoning. Given a natural language instruction, our framework constructs both a task- and motion-level plan that is verified to reach inferred symbolic goals. Text2Motion uses feasibility heuristics encoded in Q-functions of a library of skills to guide task planning with Large Language Models. Whereas previous language-based planners only consider the feasibility of individual skills, Text2Motion actively resolves geometric dependencies spanning skill sequences by performing geometric feasibility planning during its search. We evaluate our method on a suite of problems that require long-horizon reasoning, interpretation of abstract goals, and handling of partial affordance perception. Our experiments show that Text2Motion can solve these challenging problems with a success rate of 82%, while prior state-of-the-art language-based planning methods only achieve 13%. Text2Motion thus provides promising generalization characteristics to semantically diverse sequential manipulation tasks with geometric dependencies between skills. Qualitative results are made available at https://sites.google.com/stanford.edu/text2motion.
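The abstract describes the core mechanism: an LLM proposes candidate skill sequences toward an inferred symbolic goal, learned Q-functions act as feasibility heuristics, and geometric feasibility planning checks dependencies that span the whole sequence rather than individual skills. The sketch below illustrates that loop under stated assumptions; it is not the authors' implementation, and every name in it (Skill, sequence_feasibility, plan, the toy skill library) is a hypothetical stand-in.

```python
# A minimal, hypothetical sketch of the planning loop the abstract describes.
# Assumptions (not the paper's API): Skill, sequence_feasibility, plan, and
# the toy skill library below are all illustrative stand-ins.

from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

State = Dict[str, float]  # stand-in for a geometric scene state


@dataclass
class Skill:
    name: str
    q_value: Callable[[State], float]  # learned critic: estimated success probability
    effect: Callable[[State], State]   # predicted next state (assumed dynamics model)


def sequence_feasibility(skills: List[Skill], state: State) -> float:
    """Score a sequence as the product of per-skill Q-values along a simulated
    rollout, so geometric dependencies spanning skills are accounted for."""
    score = 1.0
    for skill in skills:
        score *= skill.q_value(state)
        state = skill.effect(state)
    return score


def plan(candidates: List[List[str]], library: Dict[str, Skill],
         state: State, threshold: float = 0.5) -> Optional[List[Skill]]:
    """Return the first LLM-proposed sequence whose rollout feasibility clears
    a threshold; None signals that replanning is needed."""
    for names in candidates:
        if not all(n in library for n in names):
            continue  # skip sequences that name skills outside the library
        sequence = [library[n] for n in names]
        if sequence_feasibility(sequence, state) > threshold:
            return sequence
    return None


if __name__ == "__main__":
    # Toy library: pulling the box is near-infeasible unless the hook is held.
    library = {
        "pick(hook)": Skill("pick(hook)",
                            q_value=lambda s: 0.9,
                            effect=lambda s: {**s, "holding_hook": 1.0}),
        "pull(box)": Skill("pull(box)",
                           q_value=lambda s: 0.8 if s.get("holding_hook") else 0.05,
                           effect=lambda s: {**s, "box_near": 1.0}),
    }
    # Candidate sequences in the order an LLM might rank them.
    candidates = [["pull(box)"], ["pick(hook)", "pull(box)"]]
    result = plan(candidates, library, state={})
    print([s.name for s in result] if result else "no feasible plan")
    # -> ['pick(hook)', 'pull(box)']: the one-step plan scores 0.05 and is
    #    rejected, while the two-step plan scores 0.9 * 0.8 = 0.72.
```

Scoring whole sequences along a simulated rollout, rather than each skill in isolation, is the distinction the abstract draws against prior language-based planners.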




    Published In

    Autonomous Robots, Volume 47, Issue 8
    Dec 2023
    598 pages

    Publisher

    Kluwer Academic Publishers, United States

    Publication History

    Received: 02 May 2023
    Accepted: 31 July 2023
    Published: 14 November 2023

    Author Tags

    1. Long-horizon planning
    2. Robot manipulation
    3. Large language models

    Qualifiers

    • Research-article

    Article Metrics

    • Downloads (last 12 months): 0
    • Downloads (last 6 weeks): 0
    Reflects downloads up to 10 Nov 2024

    Cited By
    • (2024) Text-Guided Synthesis of Crowd Animation. ACM SIGGRAPH 2024 Conference Papers, pp. 1–11. https://doi.org/10.1145/3641519.3657516 (online 13 Jul 2024)
    • (2024) Generative Expressive Robot Behaviors using Large Language Models. Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, pp. 482–491. https://doi.org/10.1145/3610977.3634999 (online 11 Mar 2024)
    • (2024) A survey on integration of large language models with intelligent robots. Intelligent Service Robotics, 17(5), 1091–1107. https://doi.org/10.1007/s11370-024-00550-5 (online 1 Sep 2024)
    • (2024) ShapeLLM: Universal 3D Object Understanding for Embodied Interaction. Computer Vision – ECCV 2024, pp. 214–238. https://doi.org/10.1007/978-3-031-72775-7_13 (online 29 Sep 2024)
