DOI: 10.1145/3563657.3596138
research-article
Open access

Herding AI Cats: Lessons from Designing a Chatbot by Prompting GPT-3

Published: 10 July 2023

Abstract

Prompting Large Language Models (LLMs) is an exciting new approach to designing chatbots. But can it improve LLMs’ user experience (UX) reliably enough to power chatbot products? Our attempt to design a robust chatbot by prompting GPT-3/4 alone suggests: not yet. Prompts made achieving “80%” UX goals easy, but not the remaining 20%. Fixing the few remaining interaction breakdowns resembled herding cats: We could not address one UX issue or test one design solution at a time; instead, we had to handle everything everywhere all at once. Moreover, because no prompt could make GPT reliably say “I don’t know” when it should, the user-GPT conversations had no guardrails after a breakdown occurred, often leading to UX downward spirals. These risks incentivized us to design highly prescriptive prompts and scripted bots, counter to the promises of LLM-powered chatbots. This paper describes this case study, unpacks prompting’s fickleness and its impact on UX design processes, and discusses implications for LLM-based design methods and tools.


    Published In

    DIS '23: Proceedings of the 2023 ACM Designing Interactive Systems Conference
    July 2023
    2717 pages
    ISBN:9781450398930
    DOI:10.1145/3563657
    This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GPT
    2. Prompt engineering
    3. UX
    4. conversational user interface

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • USAF / DARPA - SDCPS Program

    Conference

    DIS '23: Designing Interactive Systems Conference
    July 10 - 14, 2023
    Pittsburgh, PA, USA

    Acceptance Rates

    Overall Acceptance Rate 1,158 of 4,684 submissions, 25%

    Article Metrics

    • Downloads (last 12 months): 2,208
    • Downloads (last 6 weeks): 217
    Reflects downloads up to 02 Sep 2024

    Cited By

    • (2024) Coverage-based Strategies for the Automated Synthesis of Test Scenarios for Conversational Agents. Proceedings of the 5th ACM/IEEE International Conference on Automation of Software Test (AST 2024), 23-33. DOI: 10.1145/3644032.3644456. Online publication date: 15-Apr-2024
    • (2024) Automating the Development of Task-oriented LLM-based Chatbots. Proceedings of the 6th ACM Conference on Conversational User Interfaces, 1-10. DOI: 10.1145/3640794.3665538. Online publication date: 8-Jul-2024
    • (2024) ConstitutionMaker: Interactively Critiquing Large Language Models by Converting Feedback into Principles. Proceedings of the 29th International Conference on Intelligent User Interfaces, 853-868. DOI: 10.1145/3640543.3645144. Online publication date: 18-Mar-2024
    • (2024) Measuring and Clustering Heterogeneous Chatbot Designs. ACM Transactions on Software Engineering and Methodology 33:4, 1-43. DOI: 10.1145/3637228. Online publication date: 17-Apr-2024
    • (2024) Designing for Human-Agent Alignment: Understanding what humans want from their agents. Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems, 1-6. DOI: 10.1145/3613905.3650948. Online publication date: 11-May-2024
    • (2024) On the Design of Quologue: Uncovering Opportunities and Challenges with Generative AI as a Resource for Creating a Self-Morphing E-book Metadata Archive. Extended Abstracts of the 2024 CHI Conference on Human Factors in Computing Systems, 1-16. DOI: 10.1145/3613905.3650909. Online publication date: 11-May-2024
    • (2024) MindfulDiary: Harnessing Large Language Model to Support Psychiatric Patients' Journaling. Proceedings of the CHI Conference on Human Factors in Computing Systems, 1-20. DOI: 10.1145/3613904.3642937. Online publication date: 11-May-2024
    • (2024) Prompting for Discovery: Flexible Sense-Making for AI Art-Making with Dreamsheets. Proceedings of the CHI Conference on Human Factors in Computing Systems, 1-17. DOI: 10.1145/3613904.3642858. Online publication date: 11-May-2024
    • (2024) Bridging the Gulf of Envisioning: Cognitive Challenges in Prompt Based Interactions with LLMs. Proceedings of the CHI Conference on Human Factors in Computing Systems, 1-19. DOI: 10.1145/3613904.3642754. Online publication date: 11-May-2024
    • (2024) StayFocused: Examining the Effects of Reflective Prompts and Chatbot Support on Compulsive Smartphone Use. Proceedings of the CHI Conference on Human Factors in Computing Systems, 1-19. DOI: 10.1145/3613904.3642479. Online publication date: 11-May-2024
