DOI: 10.1145/3652988.3673967
Research article
Open access

On LLM Wizards: Identifying Large Language Models' Behaviors for Wizard of Oz Experiments

Published: 26 December 2024

Abstract

The Wizard of Oz (WoZ) method is a widely adopted research approach where a human Wizard “role-plays” a technology that is not readily available and interacts with participants to elicit user behaviors and probe the design space. With the growing ability of modern large language models (LLMs) to role-play, one can apply LLMs as Wizards in WoZ experiments with better scalability and lower cost than the traditional approach. However, methodological guidance on responsibly applying LLMs in WoZ experiments and a systematic evaluation of LLMs’ role-playing ability are lacking. Through two LLM-powered WoZ studies, we take the first step towards identifying an experiment lifecycle for researchers to safely integrate LLMs into WoZ experiments and interpret data generated from settings that involve Wizards role-played by LLMs. We also contribute a heuristic-based evaluation framework that allows the estimation of LLMs’ role-playing ability in WoZ experiments and reveals LLMs’ behavior patterns at scale.
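
To make the setup described above concrete, the sketch below shows one way a study server could let an LLM role-play the Wizard in a text-based WoZ session and log the transcript for later analysis. This is a minimal sketch, not the authors' implementation: the OpenAI Python client, the model name, and the Wizard persona prompt are all illustrative assumptions.

```python
# Minimal sketch of an LLM-powered Wizard for a text-based WoZ session.
# Assumptions: the OpenAI Python client is installed, an API key is set in the
# environment, and the model name and persona prompt are placeholders rather
# than the prompts used in the paper.
from openai import OpenAI

client = OpenAI()

WIZARD_PERSONA = (
    "You are role-playing the 'Wizard' in a Wizard of Oz study: act as a "
    "friendly assistant for a technology that does not yet exist, keep "
    "responses short, and stay in character for the entire conversation."
)

def run_wizard_session() -> list[dict]:
    """Run one console-based WoZ session and return the full transcript."""
    messages = [{"role": "system", "content": WIZARD_PERSONA}]
    while True:
        user_turn = input("Participant> ")
        if user_turn.strip().lower() in {"quit", "exit"}:
            break
        messages.append({"role": "user", "content": user_turn})
        reply = client.chat.completions.create(
            model="gpt-4o-mini",   # illustrative model choice
            messages=messages,
            temperature=0.7,
        )
        wizard_turn = reply.choices[0].message.content
        messages.append({"role": "assistant", "content": wizard_turn})
        print(f"Wizard> {wizard_turn}")
    return messages  # transcript for downstream behavior analysis

if __name__ == "__main__":
    run_wizard_session()
```

In a study setting, the returned transcript would typically be stored alongside participant metadata so that the Wizard's behavior can later be audited and scored, in the spirit of the heuristic-based evaluation the abstract describes.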

Supplemental Material

PDF file: additional details such as an explanation of research limitations, example LLM conversations, sample prompts, and detailed metrics examples.

    Published In

    IVA '24: Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents
    September 2024
    337 pages

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 26 December 2024

    Author Tags

    1. LLM
    2. Wizard of Oz
    3. WoZ
    4. large language model
    5. methods
    6. persuasive conversation
    7. synthetic data

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    IVA '24: ACM International Conference on Intelligent Virtual Agents
    September 16-19, 2024
    Glasgow, United Kingdom

    Acceptance Rates

    Overall Acceptance Rate 53 of 196 submissions, 27%
