Abstract
Online reviews serve as a guide for consumer choice. With advancements in large language models (LLMs) and generative AI, the fast and inexpensive creation of human-like text may threaten the feedback function of online reviews if neither readers nor platforms can differentiate between human-written and AI-generated content. In two experiments, we found that humans cannot recognize AI-written reviews. Even with monetary incentives for accuracy, both Type I and Type II errors were common: human reviews were often mistaken for AI-generated reviews, and even more frequently, AI-generated reviews were mistaken for human reviews. This held true across various ratings, emotional tones, review lengths, and participants’ genders, education levels, and AI expertise. Younger participants were somewhat better at distinguishing between human and AI reviews. An additional study revealed that current AI detectors were also fooled by AI-generated reviews. We discuss the implications of our findings on trust erosion, manipulation, regulation, consumer behavior, AI detection, market structure, innovation, and review platforms.
Data availability
Data and code are available from the author upon request.
Notes
Zhang et al. (2016) define fake reviews as “deceptive reviews provided with an intention to mislead consumers in their purchase decision making, often by reviewers with little or no actual experience with the products or services being reviewed. Fake reviews can be either unwarranted positive reviews aiming to promote a product, or unjustified false negative comments on competing products in order to damage their reputations.”
Wu et al. (2020) highlight an interesting exception: some newly established review platforms intentionally add fake reviews and copy reviews from other platforms to give the impression that their platform is widely used, thereby circumventing the catch-22 of platforms: users do not arrive until reviews are posted, and reviews are not posted until users arrive.
This refined prompt is based on a simpler one from our pilot study, where we found that GPT-4 produces longer texts unless instructed to shorten them. Participants often identified human-generated reviews by typos, misspellings, or unusual spellings such as ALL CAPS, leading us to incorporate these features into the GPT prompt.
Given the full randomization, participants may or may not have seen both a human- and an AI-written review of the same restaurant.
We targeted 150 participants; however, after one participant timed out and was replaced by Prolific, the original participant returned and completed the survey, resulting in 151 participants.
We used ChatGPT to code the reviews for valence, emotionality, presence of typos, profanity, and informal expressions. Specifically, we gave GPT-4 the following instruction: “Here is a restaurant review. [XXX] Code this review for each of the following dimensions: sentiment (from 0 to 100, where 100 is highly positive), emotionality (from 0 to 100, where 100 is highly sentimental), the number of typos or misspellings, the number of profane words or expressions, and the number of informal expressions. Put in a table.” We cross-checked a sample of these codings and agreed with GPT-4’s answers, so we used these values in the regressions.
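As an illustration only (not the authors' original tooling), the coding procedure described in this note could be issued programmatically. The sketch below, a minimal example assuming the OpenAI Python client and the `gpt-4` model name, splices a review into the quoted instruction and returns the model's raw table output; the function names are hypothetical.

```python
def build_coding_prompt(review: str) -> str:
    """Reproduce the coding instruction quoted in the note, with the review spliced in."""
    return (
        f"Here is a restaurant review. [{review}] Code this review for each of the "
        "following dimensions: sentiment (from 0 to 100, where 100 is highly positive), "
        "emotionality (from 0 to 100, where 100 is highly sentimental), the number of "
        "typos or misspellings, the number of profane words or expressions, and the "
        "number of informal expressions. Put in a table."
    )

def code_review(review: str, model: str = "gpt-4") -> str:
    """Send the coding prompt to the chat completions endpoint (hypothetical usage).

    Assumes the `openai` package is installed and OPENAI_API_KEY is set.
    """
    from openai import OpenAI  # imported lazily so the prompt builder has no dependencies

    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_coding_prompt(review)}],
    )
    return response.choices[0].message.content
```

In practice the returned table would still need to be parsed into numeric columns before entering the regressions; the note's cross-checking step (manually verifying a sample of codings) would apply unchanged.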
The sample of restaurants in Study 2 is different from the sample of restaurants and reviews in Study 1. In Study 2, we only included restaurants that received at least 10 English-language reviews in 2019.
References
Agnihotri, A., & Bhattacharya, S. (2016). Online review helpfulness: Role of qualitative factors. Psychology & Marketing, 33(11), 1006–1017.
Ahmad, W., & Sun, J. (2018). Modeling consumer distrust of online hotel reviews. International Journal of Hospitality Management, 71, 77–90.
Ananthakrishnan, U. M., Li, B., & Smith, M. D. (2020). A tangled web: Should online review portals display fraudulent reviews? Information Systems Research, 31(3), 950–971.
Archak, N., Ghose, A., & Ipeirotis, P. G. (2011). Deriving the pricing power of product features by mining consumer reviews. Management Science, 57(8), 1485–1509.
Brandl, R., & Ellis, C. (2023). Survey: ChatGPT and AI content – Can people tell the difference? Retrieved from https://www.tooltester.com/en/blog/chatgpt-survey-can-people-tell-the-difference/
Cheung, C. M., & Lee, M. K. (2012). What drives consumers to spread electronic word of mouth in online consumer-opinion platforms. Decision Support Systems, 53(1), 218–225.
Chevalier, J. A., & Mayzlin, D. (2006). The effect of word of mouth on sales: Online book reviews. Journal of Marketing Research, 43(3), 345–354.
Dellarocas, C. (2003). The digitization of word of mouth: Promise and challenges of online feedback mechanisms. Management Science, 49(10), 1407–1424.
Dellarocas, C., Zhang, X. M., & Awad, N. F. (2007). Exploring the value of online product reviews in forecasting sales: The case of motion pictures. Journal of Interactive Marketing, 21(4), 23–45.
Han, J., Pei, J., & Tong, H. (2022). Data mining: Concepts and techniques. Morgan Kaufmann.
He, S., Hollenbeck, B., & Proserpio, D. (2022). The market for fake reviews. Marketing Science, 41(5), 896–921.
Ippolito, D., Duckworth, D., Callison-Burch, C., & Eck, D. (2019). Automatic detection of generated text is easiest when humans are fooled. arXiv preprint arXiv:1911.00650
Jago, A. S. (2019). Algorithms and authenticity. Academy of Management Discoveries, 5(1), 38–56.
Jakesch, M., Hancock, J. T., & Naaman, M. (2023). Human heuristics for AI-generated language are flawed. Proceedings of the National Academy of Sciences, 120(11), e2208839120.
Köbis, N., & Mossink, L. D. (2021). Artificial intelligence versus Maya Angelou: Experimental evidence that people cannot differentiate AI-generated from human-written poetry. Computers in Human Behavior, 114, 106553.
Kovács, B. (2024). Studying travel networks using establishment Covisit networks in online review data. Socius, 10, 23780231241228916.
Kovács, B., & Carroll, G. R. (2023). Distinguishing between cosmopolitans and omnivores in organizational audiences. Academy of Management Discoveries, 9(4), 549–577.
Kovács, B., Carroll, G. R., & Lehman, D. W. (2014). Authenticity and consumer value ratings: Empirical tests from the restaurant domain. Organization Science, 25(2), 458–478.
Kozinets, R. V. (2002). The field behind the screen: Using netnography for marketing research in online communities. Journal of Marketing Research, 39(1), 61–72.
Laudon, K. C., & Laudon, J. P. (2004). Management information systems: Managing the digital firm. Pearson Education.
Le Mens, G., Kovács, B., Hannan, M. T., & Pros, G. (2023). Uncovering the semantics of concepts using GPT-4. Proceedings of the National Academy of Sciences, 120(49), e2309350120.
Li, X., & Hitt, L. M. (2008). Self-selection and information role of online product reviews. Information Systems Research, 19(4), 456–474.
Luca, M., & Zervas, G. (2016). Fake it till you make it: Reputation, competition, and Yelp review fraud. Management Science, 62(12), 3412–3427.
Mayzlin, D., Dover, Y., & Chevalier, J. (2014). Promotional reviews: An empirical investigation of online review manipulation. American Economic Review, 104(8), 2421–2455.
Miller, E. J., Steward, B. A., Witkower, Z., Sutherland, C. A., Krumhuber, E. G., & Dawel, A. (2023). AI hyperrealism: Why AI faces are perceived as more real than human ones. Psychological Science, 34(12), 1390–1403.
Mudambi, S. M., & Schuff, D. (2010). What makes a helpful review? A study of customer reviews on Amazon.com. MIS Quarterly, 34(1), 185–200.
Netzer, O., Feldman, R., Goldenberg, J., & Fresko, M. (2012). Mine your own business: Market-structure surveillance through text mining. Marketing Science, 31(3), 521–543.
Orenstrakh, M. S., Karnalim, O., Suarez, C. A., & Liut, M. (2023). Detecting LLM-generated text in computing education: A comparative study for ChatGPT cases. arXiv preprint arXiv:2307.07411
Pavlou, P. A., & Dimoka, A. (2006). The nature and role of feedback text comments in online marketplaces: Implications for trust building, price premiums, and seller differentiation. Information Systems Research, 17(4), 392–414.
Pavlou, P. A., & Gefen, D. (2004). Building effective online marketplaces with institution-based trust. Information Systems Research, 15(1), 37–59.
Pentina, I., Bailey, A. A., & Zhang, L. (2018). Exploring effects of source similarity, message valence, and receiver regulatory focus on yelp review persuasiveness and purchase intentions. Journal of Marketing Communications, 24(2), 125–145.
Sharkey, A., Kovács, B., & Hsu, G. (2023). Expert critics, rankings, and review aggregators: The changing nature of intermediation and the rise of markets with multiple intermediaries. Academy of Management Annals, 17(1), 1–36.
Tadelis, S. (2016). Reputation and feedback systems in online platform markets. Annual Review of Economics, 8, 321–340.
Turing, A. M. (1950). Computing machinery and intelligence. Mind, LIX(236), 433–460.
Uchendu, A., Ma, Z., Le, T., Zhang, R., & Lee, D. (2021). TuringBench: A benchmark environment for Turing test in the age of neural text generation. arXiv preprint arXiv:2109.13296
Wu, Y., Ngai, E. W., Wu, P., & Wu, C. (2020). Fake online reviews: Literature review, synthesis, and directions for future research. Decision Support Systems, 132, 113280.
Zhang, D., Zhou, L., Kehoe, J. L., & Kilic, I. Y. (2016). What online reviewer behaviors really matter? Effects of verbal and nonverbal behaviors on detection of fake online reviews. Journal of Management Information Systems, 33(2), 456–481.
Zhang, T., Li, G., Cheng, T., & Lai, K. K. (2017). Welfare economics of review information: Implications for the online selling platform owner. International Journal of Production Economics, 184, 69–79.
Zhao, Y., Yang, S., Narayan, V., & Zhao, Y. (2013). Modeling consumer learning from online product reviews. Marketing Science, 32(1), 153–169.
Acknowledgements
This research has benefitted from feedback from Glenn Carroll, Jennifer Dannals, Jerker Denrell, Balázs Gyenis, Arthur Jago, and Iris Wang. All remaining errors are my own.
Ethics declarations
Ethical approval
Yale University’s Institutional Review Board approved the research (IRB #1508016387).
Informed consent
Consent was collected at the beginning of the experiment.
Conflict of interest
The author declares no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Kovács, B. The Turing test of online reviews: Can we tell the difference between human-written and GPT-4-written online reviews?. Mark Lett (2024). https://doi.org/10.1007/s11002-024-09729-3