Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Breaking News: Case Studies of Generative AI’s Use in Journalism

Natalie Grace Brigham, Chongjiu Gao, Tadayoshi Kohno
Franziska Roesner, Niloofar Mireshghallah
University of Washington
{nbrigham, chongjiu, yoshi, franzi, niloofar}@cs.washington.edu
Abstract

Journalists are among the many users of large language models (LLMs). To better understand the journalist-AI interactions, we conduct a study of LLM usage by two news agencies through browsing the WildChat dataset, identifying candidate interactions, and verifying them by matching to online published articles. Our analysis uncovers instances where journalists provide sensitive material such as confidential correspondence with sources or articles from other agencies to the LLM as stimuli and prompt it to generate articles, and publish these machine-generated articles with limited intervention (median output-publication ROUGE-L of 0.62). Based on our findings, we call for further research into what constitutes responsible use of AI, and the establishment of clear guidelines and best practices on using LLMs in a journalistic context.

Breaking News: Case Studies of Generative AI’s Use in Journalism


Natalie Grace Brigham, Chongjiu Gao, Tadayoshi Kohno Franziska Roesner, Niloofar Mireshghallah University of Washington {nbrigham, chongjiu, yoshi, franzi, niloofar}@cs.washington.edu


1 Introduction

LLMs present the opportunity to increase productivity in newsrooms and several initiatives are aiming to assist news organizations in finding, training, and applying AI-based solutions responsibly Associated Press ; Partnership on AI . However, many have raised concerns about the application of LLMs in the field of journalism, including misinformation Pan et al. (2023), copyright violations Karamolegkou et al. (2023), and privacy implications Yao et al. (2024). Due to these issues, LLMs can be seen as a threat to journalistic integrity Wihbey (2024).

Refer to caption
Figure 1: The distribution of different input stimulus material types provided by journalists to LLMs, over the WildChat conversations that are matched to online articles from the two identified news agencies.
Refer to caption
Figure 2: Case study of a single-turn journalist-LLM interaction for article generation where an external article by another agency is used as the input stimulus. The generated draft is published by the journalist with little modification as the ROUGE-L between the published article we identified and the generation is 0.71. The ‘[’ and ‘]’ symbols denote portions of the text that have been replaced to minimize identifiability.

While previous work surveyed journalists’ reported usage of generative AI Diakopoulos et al. (2024); Gondwe (2023), the opaque nature of newsrooms and challenges in detecting AI-generated content Chakraborty et al. (2023); Weber-Wulff et al. (2023) have left a knowledge gap regarding its on-the-ground utilization. This work aims to bridge this gap by analyzing journalist-AI interactions, including queries, provided materials, intervention levels, and query-to-publish timelines.

To achieve this, we probe the the publicly available WildChat dataset Zhao et al. (2024) of human-chatbot interactions. We identify potential journalist queries in WildChat, match them to published articles on two news agency websites, and analyze the resulting set of interactions. Given WildChat’s scope and that other agencies likely use generative AI, we refer to the two identified agencies as Agency A and Agency B to maintain anonymity (see limitations and ethical considerations).

We categorize the types of tasks for which LLMs were used across Agencies A and B and the distribution of input materials employed to generate articles, shown in Figure 1. Inputs include articles from other agencies or private conversations, real case examples of which are depicted in Figures 2 and 3 (Agencies B and A, respectively). We also analyze the extent of human intervention in article generation by examining the overlap between the model generated drafts and the matched published articles, along with the time between generation and publication. Finally, we go beyond the WildChat dataset and study machine-generated text on Agency A’s and B’s websites using GPTZero Tian and Cui (2023), a state-of-the-art commercial LLM-generation detector.

Our findings raise concerns about the level of diligence and journalistic integrity, as we find substantial overlap between the model outputs and the published articles with a median ROUGE-L Lin (2004) score of 0.62, and a prompt-to-publication span of a single day. Alarmingly, 18%percent1818\%18 % of the identified input stimuli are other agencies’ news articles and 9%percent99\%9 % are potential private conversations, risking privacy breaches by sharing the data with the LLM-provider and publicly via WildChat. Our study points to a need for better guidelines surrounding LLM use by journalists, and also better AI literacy and education for model practitioners. A potential path forward could draw from human-computer interaction and usable security and privacy research on “nudging” users toward beneficial behaviors Acquisti et al. (2017), which may offer a promising solution in this scenario as well.

Related work. Our work relates to studies on human-chatbot interactions for assistive writing, such as in academic paper writing Liang et al. (2024b) and in the peer-review process Liang et al. (2024a), but differs in topic and methodology, providing visibility into interactions and ground truth for machine-generated articles. We provide an extended study of related work in Appendix A.

2 Method

Refer to caption
Figure 3: Case study of a muli-turn article generation using multiple stimuli, including an internal article from the same agency and an interview transcript. The ‘[’ and ‘]’ symbols denote portions of the text that have been replaced to minimize identifiability.

WildChat. Since our aim is to closely study the type of queries made by journalists to LLMs and to uncover interventions they make to model outputs before publication, we probe WildChat Zhao et al. (2024), a publicly available dataset of 650k111Recently a 1M version of this dataset was released, however our study is conducted on an earlier version. conversations collected by offering free access to GPT-3.5 and GPT-4 to users, for potential candidate interactions.

Identifying journalist queries. We initially identified conversations with 4444 or more PII types using an NER model222lakshyakh93/deberta_finetuned_pii, as we were interested in identifying sensitive disclosures. This yielded a 5k5𝑘5k5 italic_k turn subset of the data. We then manually inspected this set and found instances of article generation for two different news agencies: Agency A, a local news outlet based in southern California, and Agency B, a local news platform covering a small nation in the Mediterranean region. Using location-based keywords for these agencies, we filtered the entire 650k conversation WildChat dataset and manually reviewed the results to find additional conversations related to these agencies.

We acknowledge that this process may result in false negatives, missing cases of journalistic activity in WildChat that we did not identify. For reproducibility of results we have uploaded the list of verified WildChat conversation IDs along with our submission. (See ethical considerations for reasons why we upload conversation IDs but do not mention the specific agencies in the main text.)

To match the output of conversational turns to published articles and verify that the queries were made by the journalists, we searched the identified agencies’ websites for articles with highly similar content to the generated text.

Categorizing tasks. For each verified conversational turn, based on the user’s prompt, we classify the journalistic tasks: (1) article generation which involves the LLM creating a new article from the provided user instruction and input material, (2) headline generation which involves generating a headline for a given article, and (3) article editing which involves the user requesting edits to a provided draft article.

Categorizing input stimuli. By stimulus, we refer to the input material provided to the LLM assistant as source or context. One researcher examined all user prompts for article generation, developed a codebook that was reviewed by the team, and, to ensure coding consistency and retain interpretive nuance  Armstrong et al. (1997); Morse (1997), coded all stimuli. Then, a second researcher independently reviewed and validated the coding Whittemore et al. (2001). Stimuli was only coded as an externally sourced type (e.g., press release, external news article) when verbatim text matched online sources. Thus, these codes represent a lower bound on the external source material used.

Measuring journalist intervention. As a proxy for how much the journalists modify the model’s output before publication online, we report the ROUGE-L recall score Lin (2004) between the generated output (used as source) and the matched online article. This score reflects the longest common subsequence between the two text sequences, normalized to the length of the source. We also report this score over the model’s input and output.

Detecting LLM-generated articles beyond WildChat. To extend our study beyond the WildChat dataset, we scrape articles from the two identified news agencies’ websites from between January 1, 2020 and April 15, 2024, use GPTZero Tian and Cui (2023), a state-of-the-art commercial machine-generated text detection method to study the prevalence of LLM use, and report our findings.

3 Findings

Agency A B Sum Conversations 62 16 78 Turns 107 41 148 Verified Published Articles 79 32 111

Table 1: Evidence of AI-assisted journalism from WildChat

Table 1 summarizes the identified content from WildChat, with the last row showing the number of online articles we could match to the identified activity. We select two case studies that represent archetypal examples of different article generation behaviors, and use them for demonstrations throughout this section: Figure 2, a case of a single-turn article generation (from Agency B) and Figure 3, a multi-turn process of article generation (from Agency A). Below we discuss our observations from the two agencies. Appendices D and E provide more fine grained results for Sections 3.1 and 3.2, respectively.

3.1 What do journalists prompt LLMs with?

Figure 1 shows the distribution of input stimulus material. Across the 148 turns, there is a total of 182 stimuli used (all turns had at least one stimulus as input and some had multiple, some turns had multiple instance of the same stimulus type).

Stimuli from external sources (i.e., news articles from other agencies, press releases, organizational reports, event postings, organizations’ websites, social media posts) accounted for over two-thirds (83/137) of the identified stimuli for Agency A and about half (21/45) for Agency B. We also observe internally sourced stimuli types, such as draft material and articles originating from within the agency. Other types, like interview transcripts, are generated internally but may contain information and material from external sources, including potentially sensitive information from or about those sources.

In Figure 2, we observe Agency B using a news article from another agency as a stimulus, a practice we identified relatively frequently for both agencies. In another case of this, a user from Agency B specified, “write a new article out of the information in this article, do not make it obvious you are taking information from them but in very sensitive information give them credit.” Conversely, Figure 3 shows Agency A employing a mix of internal and external stimuli.

In terms of the tasks users specified in their inputs, the vast majority of turns, 83.1%, are for article generation, 14.5% are headline generation, and a minority are article editing. We provide more thorough results for the identified task types in Appendix B.

3.2 How much do journalists modify model outputs before publication?

The median ROUGE-L score between the machine-generated drafts and published article text is 0.62 (see Figure 4) which indicates limited human editing before publication. As a point of reference, a ROUGE score of 0.5 is considered high in privacy and policy domains Huang et al. (2023); MacLaughlin et al. (2020). Figure 2 depicts a specific use where the ROUGE-L score between the machine-generated draft and the published article is high at 0.71, indicating high overlap with the source text. In Figure 3, the ROUGE-L scores between the drafts and published articles are slightly lower but still relatively high. We also found a separate query (see Appendix C) to generate the headline for this article, and the user publishes the machine-generated headline with no editing.

Refer to caption
Figure 4: Box plots of ROUGE-L scores between outputs from users’ article generations prompts and LLM response (left) as well as the LLM-generated article and the corresponding published article (right).

3.3 What is the prompt to publication time?

For Agency A, the majority (48/79) of verified articles were published one day after the recorded activity in WildChat. Similarly, the majority (19/32) of verified articles for Agency B were published on the same day as the recorded activity. The days of human editing time before publication seen in Figures 2 and 3 are consistent with these findings, as illustrated in Figure 5.

Refer to caption
Figure 5: Distribution of days from generation (based on date in WildChat) to publication (based on published article’s date) for published articles matched to turns of article generation in WildChat.

3.4 Are there more LLM-generated articles?

To study the temporal trends in using LLMs for writing articles, we scrape articles from Agencies A and B, between January 1, 2020, and April 15, 2024, subsample 585 articles from each agency uniformly and utilize GPTZero to detect which articles might be machine-generated. Our findings, depicted in Figure 6 suggest that there are likely many additional published articles that may have been generated using ChatGPT. Furthermore, we observe a noticeable increase in articles with higher probabilities of machine generations following ChatGPT’s release in November 2022.

Refer to caption
(a) Agency A
Refer to caption
(b) Agency B
Figure 6: GPTZero probability scores for randomly selected articles from our scrape as well as the published articles matched to WildChat. The grey vertical line indicates November 2022, when ChatGPT was released.

4 Conclusion

We investigate the use of commercial LLMs in journalism by analyzing conversations from the WildChat dataset and matching them to published articles online. Our findings reveal the use of potentially unethical material to generate articles, limited human oversight on model outputs before publication, and the use of LLMs by the identified agencies beyond the scope of WildChat. These results suggest continuous, increasing generative AI use for news generation and necessitate guidelines for responsible AI in journalism.

Acknowledgements

This work was supported in part by NSF Award #2205171. We thank Yuntian Deng, Yanai Elazar, Melanie Sclar and Maria Antoniak for insightful discussions and feedback.

Limitations

We acknowledge that we are constrained by the WildChat dataset itself. We can only identify and study journalist-AI interactions included in the dataset, which limits our ability to generalize our results beyond the two agencies identified. Furthermore, given the nature of WildChat, we can only examine interactions with GPT-4 and GPT-3.5.

There may also be interactions that we were unable to identify within the dataset. To mitigate this, we performed additional searches for other cases of article generation, headline generation, and article editing, using relevant keywords and prompting GPT to annotate instances of non-fictional journalistic activity. Upon manually reviewing these results, we did not identify any additional conversations.

Additionally, there are concerns and questions regarding the effectiveness and feasibility of detecting machine-generated text Solaiman et al. (2019); Chakraborty et al. (2023); Sadasivan et al. (2023); Weber-Wulff et al. (2023). We recognize that GPTZero, the tool used for detection in this study, does not provide ground truth predictions.

Ethical Considerations

This study is motivated by a recognition that generative AI capabilities are rapidly evolving and their potential impacts on journalism call for thoughtful, thorough research. Our goal is to provide insights that can help shape guidelines and policies governing the responsible use of AI in this domain.

We hypothesize that Agencies A and B are not unique cases of news organizations using generative AI, but rather examples of a broader phenomenon that happen to be present in the WildChat dataset. Furthermore, the intent of this work is not to “call out” specific agencies or individual users, but to investigate and better understand the actual use of generative AI in journalism. Accordingly, we have anonymized the identities of the agencies in this paper.

Although we anonymize the agencies’ names and censor identifying information, we provide metadata and verbatim text that could be used to identify them. We include this information for scientific rigor and thoroughness as well as recognize that, even without it, others could independently identify the news agencies through WildChat. However, we do not encourage or condone the use of this data beyond our stated purpose of understanding AI usage in journalism. Our intent is to advance knowledge, not target or harm any individuals or organizations.

References

Appendix A Extended Related Work

Research at the intersection of journalism and generative AI is still in its early stages. However, work has begun to investigate its potential and actual uses in the field. As part of the Associated Press’s Local News AI Initiative, Diakopoulos et al. surveyed 292 individuals in the news industry about their use and opinions of generative AI in newsrooms. The survey revealed that generative AI is already being used for tasks, such as content production, and changing workflows and role definitions in the newsroom. Gondwe Gondwe (2023) investigated the use of ChatGPT by journalists in sub-Saharan Africa, finding that the system’s training on a non-representative African corpus limits its utility for the studied population.

Bdoor and Habes Bdoor and Habes (2024) experimented with using GPT to generate news content and discussed the trade-offs around its adoption in the newsroom. Pavlik Pavlik (2023) ‘co-authored’ an essay with ChatGPT to demonstrate both the capacity and limitations of generative AI in journalism and media education.

To address concerns about AI-generated news media, Kumarage et al. Kumarage et al. (2023) developed J-Guard, a framework for directing supervised AI-generated text detectors to identify AI-generated news articles.

Appendix B Task Type

Table 2 and Figure 7 provide breakdown of task types.

Task (turn count) Agency A Agency B
Article generation 89 34
Headline generation 18 1
Article editing 0 6
Table 2: Turn counts for each task type across both agencies.
Refer to caption
(a) Aggregated over both agencies
Refer to caption
(b) Agency A
Refer to caption
(c) Agency B
Figure 7: The distribution of different task types across turns from verified Wildchat journalist-LLM conversations.

Appendix C Additional Case Study

Figure 8 is an extension of Figure 3 where the user generates a headline for the machine-drafted article.

Refer to caption
Figure 8: A case study of headline generation, extending from Figure 3. The ‘[’ and ‘]’ symbols denote portions of the text that have been replaced to minimize identifiability.

Appendix D Stimuli

Table 3 provides stimuli types, descriptions, and frequencies. Table 4 includes the top five most frequent combinations of stimuli used in a turn of article generation. Figure 9 visualize the distributions of stimuli types for Agencies A and B, respectively.

Stimuli type Agency A Agency B
Draft or other: Draft of article or unidentified material 39 9
Press release: Article on the same topic released by the subject of the story 30 4
External news article: Article on the same topic from another news agency 18 15
Interview: Transcript or written interview responses 14 2
Organizational report: Official report from a city or country government or organization (e.g., a school district’s 100 Day Plan) 12 2
Event posting: Posting of event details (e.g., Meetup, event website) 12 0
Organization website: General information on an organization, company, or person (e.g., “About” page) 9 0
Intra-agency news article: Article on the same topic from within the agency 1 8
Email: Message from a source or editor about a story idea 0 5
Social media post: Post on social media (e.g., LinkedIn, GoFundMe) 2 0
Table 3: Types of stimulus used to generate articles in the identified WildChat activity and their frequencies across Agencies A and B, sorted in decreasing order from top to bottom by their combined frequencies.
Stimuli type(s) Agency A frequency Stimuli type(s) Agency B frequency
Draft or other 16 External news article 13
Press release 16 Draft or other 5
Organizational report 11 Press release 3
External news article 6 Email 3
Draft or other, external news article 5 Organizational report 2
Table 4: Top five most frequent combinations of stimuli types used in an individual turn of article generation.
Refer to caption
(a) Agency A
Refer to caption
(b) Agency B
Figure 9: The distribution of input stimuli types over the verified Wildchat journalist-LLM interactions for both agencies.

Appendix E ROUGE-L

Figure 10 depicts ROUGE-L distributions for prompts to GPT outputs and those outputs to published articles for both agencies.

Refer to caption
(a) Agency A
Refer to caption
(b) Agency B
Figure 10: ROUGE-L scores for prompt to machine-generated output (left) and that output to published article text (right) for both agencies.