1 Introduction
As digital technologies have become pervasive in our lives, society has experienced not only their benefits but also, increasingly, their undesirable consequences. Research and news headlines describe seemingly unavoidable side effects of digital technologies, from Instagram’s adverse effects on adolescent girls’ body image [
52] to Microsoft’s chatbot Tay using racist language [
112]. Technological progress is commonly seen as a moral commitment that is “legitimized no matter how dangerous” [
83, p.325]. While undesirable consequences are sometimes described as accidental and minor
“blips,” researchers, journalists, and policymakers have suggested that many cases could have been avoided if technology developers had been aware of similar issues and had evaluated their work carefully beforehand [
14,
61].
However, anticipating the various outcomes of technology is difficult [
70]. In fact, recent work has found that computer science (CS) researchers at the forefront of developing new technologies are eager to proactively consider undesirable consequences of their innovations, but lack well-formulated processes and tools to do so effectively [
31]. They reported that not having resources that provide a comprehensive understanding of “common problems” reduced their ability to anticipate undesirable consequences [
31, p.7]. Could insights into past undesirable consequences of technology help them gain awareness of potential future consequences?
We study this question by collecting a catalog of “common problems,” allowing CS researchers to explore past undesirable consequences as reported in technology magazines and research papers. Learning from prior incidents has proven to be useful in several settings, ranging from exploratory forecasting of technological advances [
60], improving software by studying a collection of previous defects [
49], to training future pilots using the aviation accidents database [
55]. Incorporating known and real-world case studies of ethical dilemmas into an undergraduate Human-Centered Computing class has also been shown to amplify students’ engagement in ethical thinking [
98]. What remains unknown is whether providing CS researchers with examples can have similar advantages, increasing their awareness of the various societal impacts of technology and supporting them in considering the potential consequences of their own projects.
A challenge in this domain is the lack of such a resource across diverse CS subfields, and its unclear impact on CS researchers, who often lack the time for in-depth consideration of diverse consequences [31, 99]. Hence, the goal of this paper is to explore whether providing CS researchers with a catalog of “common problems” improves their awareness of undesirable consequences. Our secondary goals are to find out how we can feasibly build a self-updating catalog, given that this information is currently scattered and the technology landscape is fast-moving, and how a system providing this service would be perceived and used by CS researchers. We tackle these questions by designing, developing, and evaluating
Blip, a prototype system that collects and showcases
a catalog of undesirable consequences of digital technologies.
Blip (1) automatically extracts real-world undesirable consequences of technology from any given online article using natural language processing (NLP) techniques, (2) summarizes and categorizes them based on the aspect of life that they affect (such as health, equality, or politics), and (3) presents them in an interactive, web-based interface (see Figure
1). Users can use
Blip to view, sort, and save the 5.7k summaries of undesirable consequences currently in the catalog, or to extract undesirable consequences from additional articles.
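To make this pipeline concrete, the sketch below illustrates one way such an extraction-and-categorization step could look. It is a minimal sketch under our own assumptions: the prompts, the aspect list, and the call_llm helper are illustrative placeholders, not Blip’s actual prompts or code.

```python
# A minimal sketch of an extraction-and-categorization step of this kind.
# The prompts, aspect list, and `call_llm` helper are illustrative assumptions,
# not Blip's actual implementation.
from dataclasses import dataclass

# Assumed subset of the "aspects of life" used for categorization.
LIFE_ASPECTS = ["health", "equality", "politics", "privacy", "environment", "other"]


@dataclass
class Consequence:
    summary: str     # one-sentence summary of the undesirable consequence
    aspect: str      # aspect of life it affects
    source_url: str  # link back to the original article


def call_llm(prompt: str) -> str:
    """Placeholder for a call to a hosted language model (hypothetical)."""
    raise NotImplementedError


def extract_consequences(article_text: str, source_url: str) -> list[Consequence]:
    """Extract and categorize undesirable consequences reported in one article."""
    extraction_prompt = (
        "List every undesirable consequence of technology reported in the "
        "article below, one short summary per line.\n\n" + article_text
    )
    summaries = [s.strip() for s in call_llm(extraction_prompt).splitlines() if s.strip()]

    consequences = []
    for summary in summaries:
        aspect_prompt = (
            f"Which aspect of life does this consequence affect? "
            f"Answer with exactly one of {LIFE_ASPECTS}.\n\n{summary}"
        )
        aspect = call_llm(aspect_prompt).strip().lower()
        consequences.append(
            Consequence(summary, aspect if aspect in LIFE_ASPECTS else "other", source_url)
        )
    return consequences
```

Keeping extraction and categorization as separate, article-grounded steps also makes it straightforward to link every summary back to its source, which users relied on to verify information.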
While considering undesirable consequences is not yet a common practice among researchers, we designed Blip to facilitate this process in the future. We tested our assumptions about Blip’s usefulness in two user studies. In the first study with nine CS researchers, we assessed Blip’s overall usefulness for considering undesirable consequences in their broader field (e.g., social media) compared to two alternative approaches—relying on their prior knowledge and searching for undesirable consequences online—and conducted in-depth interviews to further understand users’ perceptions of Blip and potential use cases. Our results show that Blip enabled participants to add an average of 7.00 undesirable consequences beyond those they could list by relying on prior knowledge and searching online. Participants perceived Blip as improving their ability to “think outside the box,” reported that it made them aware of consequences that they “had never considered,” and described it as an essential way to collect undesirable consequences “because you can’t just read a bunch of disconnected articles about this [and make sense of it].”
In our second study with six CS researchers, we followed up on these results, evaluating whether Blip is useful and actionable in the context of specific projects that participants work on across CS subdisciplines. All participants could find several undesirable consequences relevant to their specific projects in less than 15 minutes, on average. Some of these were immediately actionable. Overall, this paper contributes:
(1)
Empirical evidence that a catalog of undesirable consequences supports CS researchers in considering more, and more diverse, undesirable consequences than if they rely on their prior knowledge or an online search (Study 1) and that it enables them to uncover potential adverse effects of their own projects (Study 2).
(2)
An open-source, web-based system,
Blip 1, that collects, summarizes, and categorizes undesirable consequences. To develop
Blip, we designed an information distillation pipeline that leverages NLP techniques to efficiently establish a self-updating catalog of undesirable consequences.
2 Related Work
We use the term “undesirable consequences” to refer to negative consequences of digital technology that affect society [
27]. Oftentimes, undesirable consequences are unanticipated or even unintended [
65]. We chose “undesirable consequences” over the more prevalent “unintended consequences” to emphasize that our primary concern is the adverse effects of technology, regardless of whether they were intended. Research in HCI and Science and Technology Studies (STS) has contributed a large body of work on observed negative effects across digital technology domains and products, including mobile phones [
69,
87], the sharing economy [
30], machine learning [
18,
19], and social media [
28,
100]. Researchers have also described various aspects of our lives that may be adversely affected by technology, such as its impacts on the environment [
12], health [
5,
43], or privacy [
3]. Moreover, researchers have increasingly investigated and brought to our attention differential effects of digital technology on certain population groups, such as on different gender [
32] and racial groups [
17,
93], low-income and underserved communities [
30], or people in other countries [
79,
86,
92,
106].
Discussions and interventions for addressing undesirable consequences in research. With the increasing awareness of the potential adverse effects of technological innovations, the research community has started to engage in several efforts to prevent such incidents. For example, researchers have developed guidelines for ethical research and development [
4,
104], started dedicated conferences, such as FAccT, AIES, EAAMO, SIGCAS, and dedicated tracks (e.g., Critical Computing@CHI), led workshops [
102] and ethics committees [
20,
35], as well as called for changes in institutional structures [
10,
80], critical education [
53], and in how we address undesirable consequences of digital technologies [
14,
A key concrete step was the inclusion of broader impact or ethics statements at major conferences, such as IUI [
1], NeurIPS [
11], ACL [
101]. Nanayakkara et al. [
71] found that such statements diversified thinking about how ML research could potentially impact society, though they tended to focus on positive impact [
6]. Additionally, there are calls for researchers within different computing communities to accurately report the design considerations of their datasets [
9,
38], models [
67,
89], and tasks [
56,
68], as well as to evaluate and de-bias their products [
4,
8,
104,
111].
Methods for forecasting and anticipating undesirable consequences. Researchers have designed tools to help identify and contemplate social values of different stakeholders, such as the Envisioning Cards [
36], Tarot Cards of Tech [
41], and Value Cards [
94] (see also [
22] for a detailed overview). The value-sensitive design approach has also contributed broad guidelines for researchers seeking to account for human values in a principled and systematic manner throughout the design process [
37,
118]. The Future Ripples method [
34], inspired by the Futures Wheel foresight method in education [
39], allows collaborative brainstorming on the impact of innovation through workshop activities.
However, some of these approaches and methods have been challenged for not sufficiently supporting practitioners and the reality of the product-development process [
40]. Using these methods requires prior knowledge of the topic at hand, which novice users may not have. They also require developers to deliberate simultaneously and collectively in a team, sometimes with external experts, and the envisioned consequences can vary depending on the team’s diversity and backgrounds. In fact, an interview study with 20 CS researchers found that none of the participants were actively using these tools in practice [
31].
An alternative approach for anticipating undesirable effects of technology is
learning from past incidents [
64,
114]. While perfectly predicting the future may be impossible, researchers have developed various methods to estimate what may happen based on past experience. For example, the Delphi forecasting method [
95,
109] has been used in a wide variety of domains such as predicting air travel [
33] and designing educational technology [
76]. Another forecasting method is the case study method, which collects people’s thoughts on and experiences with past technology developments in an organization [
21].
One barrier to the widespread use of these methods is that they usually rely on experts to collect and interpret historical data, making them difficult to scale and to use frequently.
In another attempt to anticipate undesirable consequences, prior work has developed a forum to collect news articles about technologies [
73] and an AI incident database specifically for the effects of AI technology on society [
64]. One of the motivations for this database was that “the artificial intelligence system community has no formal systems whereby practitioners can discover and learn from the mistakes of the past” [
64, p.1]. However, the AI incident database relies on the crowd to manually browse for and enter incidents, using a leaderboard to incentivize contributions from volunteers. To the best of our knowledge, no previous system automatically and systematically catalogs, summarizes, and categorizes undesirable consequences across a variety of technology domains, and none of these approaches have been formally evaluated to show their usefulness for anticipating undesirable consequences.
In short, many prior tools prompt users to reflect on high-level ethical questions, requiring prior knowledge while offering no easy access to up-to-date real-world examples. This paper explores the value of providing researchers with concrete examples that can ground their ethical considerations in practice.
Supporting ideation of potential undesirable consequences.
Blip was also inspired by work on creativity support tools, which showed that a collection of diverse examples can support ideation [
81,
97]. For example, sampling diverse inspirational examples (and providing a visual overview of the ideas) has been found to improve people’s brainstorming activity [
96]. Similar work on cognition and creativity support confirmed that examples inspire and unveil new and diverse ideas [
29,
51,
74,
75,
116]. To organize these examples, prior work has leveraged categories based on topics and characteristics; such categorization is essential to human cognition [
90,
108]. For example, IdeaRelate facilitates the exploration of COVID-related examples by tagging them with different topics, helping users include more perspectives in their own idea generation [
Recent work has attempted to incorporate
language models to help ideate potential harms [
16,
82]. In particular, AngleKindling used few-shot LLM prompts to find potential controversies and negative outcomes from press releases to help journalists generate story ideas [
84]. However, zero-shot and few-shot approaches to generating consequences have produced rather generic results [
84]. In this work, we aid the inherently creative process of reflecting on past and possible future adverse effects by providing a catalog of undesirable consequences, supplemented with information on the diverse aspects of life that they have affected. Instead of relying on ideas generated entirely from language models, we extract relevant information directly from a wide range of online articles, provide access to the original content, and update our collection
every week.
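As a rough illustration of what such a self-updating collection could look like, the sketch below polls a set of article feeds and appends newly extracted consequences to a catalog file. The feed URLs, the file format, and the extract_consequences stub (standing in for the extraction step sketched earlier) are assumptions for illustration, not Blip’s implementation.

```python
# A minimal sketch of a weekly, self-updating collection step, assuming the
# article sources expose RSS/Atom feeds. Feed URLs, the catalog file format,
# and the extraction stub are illustrative assumptions, not Blip's code.
import json
import time
import urllib.request

import feedparser  # third-party RSS/Atom parser

FEEDS = ["https://example.com/tech-news/rss"]  # placeholder source list
CATALOG_PATH = "catalog.jsonl"                 # one consequence per line
ONE_WEEK_SECONDS = 7 * 24 * 60 * 60


def fetch_article_text(url: str) -> str:
    """Fetch raw article HTML; a real pipeline would strip boilerplate first."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_consequences(article_text: str, source_url: str) -> list[dict]:
    """Stand-in for the LLM-based extraction step sketched earlier (hypothetical)."""
    raise NotImplementedError


def update_catalog(seen_urls: set[str]) -> None:
    """Append consequences from newly published articles to the catalog."""
    with open(CATALOG_PATH, "a", encoding="utf-8") as catalog:
        for feed_url in FEEDS:
            for entry in feedparser.parse(feed_url).entries:
                if entry.link in seen_urls:
                    continue
                seen_urls.add(entry.link)
                text = fetch_article_text(entry.link)
                for consequence in extract_consequences(text, entry.link):
                    catalog.write(json.dumps(consequence) + "\n")


if __name__ == "__main__":
    seen: set[str] = set()
    while True:  # in practice, a cron job or task scheduler would drive this
        update_catalog(seen)
        time.sleep(ONE_WEEK_SECONDS)
```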
6 Discussion
Our goal in this work was to evaluate whether providing CS researchers with an easily accessible catalog of undesirable consequences of digital technologies could improve their ability to learn about and consider adverse effects, as prior work had suggested [
31,
64,
98]. To study this question, we developed
Blip, a web-based prototype that leverages language models to automatically derive undesirable consequences from any given online article.
Blip addresses the difficulty of having broad knowledge of potential undesirable consequences, which, according to Merton, is “the most obvious limitation to a correct anticipation” [65].
Our results show Blip’s potential for supporting CS researchers in gaining awareness of a broad range of undesirable consequences. In Study 1, Blip supported researchers in finding more, and more diverse, undesirable consequences of technology in their CS subdiscipline even after they had listed ones from prior knowledge and searched online. We found that, when relying on their prior knowledge, participants thought of only 6.11 unique undesirable consequences on average, despite being experts in their technology domain. They were often stuck describing undesirable consequences within one or two commonly known “aspects” (which were sometimes part of their research focus). This indicates that many researchers do not have a thorough and broad awareness of the undesirable consequences within their technology domain. Intuitively, searching online might be a better option for exploring undesirable consequences beyond one’s prior knowledge. However, our results showed that searching online added only 3.88 undesirable consequences on average and was perceived as tedious. The fixation issue persisted when searching online: participants mostly used search terms related to the undesirable consequences they had already listed. This underscores the insufficiency of these traditional approaches, which often lead to a narrow focus that overlooks broader and potentially more impactful consequences. Compared to the two baseline conditions, Blip supported participants in listing and learning about undesirable consequences that were often beyond the commonly known ones. In summary, Study 1 demonstrates that relying on prior knowledge and an online search, without any tooling support, is often perceived by participants as tedious and insufficient for thinking about the undesirable consequences of technology.
In Study 1, the qualitative responses further illustrate the most helpful parts of
Blip’s design. We found that participants perceived
Blip’s summaries of undesirable consequences as beneficial for efficiently gaining an overview, and they appreciated having access to the original articles to ensure information integrity.
Blip’s categorization into different life aspects was seen as a motivating nudge for exploring consequences broadly—a finding that is in line with the results of our quantitative analysis. This result extends prior work in creativity and cognition, which found that providing a solution space organized along a set of dimensions breaks people’s tendency to fixate [
74,
96]: Providing a diverse set of undesirable consequences can help technology experts consider societal implications broadly and reveal those that they would have otherwise not thought about.
Study 2 aimed to evaluate whether
Blip provides actionable information when researchers freely use it to find potential undesirable consequences relevant to their specific projects, rather than to the whole field. We found that participants took less than 15 minutes, on average, to gather a set of consequences relevant to a specific research project and bookmarked an average of 7.67 unique undesirable consequences during this time. While this second study was not designed to determine whether participants were able to
comprehensively find undesirable consequences, participants’ comments suggested that the ones they found inspired them to think of undesirable consequences more broadly. The finding also suggests that
Blip could be a useful resource for gathering undesirable consequences to include in a paper’s ethics statement, well beyond the average of 0.6 words that are included in ethics statements in NeurIPS AI papers, for example [
6,
71].
Our follow-up interview and survey, however, painted a more complicated picture of anticipating undesirable consequences using
Blip in practice. After using
Blip for their own projects, two of six participants in the second study were neutral on
Blip’s usefulness for their projects, though all agreed that the system is useful
for others. Some participants acknowledged that they were not in the right position to think about these issues, though they learned “a lot about social impact.” This result resonates with the narrative reflected in Do et al.’s work [
31] that CS researchers tend to deflect the responsibility to consider the adverse effects of technology.
This brings us to how we see
Blip supporting researchers in the future. Participants suggested that they would use
Blip for inspiration when writing broader impact statements in papers, which could lower the perceived burden of writing them [
2] and potentially counteract the focus on desirable outcomes [
103]. Ultimately, however, it would be ideal if CS researchers routinely learned about, anticipated, and reflected on undesirable consequences—as has been repeatedly advocated for [
44,
61]—and did so early and proactively, while addressing undesirable consequences is still feasible [
31]. We envision
Blip as a tool that supports researchers in doing so, both by appealing to their intrinsic curiosity and through extrinsic incentives, such as fulfilling the requirements of a conference, grant agency, or institution.
As Study 2 suggests,
Blip, while useful, is not a silver bullet that can achieve this shift on its own. Doing so would require not only a catalog of concrete consequences, as provided by
Blip, but also systemic changes in culture and structural incentives. Tools like
Blip may spark new conversations and alleviate the perceived burden of anticipating undesirable consequences, particularly if research institutions evolve to actively encourage this reflection. For example, researchers could easily explore a wide variety of past undesirable consequences in their domain before launching a new project. When writing ethics statements (e.g., for papers or grant proposals), researchers may use
Blip to efficiently and thoroughly examine their case. They could engage with the information and stay updated on the latest undesirable consequences. A crucial aspect of this institutional change is nurturing a mindset among researchers that recognizes the importance of contemplating these challenges, thereby embedding the practice of considering undesirable consequences as a fundamental aspect of responsible research.
In the long run, we envision
Blip becoming an integral part of the technology development and research process, supported by strong incentives and the systemic changes suggested in prior work [
10,
44]. We hope that using
Blip will inspire the research community to work towards a future in which learning about and anticipating undesirable consequences becomes the norm.
7 Limitations & Future Work
A limitation of the Blip prototype is that it currently only uses online technology magazines and CHI papers to retrieve undesirable consequences of technology. These sources may not yield a comprehensive catalog of undesirable consequences. In particular, the catalog may not adequately reflect the consequences that diverse user groups experience, given that these articles are commonly written for ‘tech-savvy’ audiences. In future work, we plan to systematically explore differences in the reporting of undesirable consequences across tech magazines, newspapers, and research papers from diverse fields, and to augment Blip’s sources accordingly. Future work should also incorporate non-English articles and non-American media outlets to better reflect the effects of technology on diverse users. Another improvement would be to tag each consequence with multiple life aspects rather than a single category. Encouraged by the feedback from our participants, we also believe that there are exciting opportunities to enable citizen scientists to document their personal experiences with undesirable consequences in Blip. This could satisfy users’ desire to share their own experiences while enabling insights into potential differential effects of technology on people.
We designed
Blip using LLMs due to their increasing performance on NLP tasks such as classification and summarization [
15]. Nevertheless, LLMs can also introduce serious undesirable consequences, such as model hallucination, biases in the training data, and a limited understanding of emerging fields. Our work extracts relevant information directly from trusted sources (i.e., online articles and papers) and provides access to the original content, instead of directly prompting LLMs to generate the information. In this paper, we offered a preliminary evaluation of the individual components within
Blip, using quantitative metrics such as the F1 score. While our metrics are comparable to those of similar ML tasks (see Section
3.4), there is a tradeoff between foraging consequences at scale and ensuring perfect accuracy. Involving citizen scientists could help improve the overall quality of the
Blip data curation process, for example by providing feedback on article relevance and label accuracy, and by contributing their own summaries and labels for existing articles.
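For readers unfamiliar with the metric, the snippet below shows how such a component-level F1 evaluation could be computed against a small hand-labeled gold set. The labels are invented purely for illustration and are not data from our evaluation; the actual metrics are reported in Section 3.4.

```python
# Hypothetical gold labels and predictions, invented purely for illustration.
from sklearn.metrics import classification_report, f1_score

gold = ["privacy", "health", "equality", "privacy", "politics", "health"]
pred = ["privacy", "health", "privacy",  "privacy", "politics", "health"]

print("macro F1:", round(f1_score(gold, pred, average="macro"), 3))
print(classification_report(gold, pred, zero_division=0))
```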
We also foresee several different use cases for
Blip. For example, our participants indicated that
Blip could help researchers seek inspiration when writing broader impact statements for publications or grant proposals. Researchers may use
Blip to introduce background knowledge about their areas when presenting their solutions to the public. Additionally, policymakers and practitioners could use it to inform their work. Journalists could leverage a system like
Blip to discover new angles when reporting news, especially technology mishaps (see a related tool specifically designed for journalists for story inspiration [
59]). Interested members of the public could efficiently learn how digital technology has already affected, and will continue to affect, their lives. We believe that
Blip could also aid different stakeholders in reflecting on societal issues.