DOI: 10.1145/3544548.3581351

Comparing Sentence-Level Suggestions to Message-Level Suggestions in AI-Mediated Communication

Published: 19 April 2023

Abstract

Traditionally, writing assistance systems have focused on short or even single-word suggestions. Recently, large language models like GPT-3 have made it possible to generate significantly longer natural-sounding suggestions, offering more advanced assistance opportunities. This study explores the trade-offs between sentence-level and message-level suggestions for AI-mediated communication. We recruited 120 participants to act as staffers from legislators’ offices who often need to respond to large volumes of constituent concerns. Participants were asked to reply to emails with different types of assistance. The results show that participants receiving message-level suggestions responded faster and were more satisfied with the experience, as they mainly edited the suggested drafts. In addition, the texts they wrote were evaluated as more helpful by others. In comparison, participants receiving sentence-level assistance retained a higher sense of agency, but took longer for the task as they needed to plan the flow of their responses and decide when to use suggestions. Our findings have implications for designing task-appropriate communication assistance systems.

1 Introduction

Traditional communication assistance systems have generally focused on short suggestions to improve input efficiency. With the emergence of large language models such as GPT-3 [8], it has become possible to generate significantly longer natural-sounding text suggestions, opening up opportunities to design more advanced writing assistance to help humans with more complex tasks in more substantial ways [31, 48]. Such assistance can be especially helpful in communication scenarios in which a single point of contact needs to manage large volumes of correspondence, e.g., customer service representatives addressing customers’ queries, professors attending to students’ emails, as well as elected officials responding to their constituents’ concerns.
The generative capabilities of current models enable a wide range of possibilities for designing assistance systems. In this work, we explore two writing assistant design choices, namely sentence-level and message-level text suggestions, and empirically analyze the trade-offs between them in email communication.1 We consider the practical scenario of staffers from legislators’ offices responding to vast amounts of constituents’ concerns as the context for our study. In this context, the volume of correspondence can become overwhelming, making intelligent assistance especially needed [13, 42]. At the same time, the high-stakes nature of political communication calls for more careful and comprehensive research to better understand the potential benefits and risks of any type of technical assistance that may be introduced to the existing workflow.
To advance our understanding of different assistance options, we develop dispatch (Section 3), an application that can serve as a platform to simulate the process of a staffer responding to constituents’ emails, allowing us to design and set up an online experiment to test different types of suggestions (Section 4). We recruited 120 participants to act as staffers from legislators’ offices and respond to three emails expressing constituents’ concerns under three different conditions: 40 participants received no assistance, 40 received sentence-level suggestions, and 40 received message-level suggestions, with both types of suggestions generated by GPT-3.
By observing participants’ interactions with the text suggestions and surveying their perceptions of the assistance they received, we are able to compare how sentence-level and message-level suggestions affect the participants’ writing experience as well as the responses they eventually produce. The results show that participants who received message-level suggestions generally found the suggested drafts natural and mainly edited on top of them. They finished their responses significantly faster, demonstrating an increased level of efficiency in responding, and were generally more satisfied with the assistance they received. In comparison, participants receiving sentence-level assistance took longer, since they still had to plan the key points to cover in their responses while deciding where to trigger suggestions and which ones to use. Although they retained a higher sense of agency, they reported lower levels of satisfaction, demonstrating the challenge of designing assistance systems with finer-grained control. We also find that participants were more hesitant about such assistance when imagining themselves on the receiving end; this discrepancy points to the need to take receivers’ perspectives into account when designing and introducing assistance systems into workflows where trust is highly valued. We discuss the implications of our experimental results for designing task-appropriate communication assistance systems (Section 5).

2 Related Work

Figure 1:
Figure 1: The basic dispatch interface. The subject and body of the constituent letter are displayed on the left, while an editor for drafting the response is provided on the right.

2.1 Advances in AI text generation

Advances in machine learning have led to a new generation of language models [6] capable of producing text indistinguishable from human-written content [24, 29]. Enabled by improvements in computer hardware and the transformer neural network architecture [43], models like GPT-3 [8] have attracted attention for their ability to generate text that mimics the style and substance of the inputs. Cautious voices have warned about the ethical and social risks of harm from large language models [45, 46], ranging from discrimination and exclusion [8, 23, 35] to misinformation [29, 33, 39] and environmental [41] and socioeconomic harms [5].
However, these same technologies have the potential to usher in a range of beneficial real-world applications [6]. These models could aid in journalism, generate weather and financial reports, and write customer-service responses, with particular value in domains where the writing is repetitive or high in volume.
Building on this core technological foundation, more recent research in computer science, HCI, and linguistics has focused on input efficiency, often by exploiting linguistic information to speed up the writing process [30]. Early predictive text systems such as T9 relied on word frequencies to suggest word continuations [25]. More advanced systems combine behavioral data [18] with information at the sentence level [44] to predict users’ intentions and complete entire phrases or sentences [2, 11]. To increase the likelihood of a matching suggestion, systems like today’s smartphone keyboards provide multiple suggestions in parallel [26]. Some systems, like Google’s Smart Compose [12], use the estimated utility or probability of acceptance to determine whether suggestions should be shown.
Writing assistants usually provide only short or even single-word suggestions [15, 17, 38], under the assumption that for longer suggestions, the time required to evaluate the suggestion may distract from or even slow down the process of composition. Indeed, prior studies have suggested that writing suggestions can reduce typing performance and degrade the user experience [3, 10, 37]. However, this may change with advances in the quality of text generated by language models [6, 8]. Massive transformer neural networks [43] that capture more complex user intents may be able to provide higher-quality suggestions that are more likely to be useful, thus reducing the relative cost of evaluation. In addition, these models may provide ideas and inspiration [31, 40, 49] beyond simply increasing text input efficiency.

2.2 The use of technology in political communication

Democratic accountability implies communication between elected leaders and constituents, wherein constituents write to express their concerns and preferences and elected leaders respond to articulate how they plan to or have addressed these preferences [20, 22]. As technology has made it easier to contact members of Congress, for example through representative websites with “contact” buttons and civil society organizations providing email templates, the volume of mail has increased considerably, making the task of meaningfully processing and responding to correspondence more difficult [16].
Social media has provided one way for elected leaders to correspond with large numbers of constituents to help understand their concerns and explain leaders’ positions [4]. While voters have a number of tools available to contact their legislators and craft letters and emails, legislators lack analogous tools to help craft responses. This contributes to legislative staffers, tasked with the responsibility of reading and responding to the large volume of incoming mail, being less responsive to communications from some groups [4].
It is in this space where AI-mediated communication could potentially be fruitful. Research on the use of language models or writing assistants within the political process has thus far been limited; however, we can look to work on the political ramifications of social media feeds and recommender systems [50] for clues about the possible impact of these advancements. Despite initial excitement about these technologies’ democratic potential [27], scholars have identified the potential for these technologies to become the subject of powerful political and commercial interests [7] that may undermine democratic institutions [1]. Even unintentionally, design choices related to algorithmic optimization may lead to self-reinforcing opinion dynamics [9]. Similarly, language model writing assistants also have to be designed carefully if they are to be used for political communication. Prior work has shown that when language models perform poorly (e.g., produce repetitive outputs), they may corrode constituents’ trust in their elected representatives [28]. This suggests that humans curating model outputs, as well as continued improvements in providing diverse types of high-quality suggestions—from short, single-word suggestions to paragraph length or longer—could alleviate these impediments.
Building on this research, it appears that human-crafted responses, facilitated by language models that offer suggestions, could play a role in facilitating thoughtful interaction. Kreps et al. [29] have shown that people are largely unable to detect political news generated by recent generations of language models, suggesting that these models, assisted by a human in the loop, would be effective at generating content that bridges elected leaders with their constituents. Nonetheless, any use of language models for political purposes will need to be carefully assessed in the light of their political consequences that research such as this uncovers.

3 Designing Dispatch

To understand the trade-offs between message-level and sentence-level suggestions, we built dispatch, a platform to simulate the scenario of staffers from legislative offices responding to constituents’ concerns. The basic dispatch interface is shown in Figure 1. A letter from a constituent is displayed on the left, while an editor for drafting the response is provided on the right. The editor supports all typical actions for writing, e.g., typing, deleting, and cursor movements.
On top of this basic interface, we build two different versions of dispatch, one that offers sentence-level suggestions and another that offers message-level suggestions.
Sentence-level suggestions. We allow users to trigger two types of sentence-level suggestions. First, users can receive suggestions responding to specific points raised in the constituent letter by highlighting the sentence to respond to and then typing “@” in the editor (Figure 2). Second, users can trigger suggestions to continue the text they have already written2 by typing “@” without highlighting any text. In both cases, users are presented with a drop-down menu displaying five candidate suggestions, from which they can select one or none.
Message-level suggestions. In the message-level suggestion interface, users can click on the “Generate” button to obtain a full response draft. The suggestion is directly loaded into the editor (Figure 3) for users to make further edits.3
Figure 2:
Figure 2: Users can trigger sentence-level suggestions by typing “@” in the editor. Five candidate suggestions will be shown in a drop-down menu. If the user has highlighted a sentence in the email (as shown in the figure), response suggestions are provided. Otherwise, suggestions are given to continue the current draft.
Figure 3:
Figure 3: Users can trigger a message-level suggestion by clicking the “Generate” button. The response suggestion will be directly loaded in the editor on the left as shown.

4 System Evaluation

4.1 Experiment setup

Constituents’ letters. As proxies for constituents’ emails, we sample open letters delivered to elected officials in the United States through Resistbot, a service that advertises the ability to compose and send letters to legislators in less than two minutes.4 We obtain images of these open letters by retrieving tweets published by @openletterbot5 using the Twitter API and extract the contents of the letters using Python Tesseract.6 For each letter, we keep only the content of the letter, removing the sender’s first name and the state they are from, if applicable. In addition, we only consider letters that are sent by multiple people to ensure that the letters are representative but not personally identifiable.
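As an illustration of this extraction step, the following is a minimal sketch of how the letter text could be pulled from downloaded open-letter images using Python Tesseract; the directory name and cleanup steps here are illustrative assumptions, not the exact pipeline used in the study.

import glob

import pytesseract
from PIL import Image

def extract_letter_text(image_path: str) -> str:
    """Run OCR on a downloaded open-letter image and return its raw text."""
    image = Image.open(image_path)
    return pytesseract.image_to_string(image)

# Hypothetical local folder of images retrieved from @openletterbot tweets.
letters = []
for path in sorted(glob.glob("open_letters/*.png")):
    text = extract_letter_text(path).strip()
    # Dataset-specific cleanup (removing the sender's first name and state,
    # filtering to letters sent by multiple people) would follow here.
    letters.append(text)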
To select letters to use as prompts for participants to respond to, we consider topics that are generally relatable but not overtly polarizing. Our choice is based on two considerations. First, common concerns constitute a significant portion of the emails legislators need to address, as we observe a substantial number of near-duplicate open letters expressing similar concerns. Second, more general topics make the task manageable for participants who might not be familiar with very niche topics. While staffers would also have to respond to more specific questions, focusing on common concerns is sufficient for exploring the difference between the two suggestion types. The twelve letters we select span topics such as health insurance, climate change, and COVID relief policies.7 The average length of the selected letters is 97 words.
Figure 4:
Figure 4: A. Sample constituent letter and reply from the message-level suggestions condition. Black text was suggested, red struck-out text was removed from the suggestion, and green text was added by the participant. B. Sample reply and constituent letter from the sentence-level suggestions condition. Each circled number represents a place where the participant prompted a suggestion. (1) and (3) were prompted with letter text highlighted. The highlighted text and the suggestion given were highlighted in purple in (1) and light blue in (3). Suggestion (2) was a continuation suggestion, prompted without any text in the constituent letter highlighted. Best viewed in color.
Experimental conditions. In our experiment, we randomly assign participants to one of three conditions:
(1) Control: participants respond to emails in their own writing, with no response suggestions.
(2) Sentence-level suggestions: participants have the option to trigger sentence-level suggestions, with the option to accept and edit them.
(3) Message-level suggestions: participants have the option to trigger an automatically generated full email response draft to edit.
Generation model. We use GPT-3 (specifically, the text-davinci-002 model without any fine-tuning) to generate suggestions. We set max_tokens to 200 when generating full email drafts and 20 when generating sentence-level suggestions. In both conditions, we set temperature to 0.7 and top_p to 0.96. Our application was reviewed and approved by OpenAI before launching the experiment.
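For concreteness, below is a minimal sketch of how the two kinds of suggestions could be requested with these decoding parameters through the legacy openai Python completions interface; the prompt templates and function names are illustrative assumptions, not the exact prompts used by dispatch.

from typing import List, Optional

import openai  # legacy openai-python Completion interface

MODEL = "text-davinci-002"

def message_level_suggestion(letter: str) -> str:
    """Request one full response draft for a constituent letter (assumed prompt)."""
    prompt = f"Constituent letter:\n{letter}\n\nDraft a reply on behalf of the legislator's office:\n"
    resp = openai.Completion.create(
        model=MODEL,
        prompt=prompt,
        max_tokens=200,   # message-level drafts
        temperature=0.7,
        top_p=0.96,
    )
    return resp["choices"][0]["text"].strip()

def sentence_level_suggestions(context: str, highlighted: Optional[str] = None) -> List[str]:
    """Request five short candidates, either responding to a highlighted point
    or continuing the current draft (see footnote 2)."""
    if highlighted is not None:
        prompt = f"Constituent's point: {highlighted}\nOne sentence responding to this point:"
    else:
        prompt = context  # e.g., the last 30 tokens before the cursor
    resp = openai.Completion.create(
        model=MODEL,
        prompt=prompt,
        max_tokens=20,    # sentence-level suggestions
        temperature=0.7,
        top_p=0.96,
        n=5,              # five candidates shown in the drop-down menu
    )
    return [choice["text"].strip() for choice in resp["choices"]]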
Participants. We recruit 40 participants for each experimental condition via the platform Prolific [36]. We only consider participants who are located in the United States, fluent in English, and have listed “politics” as one of their hobbies. Each eligible participant is allowed to take part in at most one condition. We pay $5.00 for each task session based on an estimated completion time of 20 minutes. The actual completion time in each condition is shown in Figure 5. The experiment received Institutional Review Board approval from Cornell University.

4.2 Writers’ experience

To understand the participants’ writing processes, we track the overall time they spend on the task, the suggestions they trigger, and the responses they submit. Figure 4 shows sample responses annotated with participants’ interactions with the text suggestions.
Completion time. We compare the average completion time across the three experimental conditions to explore whether offering writing assistance helps participants respond to emails more efficiently (Figure 5). We find that participants in the message-level suggestions condition took the least time to complete the task. On average, they finished responding to all three emails in 8.53 minutes, which is significantly faster than both participants in the control group (M = 16.40, t(78) = −3.71, p < 0.001) and participants in the sentence-level suggestions condition (M = 15.77, t(78) = −4.30, p < 0.001).8 This suggests that offering drafting suggestions has the potential to help people write responses faster.9 However, we do not observe a significant difference between the completion times of the sentence-level suggestions condition and the control condition, potentially because the time saved on generating ideas and typing sentences was offset by the time spent choosing between suggestions as well as the time wasted when generated suggestions were not good enough to be used. When the time taken to select suggestions is removed, the total writing time is closer to that of the message-level suggestions condition (M = 11.93), though the difference between the sentence-level and control conditions is still not statistically significant at the Bonferroni-corrected alpha level of 0.0125, with t(78) = 2.21, p = 0.030.
Figure 5:
Figure 5: Participants in the message-level suggestions condition tend to finish the task significantly faster than both participants in the control condition and participants who can trigger sentence-level suggestions.
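As a reference for the completion-time analysis above, a minimal sketch of the comparison procedure (independent-samples t-tests with a Bonferroni-corrected alpha level, per footnote 8) is shown below; it assumes per-participant completion times in minutes are stored in lists per condition, and the variable names are illustrative.

from scipy import stats

def compare_conditions(times_a, times_b, n_comparisons=4, alpha=0.05):
    """Independent-samples t-test with a Bonferroni-corrected significance threshold."""
    t, p = stats.ttest_ind(times_a, times_b)
    corrected_alpha = alpha / n_comparisons  # 0.0125 for the four comparisons in this subsection
    return t, p, p < corrected_alpha

# Hypothetical usage with per-condition completion times (in minutes):
# t, p, significant = compare_conditions(message_level_times, control_times)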
Interactions with the suggestions. A central question across the conditions is how participants used the generated suggestions. We consider each response writing process—starting when the participant views a letter in the interface and ending when they save their reply—as one interaction. Most participants have exactly three interactions, one for each letter; when a participant has more than one interaction with the same letter, we keep the one that resulted in the saved reply. This gives 120 recorded interactions (i.e., 40 participants × 3 letters) per experimental condition.
In the message-level suggestions condition, every participant queried the model for at least one suggestion.10 Once a participant received a suggestion, they often stuck closely to it: on average, 75.75% of the tokens in the final replies came from the suggestions, while the other 24.25% were added by the participants (Figure 6). Furthermore, in 31 (25.8%) of the interactions, participants accepted the suggestions without editing them, and only in 2 (1.67%) did a participant choose to completely remove a suggestion and write their own response.
Figure 6:
Figure 6: Average percentage of participant-written text across all the conditions. All of the control replies are participant-written while almost a quarter of the message-level replies and half of all sentence-level replies are participant-written.
The sentence-level suggestions condition has more complex interaction patterns because participants were expected to query for suggestions multiple times and in two distinct ways (either with or without highlighting text from the letter to respond to). As a result, participants queried the model for many more suggestions: on average, 3.72 suggestions with highlighting and 2.91 without, per email. In contrast to the message-level suggestion condition, these suggestions were not used as often. Participants accepted only 3.32 of them per email on average, and in 9 (8%) interactions, no suggestions were used at all. We also observed a difference in acceptance rates between queries with and without highlighting: participants accepted 60.5% of suggestions with highlighting and 36.7% of suggestions without it. Finally, participants contributed more tokens themselves in the final response compared to their counterparts who received message-level suggestions, as only 50.60% of the tokens in the final replies originated from the suggestions they triggered (Figure 6).
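As a note on how these percentages could be derived from the interaction logs, below is a rough sketch that assumes each token of a saved reply was labeled during logging as either suggestion-inserted or participant-typed; this log format is an assumption, not the study's actual instrumentation.

def percent_from_suggestions(token_sources) -> float:
    """token_sources has one entry per token of the final reply, each either
    "suggestion" or "participant" (hypothetical log format)."""
    if not token_sources:
        return 0.0
    n_suggested = sum(1 for source in token_sources if source == "suggestion")
    return 100.0 * n_suggested / len(token_sources)

# e.g., percent_from_suggestions(["suggestion", "suggestion", "participant"]) ~= 66.7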

4.3 Writers’ perceptions

Figure 7:
Figure 7: Participants who receive message-level suggestions (Right) tend to be more satisfied with the suggestions than participants in the sentence-level suggestions group (Left).
Figure 8:
Figure 8: Participants who receive sentence-level suggestions (Left) retain a higher level of agency (Top row, “I wrote the emails") than participants in the message-level suggestions group (Right).
To further understand how each type of assistance is perceived by the users, at the end of the experiment, we surveyed the participants about their perceptions of the suggestions they received and their level of comfort towards political communication mediated by the type of AI assistance they just experienced. The post-task survey consists of both Likert-scale questions and free-form feedback about their writing experience (See Appendix A.2 for the full list of questions).
Perceived helpfulness of the suggestions. Participants who received message-level suggestions generally agreed that the system was easy to use and that the suggestions they received were natural and useful (Figure 7, Right). Participants in the sentence-level suggestions condition, however, had more divided views and did not rate the naturalness and usefulness of the suggestions as favorably (Figure 7, Left).
This contrast is also reflected in the free-form responses. Sentence-level suggestions were sometimes described as impersonal and not very natural:
“It sounds a bit automated or kind of general sounding, but so do most politicians”
“Most of the suggestions came off as impersonally and artificially uber-patriotic.”
However, participants who received message-level suggestions seemed quite impressed with the naturalness of the suggestions they received:
“It does sound like it were written by a human and is fully grammatically correct. When it got it right, there were barely any modifications needed on my end.”
“I like how empathetic and personable the system is. At no point did I feel like these responses were from a machine. As such, I am curious to try the system out in my everyday life.”
“It was quick and the suggested email was similar to what I would’ve written anyways.”
A number of factors may have contributed to such differences. First, with the fixed token cap we use in our experiments, the generated suggestions may be cut short and not fully express an idea for the types of topics being discussed. Second, while the message-level suggestions have the full email as prompts, the sentence-level suggestions are generated with a more limited context and thus might be of lower quality. Future work may consider incorporating the message-level context, or even participants’ interaction history, while offering response suggestions towards specific points to further improve the quality of suggestions.
We also notice that participants in both conditions felt rather neutral about the suggestions’ ability to inspire arguments they had not thought of (Figure 7), pointing to an area for future improvement for the generation models.
How did the suggestions help? Participants who received either type of assistance said they liked that the suggestions served as starting points, as the beginning is arguably the hardest part of the writing process:
“I liked that it gave me suggestions for how to start out when I needed inspiration.”
“It was extremely handy especially when you dont know what to say or how to word your reply.”
“It is always easier to edit something than write it, even if the starting point is bad—these were solid though.”
However, as the text suggestions were presented in very different forms, participants made use of them in different ways. Participants who received sentence-level suggestions tended to find the suggestions helpful for keeping a more professional tone in their responses:
“It was easier to keep a professional and political tone, and to quickly generate generic sentences.”
“I like that it guided me to answer the letters in a professional manner.”
Participants who received message-level suggestions, in contrast, mainly commented on the usefulness of the draft as an outline for further editing:
“With a single click, I had an entire outline for an email, with minimal adjustments to be made.”
“It gave a good base outline of how to respond that I could then use to expand upon and put emphasis on things that were really important to the topic.”
Figure 9:
Figure 9: Responses written with message-level suggestions were rated as significantly more helpful than those written with sentence-level suggestions and even than responses written without AI assistance. N=500 judgements per data point. Error bars represent 95% CIs. The Y-axis indicates the mean helpfulness ratings participants rated a reply with depending on the type of reply shown in the X-axis. Replies shown with an explicit disclosure of the AI involvement are color-coded in blue.
Writers’ Agency. While participants in the sentence-level suggestions condition perceived that they retained substantial agency (Figure 8, Left), participants who received message-level suggestions tended to think that they played a lesser role in drafting the responses (Figure 8, Right). This echoes our earlier observation that participants in the sentence-level condition contributed a much higher percentage of tokens than participants receiving message-level assistance. The autonomy granted to the AI has previously been identified as a key dimension in characterizing AI-mediated communication [21]. The contrast in perceived agency we observe in our experiment further demonstrates the need to clarify the desired level of agency for writers to retain in order to design appropriate assistance systems.
Likelihood of future use. To explore potential reception from both the writers’ and the receivers’ perspectives, we asked the participants not only how willing they would be to use such a system to respond to their own emails, but also how comfortable they would be if their legislators were to use such a system to respond to them (Figure 8). We find that participants tend to feel more hesitant about the platform when they imagine themselves on the receiving end. Participants in the message-level suggestions condition expressed a rather strong willingness to use similar assistance to respond to their own emails (Figure 8, Fourth row), but they did not feel as comfortable with their legislators using such a system to reply to them (Figure 8, Last row). This suggests that beyond the effectiveness of the assistance, care must be taken in introducing and disclosing the use of such systems to people on both ends of the communication process to avoid tension and mistrust.

4.4 Readers’ perceptions

Following the main study, we conducted a follow-up study to understand how readers would perceive the replies written with the dispatch system. In the follow-up study, we took replies participants had written in the main study and asked a separate set of crowdworkers how helpful the replies were. In addition to the replies written by the main experiment participants, we also evaluated the helpfulness of message-level suggestions generated by GPT-3 as described in the main experiment, without any human editing. To this set of replies, we added a sample of generic auto-replies that legislators sent to real-world inquiries in a previous field study from the Cornell Tech Policy Institute.
We recruited 1,000 participants on Prolific [36] to evaluate these replies to legislative inquiries. We developed a mock-up of an email conversation displaying both the citizen concern that main experiment participants had responded to and a specific reply. Each participant rated one reply written with sentence-level suggestions, one reply written with message-level suggestions, one reply generated by GPT-3 without human editing, as well as one reply that was either a generic auto-reply or written by a human without AI assistance. In addition, half of the participants saw a disclosure label stating that “Elements of this reply were generated by an AI communication tool.” whenever they saw replies that had been written either by GPT-3 itself or with the help of GPT-3. For each reply, we asked participants whether they agreed with the statement “The reply is helpful and reasonable” on a 5-point Likert scale from “Disagree” to “Agree”. For the statistical analysis, we mapped their responses onto a numeric scale from 0 to 1 and conducted a linear regression analysis with human-written replies as the baseline.
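A minimal sketch of this analysis is shown below, assuming one row per helpfulness judgement with hypothetical column names; the Likert-to-numeric mapping follows the description above, and statsmodels is used for the regression.

import pandas as pd
import statsmodels.formula.api as smf

# Map the 5-point Likert responses ("Disagree" ... "Agree") onto [0, 1].
LIKERT_TO_SCORE = {1: 0.0, 2: 0.25, 3: 0.5, 4: 0.75, 5: 1.0}

df = pd.read_csv("helpfulness_judgements.csv")  # hypothetical file of follow-up ratings
df["helpful"] = df["likert_response"].map(LIKERT_TO_SCORE)

# Human-written replies (no AI assistance) serve as the reference category.
df["reply_type"] = pd.Categorical(
    df["reply_type"],
    categories=["human", "auto_reply", "sentence_level", "message_level", "gpt3_only"],
)

# "disclosed" indicates whether the AI-involvement label was shown for this reply.
model = smf.ols("helpful ~ reply_type + disclosed", data=df).fit()
print(model.summary())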
The results are shown in Figure 9. When evaluating the replies participants had written in the main study, participants in the follow-up task indicated that replies written with message-level suggestions (M=0.69, shown central in the right panel) were more helpful than those replies people had written without AI assistance (left in left panel, M = 0.57, t(973) = −5.11, p < 0.0001). Replies that were written with sentence-level suggestions (M=0.60, left in right panel) were seen as less helpful than those written with message-level suggestions and similarly helpful to those written without AI assistance. Replies that GPT-3 generated without human supervision (right in right panel) were seen as slightly more helpful than replies that people had written without AI assistance (M = 0.62, t(979) = −1.92, p = 0.054). In comparison, the generic auto-replies (right in left panel) that busy legislators may send to cope with an overwhelming volume of inquiries were rated as very unhelpful (M = 0.24, t(942) = 16.2, p < 0.0001). Explicitly disclosing the involvement of AI in the reply generation (shown in blue) may have reduced the perceived helpfulness of replies generated with message-level suggestions (M = 0.65, t(998) = 1.75, p = 0.08) and of replies autonomously generated by GPT-3 (M = 0.56, t(995) = 2.67, p = 0.07). However, even when the AI involvement was explicitly disclosed, replies written with message-level suggestions were seen as significantly more helpful than replies written with sentence-level suggestions.

4.5 Characteristics of the responses

While we have discussed the effects of suggestion type on the writing process, we are also interested in how the suggestion type affects the final written product itself. In particular, we investigate three aspects:
Length. We compare the number of words in the responses under different assistance conditions. We find that participants in the control condition produced the longest responses, averaging 115.8 words. Responses from the sentence-level suggestions condition and the message-level suggestions condition were both significantly shorter than responses from the control condition, at an average of 90.5 words (t(238) = −4.94, p < 0.001) and 88.0 words (t(238) = −4.29, p < 0.001) respectively.11 This is counter-intuitive, as one might expect participants with access to suggestions to write more. The reasons for this result are unclear, but one possibility is that participants in the message-level suggestions condition anchored strongly to the length of the generated suggestions and were less likely to add more content. Participants in the sentence-level suggestions condition, in turn, might have expended additional effort deciding where to trigger suggestions and choosing which suggestions to use, leaving them less time for writing.12
Grammaticality. We compared the grammaticality of responses written under the different conditions. To do this, we computed the error rate of each response as the number of grammatical errors divided by the number of words in the response. Following prior work [31], we used LanguageTool to identify grammatical errors in the responses.13 Similar to previous studies [14, 31], we find that responses from the message-level suggestions condition have the lowest error rate, averaging 0.158 errors per word. Responses from the sentence-level condition have a slightly higher average error rate, at 0.165 errors per word. The purely human-written control responses have the highest error rate, 0.176 errors per word, which is significantly higher than both the sentence-level condition (t(238) = 2.64, p < 0.01) and the message-level condition (t(238) = 4.31, p < 0.001).
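The error-rate metric can be computed with the language_tool_python wrapper referenced in footnote 13; the short sketch below is a minimal version in which whitespace tokenization for the word count is a simplifying assumption.

import language_tool_python

tool = language_tool_python.LanguageTool("en-US")

def error_rate(text: str) -> float:
    """Number of LanguageTool matches per word in a response."""
    n_errors = len(tool.check(text))
    n_words = len(text.split())
    return n_errors / n_words if n_words else 0.0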
Vocabulary diversity. Vocabulary diversity is a proxy for how engaging or interesting the responses are. To measure it, we use the distinct-2 score [32], i.e., the number of unique bigrams divided by the total number of words in the response. NLP model generations tend to be less diverse than human-written text [32, 47], which is reflected in our results: the control responses have higher distinct-2 scores (M = 0.944) than both the responses from the sentence-level suggestions condition (M = 0.928) and those from the message-level suggestions condition (M = 0.934), although only the difference between the control responses and the responses from the sentence-level suggestions condition is significant (t(238) = 2.91, p < 0.01).
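For reference, the distinct-2 score described above can be computed as in the following minimal sketch; whitespace tokenization is again a simplifying assumption.

def distinct_2(text: str) -> float:
    """Number of unique bigrams divided by the total number of words."""
    words = text.split()
    if len(words) < 2:
        return 0.0
    bigrams = {(words[i], words[i + 1]) for i in range(len(words) - 1)}
    return len(bigrams) / len(words)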

5 Design Implications

In this work, we explore the effects of sentence-level vs. message-level suggestions in assisting users with email communication. We observe that different forms of suggestions lead to substantially different writing processes: participants receiving message-level suggestions mostly edited the drafts presented to them, skipping the first two steps of the traditional “outline-draft-edit” process [19] that participants who received sentence-level suggestions still seemed to go through. As a result, participants who received message-level suggestions finished their responses faster, while participants who received sentence-level suggestions retained a higher sense of agency in the process.
This contrast, together with other observations from our experiment, suggests that as more technically advanced options become feasible, each with its own advantages and shortcomings, it becomes all the more important to understand the needs and specifics of the particular communication circumstance in order to design task-appropriate assistance systems. Below, we outline a few technical options that may be considered and adjusted according to the communication circumstance.
Unit of suggestion. Suggestions can be offered in units of different lengths. While we experiment with suggestions at the level of short sentences and full messages, there are intermediate forms as well, e.g., longer sentences or even paragraphs. As we have observed, this choice affects not just text entry efficiency but, more fundamentally, the writing process itself.14 The more complete a draft the suggested text offers, the more the users’ focus may shift towards editing and away from outlining and drafting, implying different degrees of delegation of the writing process. Hence, finding the appropriate suggestion unit requires finding the sweet spot tailored both to the communication topic, as different topics may take different amounts of text to fully develop and express an idea, and to the communicators’ willingness to delegate the task. For instance, prior research reports that people prefer less machine assistance when writing a birthday card to their mothers than when responding to mundane work emails [34].
Availability of options. Participants who received sentence-level suggestions were provided with five candidate options whenever they triggered a suggestion. The availability of choices could potentially allow users more flexibility and increase the chance of offering at least one useful suggestion. In fact, some of the participants receiving message-level suggestions expressed interest in receiving more candidate responses, e.g., “maybe have it give you a choice of 2 or 3 different responses”. However, reading and deciding between candidate suggestions can be distracting and can take a considerable amount of time, as we have seen earlier (Figure 5). Furthermore, offering choices is perhaps only beneficial when a set of diverse and complementary suggestions can be generated. In our experiment, the sentence-level suggestions were sometimes too similar, leading to complaints such as “some of the suggestions were repetitive, or out of context” and “The responses were too similar. Most began with ‘I agree!’”. In addition, in communication settings where we expect a relatively narrow range of possible responses, such as answering a factual question, having multiple options may not be needed or wanted.
Generation model. In this work, we used the best available model at the time (GPT-3) to generate both sentence-level and message-level suggestions, as we are primarily concerned with how humans interact with different types of assistance and we hope to make as fair a comparison as possible. Some of the challenges we observe with generating sentence-level suggestions, i.e., limited input context and limited space to fully develop an argument, are inherent to the task itself and would likely remain even if a different model were used.
Observations and limitations from our studies also point to a number of ways models could be improved to better facilitate such communication processes. For instance, the priming effects we observe—participants who receive message-level assistance write shorter messages than those in the control group because the suggestions are shorter—suggest that fine-tuning generation models to exhibit specific properties, e.g., a particular length range or a formal tone, could be helpful. Furthermore, while we have made the distinction based on party affiliation when generating suggestions, legislators can have much finer-grained differences in their policy stances and communication styles. Personalizing suggestions based on the policy stances and communication styles of legislators is another important avenue for future work.
Beyond text suggestions. While we explore assistance options involving text suggestions, communication assistance systems can help with more aspects of the writing process and offer more than merely generating suggestions for content. Quoting from our participants’ suggestions, additional assistance could range from “highlighting key points and adding blank spaces to share personal opinions and ideas” to help with outlining, “making it easier to see which text the system created and which text was typed by me” to help with reviewing, tracking “responses I made earlier about a similar topic” for future use, or providing related contextual information by generating “a quick tutorial on the subject”.
Disclosure of assistance. While we have focused on the writers’ perspective, it is important to remember that successful communication is not just about replying to all of the emails. In many cases, such as in legislator-constituent communication, it is more important to build trust and understanding between people who are communicating. In our experiment, even participants who expressed relatively strong interest in using assistance systems to reply to emails were hesitant about having their legislators use the same system. As such, if such a system were to be incorporated into staffers’ workflow, it is important to consider how to disclose and explain its use to avoid further friction and mistrust between constituents and legislators.
These decisions do not have purely technical solutions. While we attempt to lay out feasible technical options, ultimately, we hope to facilitate the communication processes, and it should be up to the communicators themselves—i.e., staffers, legislators, and constituents in the context of legislator-constituent communication—to make the important value judgments on what they feel comfortable delegating to assistance systems.

6 Conclusion

In this work, we explored two assistance options enabled by the capability of recent large language models to generate long, natural-sounding suggestions: sentence-level suggestions and message-level suggestions. To understand the trade-offs between these two types of suggestions, we conducted an online experiment via dispatch, a platform we built to simulate the scenario of staffers from legislative offices responding to constituents’ concerns. The results show that different forms of suggestions can affect the participants’ writing experience in multiple dimensions. For instance, participants receiving message-level suggestions mainly edited the suggested responses and were able to complete the task significantly faster, while participants receiving sentence-level suggestions retained a higher sense of agency and contributed more original content. We discussed the implications of our observations for designing assistance systems tailored specifically to the communication circumstance.
Different communication circumstances have different objectives and demands. Efficiency may be at the core of customer service communications, whereas developing trust and credibility is critical for legislator-constituent communication. This work provides an initial proof of concept that we hope will encourage further exploration of communication assistance systems beyond the specific domain studied here. For example, while we targeted users fluent in English, these systems could be even more beneficial to those less proficient in the language. Studying their utility and effectiveness for non-native English speakers would be a fruitful extension of this research. Recent work has demonstrated the capacity of language models to parse ideological nuance, which would be especially important in politically polarized environments where espousing the “wrong” position could alienate voters, although that dynamic was outside the scope of the current study and should be considered in follow-on research.
To conclude, in this work we shed light on the factors relevant for writing assistance systems for legislator-constituent communication. We hope our work encourages further studies towards designing task-appropriate communication assistance systems.
Acknowledgments. We thank the anonymous reviewers for their helpful comments and Nikhil Bhatt, Gloria Cai, Paul Lushenko, Meredith Moran, Tanvi Namjoshi, Shyam Raman, Aryan Valluri, and Ella White for their help with internal testing. This research was supported by a New Frontier Grant from the College of Arts and Sciences at Cornell.

A Task Details

A.1 Instructions

Instructions for how different types of suggestions can be triggered are shown below.
Sentence-level suggestions
When drafting your responses, you can trigger two types of response suggestions:
1. HIGHLIGHT a sentence in the letter and TYPE "@" in the editor to trigger suggestions that directly respond to the sentence.
2. TYPE "@" in the editor without highlighting to trigger suggestions for how to continue what you’re writing.
Message-level suggestions
To trigger a suggested reply from the AI assistant, press the Generate button under the left panel. You can then edit the generated email to your liking.

A.2 Survey questions

1. Choose the degree to which you agree with the following statements:
The system was easy to use.
The system’s suggestions sound natural.
The system’s suggestions were useful.
The system’s suggestions inspired me to include points I hadn’t thought of.
2. Choose the extent to which you agree with the following statements:
I wrote the emails.
I was able to respond to emails faster than normal.
I’m satisfied with the amount of assistance I received from the system.
I would like to respond to emails using this system in the future.
I would be comfortable with my legislator using a system like this to respond to my emails.
3. What did you like about the experience of responding to emails using the system?
4. What would you change about the system to improve your experience responding to emails?

B Additional Results

Adaptation. In both suggestion conditions, we were also interested in how participants’ use of the system changed as they drafted each message. Table 1 shows the percentage of tokens contributed by the participant over the first, second, and third replies written. In the message-level suggestions condition, participants included slightly more suggestion tokens in the first message than in the later ones, while in the sentence-level suggestions condition, suggestions were used slightly more in the second message. That said, these differences are small and suggest that participant behavior was consistent across all interactions. Future work might have each participant draft more messages to see whether any adaptation behavior emerges.
Table 1:
                 Letter Index
Condition    1                2                3
control      100.00 ± 0.00    100.00 ± 0.00    100.00 ± 0.00
sentence     51.28 ± 28.21    48.91 ± 31.06    51.10 ± 30.19
email        20.47 ± 20.86    24.25 ± 24.54    24.51 ± 23.36
Table 1: The percentage of tokens in the reply that come from the human participant across the three emails each participant wrote.
Table 2:
email id    control    message-level    sentence-level
0           100.00     20.66            41.31
1           100.00     26.81            43.96
2           100.00     7.77             49.70
3           100.00     31.90            66.07
4           100.00     11.82            38.36
5           100.00     29.71            40.40
6           100.00     17.30            45.58
7           100.00     35.92            66.69
8           100.00     21.63            49.27
9           100.00     26.99            35.96
10          100.00     20.73            54.91
11          100.00     25.69            72.92
all         100.00     24.25            49.40
Table 2: Average percentage of human-written text for each message across all the conditions. All of the control replies are human-written while almost a quarter of the message-level replies and half of all sentence-level replies are human-written.
Figure 10:
Figure 10: The average number of suggestions prompted with and without highlighting a portion of the message across the three emails each participant wrote.
However, there was a difference in the types of suggestions triggered in the sentence-level condition. When writing the final message, participants more often prompted the model for suggestions without highlighting any text (Figure 10).
Effect of message. One concern we might have is that different messages lent themselves to better suggestions. To investigate this, we looked at the percentage of human-written tokens for each of the messages across all of the conditions (Table 2). We found that overall, there was not too much variation among the ten samples of each message, with messages 3, 7, and 11 (for the sentence-level suggestions) having the most human-written tokens across both conditions.

Footnotes

1
In our context, message-level suggestion means a full draft for responding to an email.
2
We use 30 tokens before the current cursor position as the prompt.
3
Note that the two types of suggestions are generated independently, i.e., the sentence-level suggestions are not intentionally a subset of the message-level suggestions although there can be coincidental overlaps depending on how participants trigger suggestions.
7
The full set of letters is included in the Supplementary Materials.
8
Throughout, we use independent-samples t-test with Bonferroni correction. In this subsection, as we make four comparisons, a Bonferroni corrected alpha level of 0.0125 is used.
9
We recognize that the faster response time may be partially attributed to shorter replies. However, the trend remains similar even if response lengths are taken into account, i.e., if we consider time taken per word.
10
In 11 interactions, a participant queried the model twice; higher numbers of queries (three, five, and six) each occurred in only one interaction.
11
We set max_tokens to 200 for generating message-level suggestions, but the generated suggestions have, on average, 74.3 words.
12
It is also possible that the suggestions in both settings packed more information into a smaller number of words while the human writers were unnecessarily verbose.
13
We use the Python wrapper for computation: https://github.com/jxmorris12/language_tool_python.
14
In fact, prior work suggests that even the difference between word-level and phrase-level suggestions may have such effect [2].

Supplementary Material

Supplemental Materials (3544548.3581351-supplemental-materials.zip)
MP4 File (3544548.3581351-talk-video.mp4)
Pre-recorded Video Presentation

References

[1]
Sinan Aral and Dean Eckles. 2019. Protecting elections from social media manipulation. Science 365, 6456 (2019), 858–861.
[2]
Kenneth C. Arnold, Krzysztof Z. Gajos, and Adam T. Kalai. 2016. On Suggesting Phrases vs. Predicting Words for Mobile Text Composition. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology. ACM, Tokyo Japan, 603–608. https://doi.org/10.1145/2984511.2984584
[3]
Nikola Banovic, Ticha Sethapakdi, Yasasvi Hari, Anind K. Dey, and Jennifer Mankoff. 2019. The Limits of Expert Text Entry Speed on Mobile Keyboards with Autocorrect. In Proceedings of the 21st International Conference on Human-Computer Interaction with Mobile Devices and Services (Taipei, Taiwan) (MobileHCI ’19). Association for Computing Machinery, New York, NY, USA, Article 15, 12 pages. https://doi.org/10.1145/3338286.3340126
[4]
Pablo Barberá, Andreu Casas, Jonathan Nagler, Patrick J Egan, Richard Bonneau, John T Jost, and Joshua A Tucker. 2019. Who leads? Who follows? Measuring issue attention and agenda setting by legislators and the mass public using social media data. American Political Science Review 113, 4 (2019), 883–901.
[5]
Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. 610–623.
[6]
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258 (2021).
[7]
Samantha Bradshaw and Philip Howard. 2017. Troops, trolls and troublemakers: A global inventory of organized social media manipulation. (2017).
[8]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[9]
Axel Bruns. 2019. Are filter bubbles real? John Wiley & Sons.
[10]
Daniel Buschek, Benjamin Bisinger, and Florian Alt. 2018. ResearchIME: A Mobile Keyboard Application for Studying Free Typing Behaviour in the Wild. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Montreal, QC, Canada) (CHI ’18). ACM, New York, NY, USA. https://doi.org/10.1145/3173574.3173829
[11]
Daniel Buschek, Martin Zürn, and Malin Eiband. 2021. The Impact of Multiple Parallel Phrase Suggestions on Email Input and Composition Behaviour of Native and Non-Native English Writers. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI ’21). Association for Computing Machinery, New York, NY, USA, Article 732, 13 pages. https://doi.org/10.1145/3411764.3445372
[12]
Mia Xu Chen, Benjamin N. Lee, Gagan Bansal, Yuan Cao, Shuyuan Zhang, Justin Lu, Jackie Tsay, Yinan Wang, Andrew M. Dai, Zhifeng Chen, Timothy Sohn, and Yonghui Wu. 2019. Gmail Smart Compose: Real-Time Assisted Writing. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Anchorage, AK, USA) (KDD ’19). Association for Computing Machinery, New York, NY, USA, 2287–2295. https://doi.org/10.1145/3292500.3330723
[13]
Congressional Management Foundation. Retrieved in 2022. Communicating with Congress. https://www.congressfoundation.org/projects/communicating-with-congress.
[14]
Yao Dou, Maxwell Forbes, Rik Koncel-Kedziorski, Noah Smith, and Yejin Choi. 2022. Is GPT-3 Text Indistinguishable from Human Text? Scarecrow: A Framework for Scrutinizing Machine Text. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.501
[15]
Mark Dunlop and John Levine. 2012. Multidimensional Pareto Optimization of Touchscreen Keyboards for Speed, Familiarity and Improved Spell Checking. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Austin, Texas, USA) (CHI ’12). Association for Computing Machinery, New York, NY, USA, 2669–2678. https://doi.org/10.1145/2207676.2208659
[16]
Congressional Management Foundation. 2017. Handling Volume. https://www.congressfoundation.org/office-toolkit-home/improve-mail-operations-menu-item-new/handling-volume-home/terms/summary. Accessed: 2022-11-21.
[17]
Andrew Fowler, Kurt Partridge, Ciprian Chelba, Xiaojun Bi, Tom Ouyang, and Shumin Zhai. 2015. Effects of Language Modeling and Its Personalization on Touchscreen Typing Performance. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (Seoul, Republic of Korea) (CHI ’15). Association for Computing Machinery, New York, NY, USA, 649–658. https://doi.org/10.1145/2702123.2702503
[18]
Joshua Goodman, Gina Venolia, Keith Steury, and Chauncey Parker. 2002. Language modeling for soft keyboards. (Jan. 2002), 194–195. https://doi.org/10.1145/502716.502753
[19]
Francis J. Griffith and John Warriner. 1977. English Grammar and Composition: Complete Course.
[20]
Christian R. Grose, Neil Malhotra, and Robert Parks Van Houweling. 2015. Explaining Explanations: How Legislators Explain their Policy Positions and How Citizens React. American Journal of Political Science 59, 3 (2015), 724–743. https://doi.org/10.1111/ajps.12164
[21]
Jeffrey T. Hancock, Mor Naaman, and Karen Levy. 2020. AI-Mediated Communication: Definition, Research Agenda, and Ethical Considerations. Journal of Computer-Mediated Communication 25, 1 (March 2020). https://doi.org/10.1093/jcmc/zmz022
[22]
Alexander Hertel-Fernandez, Matto Mildenberger, and Leah C. Stokes. 2019. Legislative Staff and Representation in Congress. American Political Science Review 113, 1 (2019), 1–18. https://doi.org/10.1017/S0003055418000606
[23]
Po-Sen Huang, Huan Zhang, Ray Jiang, Robert Stanforth, Johannes Welbl, Jack Rae, Vishal Maini, Dani Yogatama, and Pushmeet Kohli. 2019. Reducing sentiment bias in language models via counterfactual evaluation. arXiv preprint arXiv:1911.03064 (2019).
[24]
Maurice Jakesch, Jeffrey Hancock, and Mor Naaman. 2022. Human Heuristics for AI-Generated Language Are Flawed. arXiv preprint arXiv:2206.07271 (2022).
[25]
Christina James and Michael Longé. 2000. Bringing Text Input beyond the Desktop. In CHI ’00 Extended Abstracts on Human Factors in Computing Systems (The Hague, The Netherlands) (CHI EA ’00). Association for Computing Machinery, New York, NY, USA, 49–50. https://doi.org/10.1145/633292.633324
[26]
Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Greg Corrado, Laszlo Lukacs, Marina Ganea, Peter Young, and Vivek Ramavajjala. 2016. Smart Reply: Automated Response Suggestion for Email. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD ’16). Association for Computing Machinery, New York, NY, USA, 955–964. https://doi.org/10.1145/2939672.2939801
[27]
Habibul Haque Khondker. 2011. Role of the new media in the Arab Spring. Globalizations 8, 5 (2011), 675–679.
[28]
Sarah Kreps. 2022. AI-Mediated Communication, Legislative Responsiveness, and Trust in Democratic Institutions. In Technologies of Deception.
[29]
Sarah Kreps, R Miles McCain, and Miles Brundage. 2022. All the news that’s fit to fabricate: AI-generated text as a tool of media misinformation. Journal of Experimental Political Science 9, 1 (2022), 104–117.
[30]
Per Ola Kristensson and Keith Vertanen. 2014. The inviscid text entry rate and its application as a grand goal for mobile text entry. In Proceedings of the 16th International Conference on Human-Computer Interaction with Mobile Devices & Services (Toronto, ON, Canada) (MobileHCI ’14). Association for Computing Machinery, New York, NY, USA, 335–338. https://doi.org/10.1145/2628363.2628405
[31]
Mina Lee, Percy Liang, and Qian Yang. 2022. CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 388, 19 pages. https://doi.org/10.1145/3491102.3502030
[32]
Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A Diversity-Promoting Objective Function for Neural Conversation Models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics. https://doi.org/10.18653/v1/N16-1014
[33]
Stephanie Lin, Jacob Hilton, and Owain Evans. 2021. TruthfulQA: Measuring How Models Mimic Human Falsehoods. arXiv preprint arXiv:2109.07958 (2021).
[34]
Brian Lubars and Chenhao Tan. 2019. Ask Not What AI Can Do, but What AI Should Do: Towards a Framework of Task Delegability. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019).
[35]
Debora Nozza, Federico Bianchi, and Dirk Hovy. 2021. HONEST: Measuring hurtful sentence completion in language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.
[36]
Stefan Palan and Christian Schitter. 2018. Prolific.ac: A subject pool for online experiments. Journal of Behavioral and Experimental Finance 17 (2018), 22–27.
[37]
Kseniia Palin, Anna Maria Feit, Sunjun Kim, Per Ola Kristensson, and Antti Oulasvirta. 2019. How do People Type on Mobile Devices? Observations from a Study with 37,000 Volunteers. In Proceedings of the 21st International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI ’19). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3338286.3340120
[38]
Philip Quinn and Shumin Zhai. 2016. A Cost-Benefit Study of Text Entry Suggestion Interaction. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (San Jose, California, USA) (CHI ’16). Association for Computing Machinery, New York, NY, USA, 83–88. https://doi.org/10.1145/2858036.2858305
[39]
Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446 (2021).
[40]
Nikhil Singh, Guillermo Bernal, Daria Savchenko, and Elena L. Glassman. 2022. Where to Hide a Stolen Elephant: Leaps in Creative Writing with Multimodal Machine Intelligence. ACM Transactions on Computer-Human Interaction (Feb. 2022). https://doi.org/10.1145/3511599
[41]
Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243 (2019).
[42]
The OpenGov Foundation. 2017. From Voicemails to Votes. Technical Report.
[43]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[44]
Keith Vertanen, Haythem Memmi, Justin Emge, Shyam Reyal, and Per Ola Kristensson. 2015. VelociTap: Investigating Fast Mobile Text Entry using Sentence-Based Decoding of Touchscreen Keyboard Input. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (Seoul, Republic of Korea) (CHI ’15). Association for Computing Machinery, New York, NY, USA, 659–668. https://doi.org/10.1145/2702123.2702135
[45]
Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. 2021. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359 (2021).
[46]
Laura Weidinger, Jonathan Uesato, Maribeth Rauh, Conor Griffin, Po-Sen Huang, John Mellor, Amelia Glaese, Myra Cheng, Borja Balle, Atoosa Kasirzadeh, et al. 2022. Taxonomy of risks posed by language models. In 2022 ACM Conference on Fairness, Accountability, and Transparency. 214–229.
[47]
Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2020. Neural Text Generation with Unlikelihood Training. arXiv preprint arXiv:1908.04319 (2020).
[48]
Sophie Wodzak. 2022. Can a Standardized Test Actually Write Itself?
[49]
Ann Yuan, Andy Coenen, Emily Reif, and Daphne Ippolito. 2022. Wordcraft: Story Writing With Large Language Models. In 27th International Conference on Intelligent User Interfaces (IUI ’22). Association for Computing Machinery, New York, NY, USA, 841–852. https://doi.org/10.1145/3490099.3511105
[50]
Ekaterina Zhuravskaya, Maria Petrova, and Ruben Enikolopov. 2020. Political effects of the internet and social media. Annual Review of Economics 12 (2020), 415–438.
