1 Introduction
Traditional communication assistance systems have generally focused on short suggestions to improve input efficiency. With the emergence of large language models such as GPT-3 [8], it has become possible to generate significantly longer, natural-sounding text suggestions, opening up opportunities to design more advanced writing assistance that helps humans with more complex tasks in more substantial ways [31, 48]. Such assistance can be especially helpful in communication scenarios in which a single point of contact needs to manage large volumes of correspondence, e.g., customer service representatives addressing customers’ queries, professors attending to students’ emails, and elected officials responding to their constituents’ concerns.
The generative capabilities of current models enable a wide range of possibilities for designing assistance systems. In this work, we explore two writing assistant design choices, namely sentence-level and message-level text suggestions, and empirically analyze the trade-offs between them in email communication.
We consider the practical scenario of staffers in legislators’ offices responding to vast amounts of constituents’ concerns as the context for our study. In this context, the volume of correspondence can become overwhelming, making intelligent assistance especially needed [13, 42]. At the same time, the high-stakes nature of political communication calls for more careful and comprehensive research to better understand the potential benefits and risks of any type of technical assistance that may be introduced to the existing workflow.
To advance our understanding of different assistance options, we develop dispatch (Section 3), an application that serves as a platform to simulate the process of a staffer responding to constituents’ emails, allowing us to design and set up an online experiment to test different types of suggestions (Section 4). We recruited 120 participants to act as staffers from legislators’ offices and respond to three emails expressing constituents’ concerns under three different conditions: 40 participants received no assistance, 40 received sentence-level suggestions, and 40 received message-level suggestions, with both types of suggestions generated by GPT-3.
By observing participants’ interactions with the text suggestions and surveying their perceptions of the assistance they received, we compare how sentence-level and message-level suggestions affect the participants’ writing experience as well as the responses eventually produced. The results show that participants who received message-level suggestions generally found the suggested drafts natural and mainly edited on top of them; they finished their responses significantly faster, demonstrating increased efficiency, and were generally more satisfied with the assistance they received. In comparison, participants who received sentence-level assistance still had to plan the key points to cover in their responses while deciding where to trigger and use suggestions. Although they retained a higher sense of agency, they spent more time on the task and reported lower levels of satisfaction, demonstrating the challenging nature of designing assistance systems with finer-grained control. This discrepancy points to the need to take receivers’ perspectives into account when designing and introducing assistance systems into workflows in communication circumstances where trust is highly valued. We discuss the implications of our experimental results for designing task-appropriate communication assistance systems (Section 5).
3 Designing Dispatch
To understand the trade-offs between message-level and sentence-level suggestions, we built dispatch, a platform to simulate the scenario of staffers from legislative offices responding to constituents’ concerns. The basic dispatch interface is shown in Figure 1. A letter from a constituent is displayed on the left, while an editor for drafting the response is provided on the right. The editor supports all typical actions for writing, e.g., typing, deleting, and cursor movements.
On top of this basic interface, we build two different versions of dispatch, one that offers sentence-level suggestions and another that offers message-level suggestions.
Sentence-level suggestions. We allow users to trigger two types of sentence-level suggestions. First, users can receive suggestions responding to specific points raised in the constituent letter by highlighting the sentence to respond to and then typing “@” in the editor (Figure 2). Second, users can trigger suggestions that continue the text they have already written by typing “@” without highlighting any text. In both cases, users are presented with a drop-down menu displaying five candidate suggestions, from which they can select one or none.
Message-level suggestions. In the message-level suggestion interface, users can click on the “Generate” button to obtain a full response draft. The suggestion is loaded directly into the editor (Figure 3) for users to edit further.
4 System Evaluation
4.1 Experiment setup
Constituents’ letters. As proxies for constituents’ emails, we sample open letters delivered to elected officials in the United States through Resistbot, a service that advertises the ability to compose and send letters to legislators in less than two minutes. We obtain images of these open letters by retrieving tweets published by @openletterbot using the Twitter API, and we extract the contents of the letters using Python Tesseract. For each letter, we keep only the content of the letter, removing the sender’s first name and the state they are from, if applicable. In addition, we only consider letters that were sent by multiple people, to ensure that the letters are representative but not personally identifiable.
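As a minimal sketch of this extraction step, assuming the letter images have already been downloaded from @openletterbot tweets and using pytesseract as the Python interface to Tesseract; the cleanup of sender information is an illustrative assumption rather than the exact procedure used.

```python
# Sketch of the letter-extraction step: OCR each downloaded image and strip
# the sender line. The file layout and cleanup heuristic are assumptions.
from pathlib import Path

import pytesseract
from PIL import Image


def extract_letter_text(image_path: Path) -> str:
    """OCR a single open-letter image and return its raw text."""
    return pytesseract.image_to_string(Image.open(image_path))


def strip_sender_info(text: str) -> str:
    """Hypothetical cleanup: drop the final non-empty line, which in these
    letters typically contains the sender's first name and state."""
    lines = [line for line in text.splitlines() if line.strip()]
    return "\n".join(lines[:-1]) if len(lines) > 1 else text


letters = [
    strip_sender_info(extract_letter_text(path))
    for path in sorted(Path("letter_images").glob("*.png"))
]
```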
To select letters to use as prompts for participants to respond to, we consider topics that are generally relatable but not overtly polarizing. Our choice is based on two considerations. First, common concerns constitute a significant portion of the emails legislators need to address, as we observe a substantial number of near-duplicate open letters expressing similar concerns. Second, more general topics make the task manageable for participants who might not be familiar with very niche topics. While staffers would also have to respond to more specific questions, focusing on common ones is sufficient for exploring the difference between the two suggestion types. The twelve letters we select span topics such as health insurance, climate change, and COVID relief policies. The average length of the selected letters is 97 words.
Experimental conditions. In our experiment, we randomly assign participants to one of three conditions:
(1) Control: participants respond to emails in their own writing, with no response suggestions.
(2) Sentence-level suggestions: participants can trigger sentence-level suggestions, which they may accept and edit.
(3) Message-level suggestions: participants can trigger an automatically generated full email response draft, which they may then edit.
Generation model. We use GPT-3 (specifically, the text-davinci-002 model without any fine-tuning) to generate suggestions. We set max_tokens to 200 when generating full email drafts and to 20 when generating sentence-level suggestions. In both conditions, we set temperature to 0.7 and top_p to 0.96. Our application was reviewed and approved by OpenAI before we launched the experiment.
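As an illustration, the following sketch shows how suggestions with these parameters could be requested through the legacy OpenAI Completions API; the prompt templates and variable names are assumptions for illustration, not the exact prompts used in the study.

```python
# Sketch of suggestion generation with the reported decoding parameters,
# using the legacy (pre-1.0) openai Python client. Prompts are illustrative.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder


def generate_suggestions(prompt: str, message_level: bool, n: int = 1) -> list[str]:
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=200 if message_level else 20,  # full draft vs. short sentence
        temperature=0.7,
        top_p=0.96,
        n=n,
    )
    return [choice.text.strip() for choice in response.choices]


letter = "Dear Senator, I am worried about rising health insurance costs..."
highlighted = "I am worried about rising health insurance costs"

# Message-level: a single full draft conditioned on the constituent's letter.
draft = generate_suggestions(
    f"Constituent letter:\n{letter}\n\nStaffer reply:", message_level=True
)[0]

# Sentence-level: five short candidates responding to a highlighted sentence.
candidates = generate_suggestions(
    f"Write a reply sentence addressing: {highlighted}", message_level=False, n=5
)
```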
Participants. We recruit 40 participants for each experimental condition via the platform Prolific [36]. We only consider participants who are located in the United States, are fluent in English, and have listed “politics” as one of their hobbies. Each eligible participant is allowed to take part in at most one condition. We pay $5.00 for each task session based on an estimated completion time of 20 minutes. The actual completion time in each condition is shown in Figure 5. The experiment received Institutional Review Board approval from Cornell University.
4.2 Writers’ experience
To understand the participants’ writing processes, we track the overall time they spend on the task, the suggestions they trigger, and the responses they submit. Figure 4 shows sample responses annotated with participants’ interactions with the text suggestions.
Completion time. We compare the average completion time across the three experimental conditions to explore whether offering writing assistance helps participants respond to emails more efficiently (Figure 5). Participants in the message-level suggestions condition took the least time to complete the task: on average, they finished responding to all three emails in 8.53 minutes, significantly faster than both participants in the control group (M = 16.40, t(78) = −3.71, p < 0.001) and participants in the sentence-level suggestions condition (M = 15.77, t(78) = −4.30, p < 0.001). This suggests that offering drafting suggestions has the potential to help people write responses faster. However, we do not observe a significant difference between the completion times of the sentence-level suggestions condition and the control condition, potentially because the time saved from generating ideas and typing sentences was offset by the time spent choosing between suggestions, as well as the time wasted when generated suggestions were not good enough to be used. When the time taken to select suggestions is removed, the total writing time in the sentence-level condition is closer to that of the message-level suggestions condition (M = 11.93), though the difference between the sentence-level and control conditions is still not statistically significant at the Bonferroni-corrected alpha level of 0.0125 (t(78) = 2.21, p = 0.030).
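For reference, a sketch of this kind of pairwise comparison under a Bonferroni correction is shown below; the per-participant completion times are placeholders, and the number of planned comparisons (four, giving alpha = 0.0125) is an assumption inferred from the corrected threshold reported above.

```python
# Sketch of the pairwise completion-time comparisons with a Bonferroni
# correction. Placeholder data stand in for the per-participant times;
# scipy's equal-variance two-sample t-test yields df = 40 + 40 - 2 = 78.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control_times = rng.normal(16.40, 5.0, 40)   # placeholder (minutes)
sentence_times = rng.normal(15.77, 5.0, 40)  # placeholder (minutes)
message_times = rng.normal(8.53, 3.0, 40)    # placeholder (minutes)

alpha = 0.05 / 4  # Bonferroni-corrected threshold of 0.0125

for label, other in [("control", control_times), ("sentence-level", sentence_times)]:
    t_stat, p_val = stats.ttest_ind(message_times, other)
    print(f"message-level vs. {label}: t(78) = {t_stat:.2f}, p = {p_val:.4f}, "
          f"significant at corrected alpha: {p_val < alpha}")
```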
Interactions with the suggestions. A central question across the conditions is how participants used the generated suggestions. We consider each response-writing process—starting when the participant views a letter in the interface and ending when they save their reply—as one interaction. Most participants have exactly one interaction per letter; when a participant has more than one interaction for a letter, we keep the one that resulted in the saved reply. This gives 120 recorded interactions (i.e., 40 participants × 3 letters) per experimental condition.
In the message-level suggestions condition, every participant queried the model for at least one suggestion. Once a participant received a suggestion, they often stuck closely to it: on average, 75.75% of the tokens in the final replies came from the suggestions, while the other 24.25% were added by the participants (Figure 6). Furthermore, in 31 (25.8%) of the interactions, participants accepted the suggestions without editing them, and in only 2 (1.67%) did a participant choose to completely remove a suggestion and write their own response.
The sentence-level suggestions condition has more complex interaction patterns, because participants were expected to query for suggestions multiple times and in two distinct ways (either with or without highlighting text from the letter to respond to). Because of this, participants queried the model for many more suggestions: on average, 3.72 suggestions with highlighting and 2.91 without, per email. In contrast to the message-level suggestions condition, these suggestions were not used as often: participants accepted only 3.32 of them per email on average, and in 9 (8%) interactions, no suggestions were used at all. We did observe a difference in acceptance rates between queries with and without highlighting: participants accepted 60.5% of suggestions with highlighting and 36.7% of suggestions without it. Finally, participants also contributed more tokens themselves in the final responses compared to their counterparts who received message-level suggestions: only 50.60% of the tokens in the final replies originated from the suggestions they triggered (Figure 6).
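The exact token attribution comes from the interaction logs recorded by dispatch; as a rough illustration only, the following sketch approximates the share of final-reply tokens that overlap with accepted suggestions. The overlap-based heuristic and all names here are assumptions, not the study's actual logging-based measure.

```python
# Illustrative approximation of suggestion-token share in a final reply,
# based on token-level overlap with the accepted suggestions. The study
# itself derives this from editor interaction logs; this is a stand-in.
from difflib import SequenceMatcher


def suggestion_token_share(final_reply: str, accepted_suggestions: list[str]) -> float:
    reply_tokens = final_reply.split()
    suggestion_tokens = " ".join(accepted_suggestions).split()
    matcher = SequenceMatcher(None, reply_tokens, suggestion_tokens)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reply_tokens) if reply_tokens else 0.0


share = suggestion_token_share(
    "Thank you for writing to our office about climate change. We hear you.",
    ["Thank you for writing to our office about climate change."],
)
print(f"{share:.1%} of reply tokens attributed to suggestions")
```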
4.3 Writers’ perceptions
To further understand how each type of assistance is perceived by its users, at the end of the experiment we surveyed the participants about their perceptions of the suggestions they received and their level of comfort with political communication mediated by the type of AI assistance they had just experienced. The post-task survey consists of both Likert-scale questions and free-form feedback about the writing experience (see Appendix A.2 for the full list of questions).
Perceived helpfulness of the suggestions. Participants who received message-level suggestions generally agreed that the system was easy to use and that the suggestions they received were natural and useful (Figure 9, Right). However, participants in the sentence-level suggestions condition had more divergent views and did not rate the naturalness and usefulness of the suggestions as favorably (Figure 9, Left).
This contrast is also reflected in the free-form responses. Sentence-level suggestions were sometimes described as impersonal and not very natural:
“It sounds a bit automated or kind of general sounding, but so do most politicians”
“Most of the suggestions came off as impersonally and artificially uber-patriotic.”
However, participants who received message-level suggestions seemed quite impressed with the naturalness of the suggestions they received:
“It does sound like it were written by a human and is fully grammatically correct. When it got it right, there were barely any modifications needed on my end.”
“I like how empathetic and personable the system is. At no point did I feel like these responses were from a machine. As such, I am curious to try the system out in my everyday life.”
“It was quick and the suggested email was similar to what I would’ve written anyways.”
A number of factors may have contributed to these differences. First, with the fixed token cap we use in our experiments, the generated suggestions may be cut short and fail to fully express an idea for the types of topics being discussed. Second, while the message-level suggestions are prompted with the full email, the sentence-level suggestions are generated with more limited context and thus might be of lower quality. Future work may consider incorporating the message-level context, or even participants’ interaction history, when offering response suggestions for specific points, to further improve suggestion quality.
We also notice that participants in both conditions felt rather neutral about the suggestions’ ability to inspire arguments they had not thought of (Figure 9), pointing to an area of future improvement for the generation models.
How did the suggestions help? Participants who received either type of assistance expressed that they liked how the suggestions served as starting points, since the beginning is arguably the hardest part of the writing process:
“I liked that it gave me suggestions for how to start out when I needed inspiration.”
“It was extremely handy especially when you dont know what to say or how to word your reply.”
“It is always easier to edit something than write it, even if the starting point is bad—these were solid though.”
However, as the text suggestions were presented in very different forms, participants made use of them in different ways. Participants who received sentence-level suggestions tended to find the suggestions helpful for maintaining a professional tone in their responses:
“It was easier to keep a professional and political tone, and to quickly generate generic sentences.”
“I like that it guided me to answer the letters in a professional manner.”
In contrast, participants who received message-level suggestions mainly commented on the usefulness of the draft as an outline for further editing:
“With a single click, I had an entire outline for an email, with minimal adjustments to be made.”
“It gave a good base outline of how to respond that I could then use to expand upon and put emphasis on things that were really important to the topic.”
Writers’ agency. While participants in the sentence-level suggestions condition perceived that they retained substantial agency (Figure 8, Left), participants who received message-level suggestions tended to think that they played a lesser role in drafting responses (Figure 8, Right). This echoes our earlier observation that participants in the sentence-level condition contributed a much higher percentage of tokens than participants receiving message-level assistance. The autonomy granted to the AI has previously been identified as a key dimension in characterizing AI-mediated communication [21]. The contrast in perceived agency we observe in our experiment further demonstrates the need to clarify how much agency writers should retain in order to design appropriate assistance systems.
Likelihood of future use. To explore potential reception from both the writers’ and the receivers’ perspectives, we asked the participants not only how willing they would be to use such a system to respond to their own emails, but also how comfortable they would be if their legislators were to use such a system to respond to them (Figure 8). We find that participants tend to feel more hesitant about the platform when they are on the receiving end. Participants in the message-level suggestions condition expressed a rather strong willingness to use similar assistance to respond to their own emails (Figure 8, Fourth row), but they did not feel as comfortable with their legislators using such a system to reply to them (Figure 8, Last row). This suggests that beyond the effectiveness of the assistance, care must be taken in introducing and disclosing the use of such systems to people on both ends of the communication process to avoid tension and mistrust.
4.4 Readers’ perceptions
Following the main study, we conducted a follow-up study to understand how readers would perceive the replies written with the Dispatch system. In the follow-up study, we took replies participants had written in the main study and asked a separate set of crowdworkers how helpful the replies were. In addition to the replies written by the main experiment participants, we also evaluated the helpfulness of message-level suggestions generated by GPT-3, as described in the main experiment, without any human editing. To this set of replies, we added a sample of generic auto-replies that legislators sent to real-world inquiries in a previous field study by the Cornell Tech Policy Institute.
We recruited 1,000 participants on Prolific [36] to evaluate these replies to legislative inquiries. We developed a mock-up of an email conversation displaying both the citizen concern that main-experiment participants had responded to and a specific reply. Each participant rated one reply written with sentence-level suggestions, one reply written with message-level suggestions, one reply generated by GPT-3 without human editing, and one reply that was either a generic auto-reply or written by a human without AI assistance. In addition, half of the participants saw a disclosure label stating “Elements of this reply were generated by an AI communication tool.” whenever they viewed replies that had been written either by GPT-3 itself or with the help of GPT-3. For each reply, we asked participants whether they agreed with the statement “The reply is helpful and reasonable” on a 5-point Likert scale from “Disagree” to “Agree”. For the statistical analysis, we converted the responses to a numeric scale from 0 to 1 and conducted a linear regression analysis with human-written replies as the baseline.
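A sketch of this analysis is given below: the Likert ratings are rescaled to [0, 1] and regressed on reply type with human-written replies as the reference level. The column names, the disclosure covariate, and the placeholder data are all illustrative assumptions.

```python
# Sketch of the reader-rating analysis: rescale 5-point Likert ratings to
# [0, 1] and fit an OLS regression with human-written replies as baseline.
# The data frame here is randomly generated placeholder data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
ratings = pd.DataFrame({
    "likert": rng.integers(1, 6, n),  # 1 = Disagree ... 5 = Agree
    "reply_type": rng.choice(
        ["human", "sentence_level", "message_level", "gpt3_only", "auto_reply"], n
    ),
    "ai_disclosed": rng.integers(0, 2, n),
})

# Map the 1-5 Likert responses onto the 0-1 scale used in the analysis.
ratings["helpful"] = (ratings["likert"] - 1) / 4

model = smf.ols(
    "helpful ~ C(reply_type, Treatment(reference='human')) + ai_disclosed",
    data=ratings,
).fit()
print(model.summary())
```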
The results are shown in Figure 9. When evaluating the replies participants had written in the main study, participants in the follow-up task indicated that replies written with message-level suggestions (M = 0.69, shown in the center of the right panel) were more helpful than replies people had written without AI assistance (left in left panel, M = 0.57, t(973) = −5.11, p < 0.0001). Replies written with sentence-level suggestions (M = 0.60, left in right panel) were seen as less helpful than those written with message-level suggestions and as similarly helpful to those written without AI assistance. Replies that GPT-3 generated without human supervision (right in right panel) were seen as slightly more helpful than replies that people had written without AI assistance (M = 0.62, t(979) = −1.92, p = 0.054). In comparison, the generic auto-replies (right in left panel) that busy legislators may send to cope with an overwhelming volume of inquiries were rated as very unhelpful (M = 0.24, t(942) = 16.2, p < 0.0001). Explicitly disclosing the involvement of AI in the reply generation (shown in blue) may have reduced the perceived helpfulness of replies generated with message-level suggestions (M = 0.65, t(998) = 1.75, p = 0.08) and of replies autonomously generated by GPT-3 (M = 0.56, t(995) = 2.67, p = 0.07). However, even when the AI involvement was explicitly disclosed, replies written with message-level suggestions were seen as significantly more helpful than replies written with sentence-level suggestions.
4.5 Characteristics of the responses
While we have discussed the effects of suggestion type on the writing process, we are also interested in how the suggestion type affects the final written product itself. In particular, we investigate three aspects:
Length. We compare the number of words in the responses under different assistance conditions. Participants in the control condition produced the longest responses, averaging 115.8 words. Responses from the sentence-level suggestions condition and the message-level suggestions condition were both significantly shorter, averaging 90.5 words (t(238) = −4.94, p < 0.001) and 88.0 words (t(238) = −4.29, p < 0.001), respectively. This is counter-intuitive, as one might expect participants with access to suggestions to write more. The reasons for this result are unclear, but one possibility is that participants in the message-level suggestions condition anchored strongly to the length of the generated suggestions and were less likely to add more content. Participants in the sentence-level suggestions condition, meanwhile, might have expended additional effort deciding where to trigger suggestions and choosing which ones to use, leaving them less time to write.
Grammaticality. We compared the grammaticality of responses written under different conditions. To do this, we computed the error rate of each response as the number of grammatical errors divided by the number of words in the response. Following prior work [31], we used LanguageTool to identify grammatical errors. Similar to previous studies [14, 31], we find that responses from the message-level suggestions condition have the lowest error rate, averaging 0.158 errors per word. Responses from the sentence-level condition have a slightly higher average error rate of 0.165 errors per word. The purely human-written control responses have the highest error rate, 0.176 errors per word, which is significantly higher than both the sentence-level condition (t(238) = 2.64, p < 0.01) and the message-level condition (t(238) = 4.31, p < 0.001).
Vocabulary diversity. Vocabulary diversity is a proxy for how engaging or interesting the generations are; to measure it, we use the distinct-2 score [32], i.e., the number of unique bigrams divided by the total number of words in the response. NLP model generations tend to be less diverse than human-written text [32, 47], which is reflected in our results: the control responses have higher distinct-2 scores (M = 0.944) than both the responses from the sentence-level suggestions condition (M = 0.928) and those from the message-level suggestions condition (M = 0.934), although only the difference between the control responses and the sentence-level suggestions responses is significant (t(238) = 2.91, p < 0.01).
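A minimal sketch of these two metrics follows, assuming the language_tool_python package as the interface to LanguageTool and simple whitespace tokenization; the sample reply is a made-up placeholder.

```python
# Sketch of the two surface metrics: LanguageTool error rate (errors per
# word) and distinct-2 (unique bigrams / total words). Tokenization is a
# simple whitespace split, an assumption for illustration.
import language_tool_python

tool = language_tool_python.LanguageTool("en-US")


def error_rate(text: str) -> float:
    words = text.split()
    return len(tool.check(text)) / len(words) if words else 0.0


def distinct_2(text: str) -> float:
    words = text.split()
    bigrams = set(zip(words, words[1:]))
    return len(bigrams) / len(words) if words else 0.0


reply = "Thank you for reaching out about health insurance. I share your concern."
print(f"error rate: {error_rate(reply):.3f}, distinct-2: {distinct_2(reply):.3f}")
```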
5 Design Implications
In this work, we explore the effects of sentence-level vs. message-level suggestions in assisting users with email communication. We observe that the different forms of suggestions lead to substantially different writing processes: participants receiving message-level suggestions mostly edited the drafts presented to them, skipping the first two steps of the traditional “outline-draft-edit” process [19] that participants receiving sentence-level suggestions still seemed to go through. As a result, participants who received message-level suggestions finished their responses faster, while participants who received sentence-level suggestions retained a higher sense of agency in the process.
This contrast, together with other observations from our experiment, suggests that as more technically advanced options become feasible, each with its own advantages and shortcomings, it is all the more important to understand the needs and specifications of the particular communication circumstance in order to design task-appropriate assistance systems. Below, we outline a few technical options that may be considered and adjusted according to the communication circumstance.
Unit of suggestion. Suggestions can be offered in units of different lengths. While we experiment with suggestions at the level of short sentences and full messages, there are intermediate forms as well, e.g., longer sentences or even paragraphs. As we have observed, this choice affects not just text-entry efficiency but, more fundamentally, the writing process itself. The more complete a draft the suggested text offers, the more the users’ focus may shift towards editing and away from outlining and drafting, implying different degrees of delegation of the writing process. Hence, choosing the appropriate suggestion unit requires finding the sweet spot tailored both to the communication topic, as different topics may take different amounts of text to fully develop and express an idea, and to the communicators’ willingness to delegate the task. For instance, prior research reports that people prefer less machine assistance when writing a birthday card to their mothers than when responding to mundane work emails [34].
Availability of options. Participants who received sentence-level suggestions were provided with five candidate options whenever they triggered a suggestion. The availability of choices could potentially allow users more flexibility and increase the chance of offering at least one useful suggestion. In fact, some participants receiving message-level suggestions expressed interest in receiving more candidate responses (e.g., “maybe have it give you a choice of 2 or 3 different responses”). However, reading and deciding between candidate suggestions can be distracting and can take a considerable amount of time, as we have seen earlier (i.e., in Figure 5). Furthermore, offering choices is perhaps only beneficial when a set of diverse and complementary suggestions can be generated. In our experiment, the sentence-level suggestions were sometimes too similar, leading to complaints such as “some of the suggestions were repetitive, or out of context” and “The responses were too similar. Most began with ‘I agree!’”. In addition, in communication settings where we expect a relatively narrow range of possible responses, such as answering a factual question, having multiple options may not be needed or wanted.
Generation model. In this work, we used the best available model at the time (GPT-3) to generate both sentence-level and message-level suggestions, as we are primarily concerned with how humans interact with different types of assistance and hoped to make as fair a comparison as possible. Some of the challenges we observe with generating sentence-level suggestions, i.e., limited input context and limited space to fully develop an argument, are inherent to the task itself and would likely remain even if a different model were used.
Observations and limitations from our studies also point to a number of ways models could be improved to better facilitate such communication processes. For instance, the priming effects we observe—participants who receive message-level assistance write shorter messages than those in the control group because the suggestions are shorter—suggest that fine-tuning generation models to exhibit specific properties, e.g., a particular length range or a formal tone, could be helpful. Furthermore, while we have made a distinction based on party affiliation when generating suggestions, legislators can have much finer-grained differences in their policy stances and communication styles. Personalizing suggestions based on the policy stances and communication styles of individual legislators is another important avenue for future work.
Beyond text suggestions. While we explore assistance options involving text suggestions, communication assistance systems can help with more aspects of the writing process and offer more than merely generating suggestions for content. Quoting from our participants’ suggestions, additional assistance could range from “highlighting key points and adding blank spaces to share personal opinions and ideas” to help with outlining, “making it easier to see which text the system created and which text was typed by me” to help with reviewing, tracking “responses I made earlier about a similar topic” for future use, or providing related contextual information by generating “a quick tutorial on the subject”.
Disclosure of assistance. While we have focused on the writers’ perspective, it is important to remember that successful communication is not just about replying to all of the emails. In many cases, such as in legislator-constituent communication, it is more important to build trust and understanding between people who are communicating. In our experiment, even participants who expressed relatively strong interest in using assistance systems to reply to emails were hesitant about having their legislators use the same system. As such, if such a system were to be incorporated into staffers’ workflow, it is important to consider how to disclose and explain its use to avoid further friction and mistrust between constituents and legislators.
These decisions do not have purely technical solutions. While we attempt to lay out feasible technical options, ultimately, we hope to facilitate the communication processes, and it should be up to the communicators themselves—i.e., staffers, legislators, and constituents in the context of legislator-constituent communication—to make the important value judgments on what they feel comfortable delegating to assistance systems.
6 Conclusion
In this work, we explored two assistance options enabled by the ability of recent large language models to generate long, natural-sounding suggestions: sentence-level suggestions and message-level suggestions. To understand the trade-offs between these two types of suggestions, we conducted an online experiment via dispatch, a platform we built to simulate the scenario of staffers from legislative offices responding to constituents’ concerns. The results show that the form of the suggestions can affect participants’ writing experience along multiple dimensions. For instance, participants receiving message-level suggestions mainly edited the suggested responses and were able to complete the task significantly faster, while participants receiving sentence-level suggestions retained a higher sense of agency and contributed more original content. We discussed the implications of our observations for designing assistance systems tailored to the specific communication circumstance.
Different communication circumstances have different objectives and demands. Efficiency may be at the core of customer service communication, whereas developing trust and credibility is critical for legislator-constituent communication. This work provides an initial proof of concept that we hope will encourage further exploration of communication assistance systems beyond the specific domain studied here. For example, while we targeted users fluent in English, these systems could be even more beneficial to those less proficient in the language; studying their utility and effectiveness for non-native English speakers would be a fruitful extension of this research. Recent work has also demonstrated the capacity of language models to parse ideological nuance, which would be especially important in politically polarized environments where espousing the “wrong” position could alienate voters; that dynamic was outside the scope of the current study and should be considered in follow-on research.
To conclude, in this work we shed light on the factors relevant for writing assistance systems for legislator-constituent communication. We hope our work encourages further studies towards designing task-appropriate communication assistance systems.
Acknowledgments. We thank the anonymous reviewers for their helpful comments and Nikhil Bhatt, Gloria Cai, Paul Lushenko, Meredith Moran, Tanvi Namjoshi, Shyam Raman, Aryan Valluri, and Ella White for their help with internal testing. This research was supported by a New Frontier Grant from the College of Arts and Sciences at Cornell.
B Additional Results
Adaptation. In both suggestion conditions, we were also interested in how participants’ use of the system changed as they drafted each message. Table 1 shows the percentage of tokens from the suggestions and from the participant over the first, second, and third replies written. In the message-level suggestions condition, participants included slightly more suggestion tokens in the first message than in the later ones, while in the sentence-level suggestions condition, suggestions were used most in the second message. That said, these differences are small and suggest that participant behavior was consistent across interactions. Future work might have each participant draft more messages to see whether any adaptation behavior emerges.
However, there was a difference in the types of suggestions triggered in the sentence-level condition: when writing the final message, participants more often prompted the model for suggestions without highlighting any text (Figure 10).
Effect of message. One concern is that some messages may have lent themselves to better suggestions than others. To investigate this, we looked at the percentage of human-written tokens for each message across all conditions (Table 2). We found that, overall, there was not much variation among the ten samples of each message, with messages 3, 7, and 11 (for the sentence-level suggestions) having the most human-written tokens across both conditions.