The technology aspect of writing assistants considers the advancements that underpin the intelligence and capabilities of these systems. We aim to describe the end-to-end process of developing the underlying models that power writing assistants, covering learning problem formulation, data properties, modeling techniques, evaluation methodologies, and large-scale deployment considerations, all of which play a crucial role in determining the quality and degree of intelligence of these systems.
4.3.1 Dimensions and Codes.
Figure 1 (“Technology”) shows technology dimensions in a broad context, while Tables 3 and 4 list all dimensions, codes, and definitions.
Data - Source:
Who is the creator of the data used to train or adapt a model? The source of the data used to develop a system or train a model can have a direct effect on the system’s overall performance and reliability. A dataset can be authored by experts who have domain knowledge of the specific downstream task [3, 128, 255, 273], or by users of the system during their interaction with the writing assistant [9, 27, 177, 252]. However, due to the difficulty of recruiting real experts and users, many researchers resort to crowdworkers to create data or annotate data entries [35, 130, 251, 272]. Sometimes, the authors themselves participate in the preparation and annotation of the dataset [125, 193, 195, 269]. Recently, we see more datasets generated by a machine [105, 123, 196, 227], which has the advantage of being relatively cheap and fast to produce at scale compared to human-generated datasets. Finally, there are other types of creators, such as non-experts, unspecified individuals, or a broad set of creators (e.g., in the case of web-crawled data) [13, 225, 265, 274].
Data - Size:
What is the size of the dataset used to train or adapt a model? Depending on the size of the dataset required to train or adapt (e.g., fine-tune or prompt) a model, there can be a huge overhead in terms of data collection. While some models can be developed using very small datasets (between 1 and 100 examples) [10, 85, 156], others require much more data. If training needs more data (roughly 100 to a few thousand examples), which is often the case for fine-tuned models, we categorize the dataset as medium [37, 97, 252, 253, 268]. For larger datasets (around tens of thousands of examples), we use large [56, 204, 254]. For models that undergo extensive large-scale pre-training, we categorize the data used in this process as extremely large to indicate a dataset of millions of examples [43, 220, 225, 273] or more. We also include unknown if the paper did not explicitly mention the dataset used for training [178, 190, 235].
Model - Type:
What is the type of the underlying model? Advancements in AI accelerators and the availability of large amounts of data have led to an evolution in model architectures, which we capture as the following four types. First, rule-based models rely on pre-defined logic, lookup tables, regular expressions, or other similar heuristic approaches that are deterministic in nature [10, 29, 90, 218]. For statistical machine learning (ML) models, we consider models that are trained from scratch on historical data, are not necessarily “deep” (as in deep neural networks), and are used to make future predictions (e.g., support vector machines and logistic regression) [111, 121, 193, 268]. Over the past decade, deep neural networks have been the popular models of choice for writing assistants, including recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) [7, 47, 208, 259]. Finally, recent works have increasingly utilized foundation models, such as BERT [61], RoBERTa [157], GPT [23, 198, 199], and T5 [200], to name a few. A foundation model is “any model that is trained on broad data that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks” [17]. These models can perform a wide range of tasks out of the box, learn from a few examples to provide tailored support to users, and be further fine-tuned for specific downstream task(s) [146, 220, 251, 255].
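To make the first two model types concrete, the sketch below contrasts a deterministic rule (a regular expression that flags doubled words) with a statistical ML classifier trained from scratch on a handful of labeled sentences. The data and task are toy examples chosen purely for illustration, not drawn from the surveyed systems.

```python
# Minimal sketch (toy data): rule-based vs. statistical ML error flagging.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Rule-based: deterministic pattern matching (here, a doubled-word check).
def rule_based_flag(sentence: str) -> bool:
    return re.search(r"\b(\w+)\s+\1\b", sentence, flags=re.IGNORECASE) is not None

# Statistical ML: a classifier trained from scratch on labeled (toy) examples.
train_texts = ["This are wrong.", "This is correct.", "He go home.", "She goes home."]
train_labels = [1, 0, 1, 0]  # 1 = contains an error, 0 = clean
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_texts, train_labels)

print(rule_based_flag("the the cat sat"))     # rule fires deterministically
print(clf.predict(["They goes to school."]))  # learned prediction on toy data
```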
Model - External Resource Access:
What additional access does the model have at inference time? Recently, models have been developed with access to additional tools or data at inference time, making them capable of providing assistance beyond the knowledge encoded in their parameters. In the case of tool, a model may access external software or third-party APIs to perform tasks like search, translation, or calculation, or even set calendar events on behalf of users [43, 178, 264, 269]. Data refers to external datasets or resources, such as information stored in a database, external knowledge repositories, or any other structured/unstructured data sources that the model might leverage to provide writing assistance [132, 220, 227, 273].
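As a rough illustration of the data code, the sketch below retrieves the most relevant entry from a small in-memory knowledge base at inference time and injects it into a prompt. The knowledge base, the query, and the generate() call the prompt would feed are all hypothetical.

```python
# Minimal sketch: retrieval from an external data source at inference time.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny in-memory stand-in for an external knowledge repository.
knowledge_base = [
    "Our style guide prefers the serial comma.",
    "Company names should be spelled out on first mention.",
    "Headings use sentence case, not title case.",
]

vectorizer = TfidfVectorizer()
kb_vectors = vectorizer.fit_transform(knowledge_base)

def retrieve(query: str, k: int = 1) -> list:
    """Return the k entries most similar to the query."""
    scores = cosine_similarity(vectorizer.transform([query]), kb_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [knowledge_base[i] for i in top]

query = "How should I format headings?"
context = retrieve(query)
prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
# generate(prompt)  # hypothetical call to the assistant's underlying model
print(prompt)
```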
Learning - Problem:
How is the writing assistance task being formulated as a learning problem? How exactly writing assistants support their users usually varies based on the learning problem their underlying models are designed to solve. Classification refers to the class of problems that require categorizing data into predefined classes based on their attributes. It is one of the most widely formulated classes of problems in writing assistants, applicable to tasks such as detecting errors in writing [77, 243] and detecting the purpose of writing revisions [123], among others. In contrast, regression problems involve the prediction of a continuous numerical value or quantity as the output instead of categorical labels or classes. This includes problems such as the prediction of the writer’s sentiments [249], the readability [128], or the emotional intensity [237] of written text as numerical ratings or scores. Structured prediction refers to a class of learning problems that involve predicting structured outputs (e.g., sequences, trees, and graphs) rather than single, isolated labels or values. Numerous works have focused on developing these approaches to make edits that improve the quality of written text during the revision stage [136, 165, 166, 167, 230]. Rewriting problems involve sequence transduction tasks, where text is transformed from one form to another while improving its quality by making it fluent, clear, readable, and coherent. These tasks are essential in various writing assistance applications, such as grammatical error correction [37, 43, 274], paraphrasing [264], or general-purpose text editing [66, 70, 201, 223], to name a few. Generation refers primarily to problems that involve the creation of new, contextually relevant, coherent, and readable text from relatively limited inputs, such as autocomplete, paraphrasing, and story generation [7, 45, 116, 217]. Retrieval problems take the input from a user as a query (e.g., keywords) and search a knowledge base or dataset for relevant information. Such problems may involve ranking the available data based on its relevance and similarity to the input but do not necessarily include the generation of new text beyond what is available in the knowledge base [37, 220, 229].
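To illustrate how closely related writing-support tasks can be cast as different learning problems, the sketch below fits a classification model (error vs. no error) and a regression model (a continuous readability score) on identical toy features; the sentences, labels, and score scale are invented for illustration.

```python
# Minimal sketch: the same features under a classification vs. regression formulation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Ridge

texts = ["He go home.", "She goes home.", "This are bad.", "This is fine."]
error_labels = [1, 0, 1, 0]          # classification target (categorical)
readability = [2.0, 4.5, 1.5, 4.0]   # regression target (continuous, invented scale)

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)

classifier = LogisticRegression().fit(X, error_labels)  # predicts a class label
regressor = Ridge().fit(X, readability)                 # predicts a numerical score

print(classifier.predict(X[:1]), regressor.predict(X[:1]))
```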
Learning - Algorithm:
How is the underlying model trained? The models used as backbones of writing assistants incorporate different training mechanisms based on the type of available data, as well as the specific downstream tasks. In supervised learning, models are trained on a labeled dataset where each input is associated with the correct output. Some of the commonly used methods include Logistic Regression, Random Forests, and Naive Bayes [16, 26, 97, 253]. Supervised learning also includes approaches such as Transfer Learning, which involves training a model on a large dataset and then fine-tuning it for a specific task or domain using a smaller dataset [68, 274]. In unsupervised learning, models are trained on unlabeled data to learn patterns and structures within the data. This approach includes techniques such as representation learning and clustering methods, to name a few [185, 254]. Self-supervised learning approaches train models on unlabeled data with a supervisory signal derived from the data itself [81]. These approaches leverage the benefits of both supervised and unsupervised learning, especially in scenarios where obtaining a large amount of labeled data is challenging. This includes pre-training objectives for large language models such as Causal Language Modeling [199] and Masked Language Modeling [61]. In reinforcement learning (RL), models learn by interacting with an environment and receiving feedback in the form of rewards. This approach is useful for tasks requiring action sequences, such as language generation and dialogue systems [225].
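The following sketch illustrates, in highly simplified form, how a self-supervised objective such as Masked Language Modeling derives its supervisory signal from unlabeled text; real pre-training pipelines use subword tokenization and more elaborate masking schemes.

```python
# Minimal sketch: building a self-supervised (MLM-style) training example.
import random

random.seed(0)

def make_mlm_example(sentence: str, mask_prob: float = 0.15):
    """Turn raw text into an (input, target) pair; no human labels are needed."""
    tokens = sentence.split()
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append("[MASK]")
            targets.append(tok)   # supervisory signal: the original token
        else:
            inputs.append(tok)
            targets.append("-")   # position ignored by the loss
    return inputs, targets

print(make_mlm_example("writing assistants help people write better text"))
```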
Learning - Training and Adaptation:
How is the underlying model being trained or adapted for a specific task at hand? The training and adaptation process is an integral part of developing an intelligent model that can perform the tasks at hand and support user needs. Before foundation models, many models were trained from scratch [83, 98, 158, 235]. With the advance of foundation models (e.g., BERT and GPT-4), the common learning paradigm has shifted to “pre-training” a large model on broad data and then “adapting” the model to a wide range of downstream tasks. One way to adapt a model is fine-tuning, where the pre-trained model is further trained on a specific dataset [13, 195, 226, 272]. Note that there are numerous variants of fine-tuning, such as transfer learning, instruction tuning, alignment tuning, prompt tuning, prefix tuning, and adapter tuning, among others. Another way to adapt a model is prompt engineering (or “prompting”), where one can simply provide a natural language description of the task (or “prompt”) [22] to guide model outputs [58, 120, 146, 172]. A prompt may include a few examples for a model to learn from (“few-shot learning” or “in-context learning”). Lastly, we can tune decoding parameters of a model to influence model outputs (e.g., changing the temperature to make outputs more or less predictable, or manipulating logit bias to prevent some words from being generated) [88, 146, 218].
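As a concrete example of tuning decoding parameters, the sketch below shows how temperature and logit bias reshape a next-token distribution; the vocabulary and logits are made up for illustration.

```python
# Minimal sketch: effect of temperature and logit bias on next-token probabilities.
import numpy as np

vocab = ["good", "great", "terrible", "fine"]
logits = np.array([2.0, 1.5, 0.5, 1.0])  # hypothetical model scores for the next token

def next_token_probs(logits, temperature=1.0, logit_bias=None):
    biased = logits.astype(float).copy()
    if logit_bias:  # e.g., {"terrible": -100.0} effectively bans the word
        for word, bias in logit_bias.items():
            biased[vocab.index(word)] += bias
    scaled = biased / temperature        # lower temperature -> sharper, more predictable
    exp = np.exp(scaled - scaled.max())  # numerically stable softmax
    return exp / exp.sum()

print(next_token_probs(logits, temperature=0.5))
print(next_token_probs(logits, temperature=1.0, logit_bias={"terrible": -100.0}))
```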
Evaluation - Evaluator:
Who evaluates the quality of model outputs? A core aspect of model development is its evaluation. We consider four common types of evaluators who can review and evaluate various qualities of model outputs (as opposed to writing assistants or user interactions). Automatic evaluation compares machine-generated outputs with human-generated labels or texts using aggregate statistics or syntactic and semantic measures. These include metrics like precision, recall, F-measure, and accuracy, as well as ones used in generation tasks such as BLEU [188], METEOR [142], and ROUGE [263], to name a few [3, 37, 235]. Machine-learned evaluation uses automated metrics that are themselves produced by a machine-learned system. These are typically classification or regression models that are trained to evaluate the quality of model outputs [123, 196, 219, 249, 270]. On the other hand, human evaluation corresponds to evaluating the system with human annotators who either directly interact with, or evaluate the output of, a writing assistant. Some evaluations may require judging task-specific criteria (e.g., evaluating that certain entity names appear correctly in the text [173]), while others can be generalized to most text generation tasks (e.g., evaluating the fluency or grammar of the generated text [156, 163, 247, 265]). Human-machine evaluation captures cases where both machine-learned metrics or models and human judges are involved in the evaluation of the outputs. This hybrid evaluation is particularly relevant in co-creative, mixed-initiative writing assistance settings. Such studies often involve expert users and participatory methodologies [45, 146, 154, 172].
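To ground the automatic evaluation code, the sketch below computes a simplified unigram-overlap F-measure between a machine-generated candidate and a human reference; in practice, established implementations of BLEU, METEOR, or ROUGE would be used instead.

```python
# Minimal sketch: a simplified overlap-based automatic metric (not official BLEU/ROUGE).
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram overlap F-measure between a model output and a human reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("the cat sat on the mat", "a cat sat on a mat"))
```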
Evaluation - Focus:
What is the focus of evaluation when evaluating individual model outputs? Evaluating (or benchmarking) models has been a long-standing challenge in NLP [152]. In particular, as we increasingly use foundation models (e.g., GPT-4) for a wide range of downstream tasks, it is difficult to evaluate the quality of model outputs across all tasks, let alone handle the difficulty of evaluating open-ended generation. Here, we highlight four common evaluation foci relevant to writing assistants in the literature. Linguistic quality focuses on the grammatical correctness, readability, clarity, and accuracy of the model’s outputs. This aspect ensures that the outputs are not only correct in terms of language use but also easily understandable and precise in conveying the intended message [58, 128, 251, 273]. Controllability assesses how well the model’s outputs reflect constraints (or control inputs) specified by users or designers, for instance, how effectively the model adheres to a specific level of formality or writing style [120, 195, 217, 238]. Furthermore, it is crucial that the model’s responses not only make sense in isolation but also fit seamlessly within the broader context of the text. Style & adequacy pertains to the alignment between the model’s outputs and their surrounding texts or contexts. This includes evaluating the stylistic and semantic coherence, relevance, and consistency of the outputs with the given context [84, 160, 178, 265]. Finally, ethics encompasses a range of crucial considerations such as bias, toxicity, factuality, and transparency. Ethics focuses on the adherence of the model’s outputs to social norms and ethical standards, and seeks to avoid generating outputs that contain harmful biases, misinformation, or other unethical elements [15, 105, 193, 227]. This aspect of evaluation is particularly critical in maintaining the trustworthiness and societal acceptance of the model.
Scalability:
What are the economic and computational considerations for training and using models? Recent models, especially LMs, have demonstrated exceptional performance across various tasks [25, 40, 184]. However, the significantly large size of these models has substantially increased the cost of their development [127]. In this regard, directly utilizing pre-trained LMs via prompting [23, 257] or employing efficient fine-tuning methods like low-rank adaptation [109] and prefix-tuning [151] can help avoid the cost of full fine-tuning. During deployment, large model size affects not only inference costs but also latency, which often degrades the user experience [30, 147]. Techniques such as quantization [87] and knowledge distillation [103] have shown promising results in addressing these issues.
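As a rough sketch of why parameter-efficient methods reduce cost, the example below mirrors the structure of a low-rank adaptation update with toy shapes: the large weight matrix stays frozen, and only a small low-rank correction is trained. This illustrates the idea rather than the reference implementation.

```python
# Minimal sketch: low-rank adaptation idea with toy dimensions.
import numpy as np

d, r = 1024, 8                       # hidden size and low rank (r << d)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pre-trained weight matrix
A = rng.normal(size=(r, d)) * 0.01   # small trainable matrix
B = np.zeros((d, r))                 # small trainable matrix, initialized to zero

def adapted_forward(x):
    # Original projection plus the low-rank correction B @ A.
    return x @ W.T + x @ (B @ A).T

x = rng.normal(size=(1, d))
print(adapted_forward(x).shape)                           # (1, 1024)
print(f"trainable: {A.size + B.size} vs full: {W.size}")  # 16384 vs 1048576
```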