The technology aspect of writing assistants considers the advancements that underpin the intelligence and capabilities of these systems. We aim to describe the end-to-end process of developing the underlying models that power writing assistants, covering learning problem formulation, data properties, modeling techniques, evaluation methodologies, and large-scale deployment considerations, all of which play a crucial role in determining the quality and degree of intelligence of these systems.
4.3.1 Dimensions and Codes.
Figure 1 (“Technology”) shows technology dimensions in a broad context, while Tables 3 and 4 list all dimensions, codes, and definitions.
Data - Source:
Who is the creator of the data used to train or adapt a model? The source of the data used to develop a system or train a model can have a direct effect on the system’s overall performance and reliability. A dataset can be authored by experts who have domain knowledge of the specific downstream task [3, 128, 255, 273], or by users of the system during their interaction with the writing assistant [9, 27, 177, 252]. However, due to the difficulty of recruiting real experts and users, many researchers resort to crowdworkers to create data or annotate data entries [35, 130, 251, 272]. Sometimes, the authors themselves participate in the preparation and annotation of the dataset [125, 193, 195, 269]. Recently, we see more datasets generated by a machine [105, 123, 196, 227], which has the advantage of being relatively cheap and fast to produce at scale compared to human-generated datasets. Finally, there are other types of creators, such as non-experts, unspecified individuals, or a broad set of creators (e.g., in the case of web-crawled data) [13, 225, 265, 274].
Data - Size:
What is the size of the dataset used to train or adapt a model? Depending on the size of the dataset required to train or adapt (e.g., fine-tune or prompt) a model, there can be a huge overhead in terms of data collection. While some models can be developed using very small datasets (between 1 and 100 examples) [10, 85, 156], others require much more data. If training needs more data (roughly 100 to a few thousand examples), which is often the case for fine-tuned models, we categorize the dataset as medium [37, 97, 252, 253, 268]. For larger datasets (around tens of thousands of examples), we use large [56, 204, 254]. For models that undergo extensive large-scale pre-training, we categorize the data used in this process as extremely large to indicate a dataset of millions of examples [43, 220, 225, 273] or more. We also include unknown if the paper did not explicitly mention the dataset used for training [178, 190, 235].
Model - Type:
What is the type of the underlying model? Advancements in AI accelerators and the availability of large amounts of data have led to an evolution in model architectures, which we capture as the following four types. First, rule-based models rely on pre-defined logic, lookup tables, regular expressions, or other similar heuristic approaches that are deterministic in nature [10, 29, 90, 218]. For statistical machine learning (ML) models, we consider models that are trained from scratch on historical data, are not necessarily “deep” (as in deep neural networks), and are used to make future predictions (e.g., support vector machines and logistic regression) [111, 121, 193, 268]. Over the past decade, deep neural networks have been the popular models of choice for writing assistants, including recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) [7, 47, 208, 259]. Finally, recent works have increasingly utilized foundation models, such as BERT [61], RoBERTa [157], GPT [23, 198, 199], and T5 [200], to name a few. A foundation model is “any model that is trained on broad data that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks” [17]. These models can perform a wide range of tasks out of the box, learn from a few examples to provide tailored support to users, and be further fine-tuned for specific downstream task(s) [146, 220, 251, 255].
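To make the first two model types concrete, the sketch below contrasts a deterministic rule (a regular expression that flags doubled words) with a statistical ML classifier trained from scratch on a handful of labeled sentences. The data and task are toy examples chosen purely for illustration, not drawn from the surveyed systems.

```python
# Minimal sketch (toy data): rule-based vs. statistical ML error flagging.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Rule-based: deterministic pattern matching (here, a doubled-word check).
def rule_based_flag(sentence: str) -> bool:
    return re.search(r"\b(\w+)\s+\1\b", sentence, flags=re.IGNORECASE) is not None

# Statistical ML: a classifier trained from scratch on labeled (toy) examples.
train_texts = ["This are wrong.", "This is correct.", "He go home.", "She goes home."]
train_labels = [1, 0, 1, 0]  # 1 = contains an error, 0 = clean
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_texts, train_labels)

print(rule_based_flag("the the cat sat"))     # rule fires deterministically
print(clf.predict(["They goes to school."]))  # learned prediction on toy data
```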
Model - External Resource Access:
What additional access does the model have at inference time? Recently, models have been developed with access to additional tools or data at inference time, making them capable of providing assistance beyond the knowledge encoded in their parameters. In the case of tool, a model may access external software or third-party APIs to perform tasks like search, translation, or calculation, or even set calendar events on behalf of users [43, 178, 264, 269]. Data refers to external datasets or resources, such as information stored in a database, external knowledge repositories, or any other structured/unstructured data sources that the model might leverage to provide writing assistance [132, 220, 227, 273].
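As a rough illustration of the data code, the sketch below retrieves the most relevant entry from a small in-memory knowledge base at inference time and injects it into a prompt. The knowledge base, the query, and the generate() call the prompt would feed are all hypothetical.

```python
# Minimal sketch: retrieval from an external data source at inference time.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny in-memory stand-in for an external knowledge repository.
knowledge_base = [
    "Our style guide prefers the serial comma.",
    "Company names should be spelled out on first mention.",
    "Headings use sentence case, not title case.",
]

vectorizer = TfidfVectorizer()
kb_vectors = vectorizer.fit_transform(knowledge_base)

def retrieve(query: str, k: int = 1) -> list:
    """Return the k entries most similar to the query."""
    scores = cosine_similarity(vectorizer.transform([query]), kb_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [knowledge_base[i] for i in top]

query = "How should I format headings?"
context = retrieve(query)
prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
# generate(prompt)  # hypothetical call to the assistant's underlying model
print(prompt)
```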
Learning - Problem:
How is the writing assistance task being formulated as a learning problem? How exactly writing assistants support their users usually varies based on the learning problem their underlying models are designed to solve. Classification refers to the class of problems that require categorizing data into predefined classes based on their attributes. It is one of the most widely formulated classes of problems in writing assistants, applicable to tasks such as detecting errors in writing [77, 243] and detecting the purpose of writing revisions [123], among others. In contrast, regression problems involve the prediction of a continuous numerical value or quantity as the output instead of categorical labels or classes. This includes problems such as the prediction of the writer’s sentiments [249], the readability [128], or the emotional intensity [237] of written text as numerical ratings or scores. Structured prediction refers to a class of learning problems that involve predicting structured outputs (e.g., sequences, trees, and graphs) rather than single, isolated labels or values. Numerous works have focused on developing these approaches to make edits that improve the quality of written text during the revision stage [136, 165, 166, 167, 230]. Rewriting problems involve sequence transduction tasks, where text is transformed from one form to another while improving its quality by making it fluent, clear, readable, and coherent. These tasks are essential in various writing assistance applications, such as grammatical error correction [37, 43, 274], paraphrasing [264], or general-purpose text editing [66, 70, 201, 223], to name a few. Generation refers primarily to problems that involve the creation of new, contextually relevant, coherent, and readable text from relatively limited inputs, such as autocomplete, paraphrasing, and story generation [7, 45, 116, 217]. Retrieval problems take the input from a user as a query (e.g., keywords) and search a knowledge base or dataset for relevant information. Such problems may involve ranking the available data based on its relevance and similarity to the input but do not necessarily include the generation of new text beyond what is available in the knowledge base [37, 220, 229].
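To illustrate how closely related writing-support tasks can be cast as different learning problems, the sketch below fits a classification model (error vs. no error) and a regression model (a continuous readability score) on identical toy features; the sentences, labels, and score scale are invented for illustration.

```python
# Minimal sketch: the same features under a classification vs. regression formulation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Ridge

texts = ["He go home.", "She goes home.", "This are bad.", "This is fine."]
error_labels = [1, 0, 1, 0]          # classification target (categorical)
readability = [2.0, 4.5, 1.5, 4.0]   # regression target (continuous, invented scale)

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)

classifier = LogisticRegression().fit(X, error_labels)  # predicts a class label
regressor = Ridge().fit(X, readability)                 # predicts a numerical score

print(classifier.predict(X[:1]), regressor.predict(X[:1]))
```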
Learning - Algorithm:
How is the underlying model trained? The models used as backbones of writing assistants incorporate different training mechanisms based on the type of available data, as well as the specific downstream tasks. In supervised learning, models are trained on a labeled dataset where each input is associated with the correct output. Some of the commonly used methods include Logistic Regression, Random Forests, and Naive Bayes [16, 26, 97, 253]. Supervised learning also includes approaches such as Transfer Learning, which involves training a model on a large dataset and then fine-tuning it for a specific task or domain using a smaller dataset [68, 274]. In unsupervised learning, models are trained on unlabeled data to learn patterns and structures within the data. This approach includes techniques such as representation learning and clustering methods, to name a few [185, 254]. Self-supervised learning approaches train models on unlabeled data with a supervisory signal derived from the data itself [81]. These approaches leverage the benefits of both supervised and unsupervised learning, especially in scenarios where obtaining a large amount of labeled data is challenging. This includes pre-training objectives for large language models such as Causal Language Modeling [199] and Masked Language Modeling [61]. In reinforcement learning (RL), models learn by interacting with an environment and receiving feedback in the form of rewards. This approach is useful for tasks requiring action sequences, such as language generation and dialogue systems [225].
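The following sketch illustrates, in highly simplified form, how a self-supervised objective such as Masked Language Modeling derives its supervisory signal from unlabeled text; real pre-training pipelines use subword tokenization and more elaborate masking schemes.

```python
# Minimal sketch: building a self-supervised (MLM-style) training example.
import random

random.seed(0)

def make_mlm_example(sentence: str, mask_prob: float = 0.15):
    """Turn raw text into an (input, target) pair; no human labels are needed."""
    tokens = sentence.split()
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append("[MASK]")
            targets.append(tok)   # supervisory signal: the original token
        else:
            inputs.append(tok)
            targets.append("-")   # position ignored by the loss
    return inputs, targets

print(make_mlm_example("writing assistants help people write better text"))
```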
Learning - Training and Adaptation:
How is the underlying model being trained or adapted for a specific task at hand? The training and adaptation process is an integral part of developing an intelligent model that can perform the tasks at hand and support user needs. Before foundation models, many models were trained from scratch [83, 98, 158, 235]. With the advance of foundation models (e.g., BERT and GPT-4), the common learning paradigm has shifted to “pre-training” a large model on broad data and then “adapting” the model to a wide range of downstream tasks. One way to adapt a model is fine-tuning, where the pre-trained model is further trained on a specific dataset [13, 195, 226, 272]. Note that there are numerous variants of fine-tuning, such as transfer learning, instruction tuning, alignment tuning, prompt tuning, prefix tuning, and adapter tuning, among others. Another way to adapt a model is prompt engineering (or “prompting”), where one can simply provide a natural language description of the task (or “prompt”) [22] to guide model outputs [58, 120, 146, 172]. A prompt may include a few examples for a model to learn from (“few-shot learning” or “in-context learning”). Lastly, we can tune decoding parameters of a model to influence model outputs (e.g., changing the temperature to make outputs more or less predictable, or manipulating logit bias to prevent some words from being generated) [88, 146, 218].
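As a concrete example of tuning decoding parameters, the sketch below shows how temperature and logit bias reshape a next-token distribution; the vocabulary and logits are made up for illustration.

```python
# Minimal sketch: effect of temperature and logit bias on next-token probabilities.
import numpy as np

vocab = ["good", "great", "terrible", "fine"]
logits = np.array([2.0, 1.5, 0.5, 1.0])  # hypothetical model scores for the next token

def next_token_probs(logits, temperature=1.0, logit_bias=None):
    biased = logits.astype(float).copy()
    if logit_bias:  # e.g., {"terrible": -100.0} effectively bans the word
        for word, bias in logit_bias.items():
            biased[vocab.index(word)] += bias
    scaled = biased / temperature        # lower temperature -> sharper, more predictable
    exp = np.exp(scaled - scaled.max())  # numerically stable softmax
    return exp / exp.sum()

print(next_token_probs(logits, temperature=0.5))
print(next_token_probs(logits, temperature=1.0, logit_bias={"terrible": -100.0}))
```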
Evaluation - Evaluator:
Who evaluates the quality of model outputs? A core aspect of model development is its evaluation. We consider four common types of evaluators who can review and evaluate various qualities of model outputs (as opposed to writing assistants or user interactions). Automatic evaluation compares machine-generated outputs with human-generated labels or texts using aggregate statistics or syntactic and semantic measures. These include metrics like precision, recall, F-measure, and accuracy, as well as ones used in generation tasks such as BLEU [188], METEOR [142], and ROUGE [263], to name a few [3, 37, 235]. Machine-learned evaluation uses automated metrics that are themselves produced by a machine-learned system. These are typically classification or regression models that are trained to evaluate the quality of model outputs [123, 196, 219, 249, 270]. On the other hand, human evaluation corresponds to evaluating the system with human annotators who either directly interact with, or evaluate the output of, a writing assistant. Some evaluations may require judging task-specific criteria (e.g., evaluating that certain entity names appear correctly in the text [173]), while others can be generalized to most text generation tasks (e.g., evaluating the fluency or grammar of the generated text [156, 163, 247, 265]). Human-machine evaluation captures cases where both machine-learned metrics or models and human judges are involved in the evaluation of the outputs. This hybrid evaluation is particularly relevant in co-creative, mixed-initiative writing assistance settings. Such studies often involve expert users and participatory methodologies [45, 146, 154, 172].
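To ground the automatic evaluation code, the sketch below computes a simplified unigram-overlap F-measure between a machine-generated candidate and a human reference; in practice, established implementations of BLEU, METEOR, or ROUGE would be used instead.

```python
# Minimal sketch: a simplified overlap-based automatic metric (not official BLEU/ROUGE).
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram overlap F-measure between a model output and a human reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("the cat sat on the mat", "a cat sat on a mat"))
```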
Evaluation - Focus:
What is the focus of evaluation when evaluating individual model outputs? Evaluating (or benchmarking) models has been a long-standing challenge in NLP [152]. In particular, as we increasingly use foundation models (e.g., GPT-4) for a wide range of downstream tasks, it is difficult to evaluate the quality of model outputs across all tasks, let alone handle the difficulty of evaluating open-ended generation. Here, we highlight four common evaluation foci relevant to writing assistants in the literature. Linguistic quality focuses on the grammatical correctness, readability, clarity, and accuracy of the model’s outputs. This aspect ensures that the outputs are not only correct in terms of language use but also easily understandable and precise in conveying the intended message [58, 128, 251, 273]. Controllability assesses how well the model’s outputs reflect constraints (or control inputs) specified by users or designers, for instance, how effectively the model adheres to a specific level of formality or writing style [120, 195, 217, 238]. Furthermore, it is crucial that the model’s responses not only make sense in isolation but also fit seamlessly within the broader context of the text. Style & adequacy pertains to the alignment between the model’s outputs and their surrounding texts or contexts. This includes evaluating the stylistic and semantic coherence, relevance, and consistency of the outputs with the given context [84, 160, 178, 265]. Finally, ethics encompasses a range of crucial considerations such as bias, toxicity, factuality, and transparency. Ethics focuses on the adherence of the model’s outputs to social norms and ethical standards, and seeks to avoid generating outputs that contain harmful biases, misinformation, or other unethical elements [15, 105, 193, 227]. This aspect of evaluation is particularly critical in maintaining the trustworthiness and societal acceptance of the model.
Scalability:
What are the economic and computational considerations for training and using models? Recent models, especially LMs, have demonstrated exceptional performance across various tasks [25, 40, 184]. However, the significantly large size of these models has substantially increased the cost of their development [127]. In this regard, directly utilizing pre-trained LMs via prompting [23, 257] or employing efficient fine-tuning methods like low-rank adaptation [109] and prefix-tuning [151] can help avoid the cost of full fine-tuning. During deployment, large model size affects not only inference costs but also latency, which often degrades the user experience [30, 147]. Techniques such as quantization [87] and knowledge distillation [103] have shown promising results in addressing these issues.
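As a rough sketch of why parameter-efficient methods reduce cost, the example below mirrors the structure of a low-rank adaptation update with toy shapes: the large weight matrix stays frozen, and only a small low-rank correction is trained. This illustrates the idea rather than the reference implementation.

```python
# Minimal sketch: low-rank adaptation idea with toy dimensions.
import numpy as np

d, r = 1024, 8                       # hidden size and low rank (r << d)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pre-trained weight matrix
A = rng.normal(size=(r, d)) * 0.01   # small trainable matrix
B = np.zeros((d, r))                 # small trainable matrix, initialized to zero

def adapted_forward(x):
    # Original projection plus the low-rank correction B @ A.
    return x @ W.T + x @ (B @ A).T

x = rng.normal(size=(1, d))
print(adapted_forward(x).shape)                           # (1, 1024)
print(f"trainable: {A.size + B.size} vs full: {W.size}")  # 16384 vs 1048576
```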