From our analysis of the conversations during our semi-structured interviews, we found common steps in the users’ prompting journey with Midjourney. These steps, illustrated in Figure 1, include prompt structure, image evaluation, and prompt refinement processes. In this section, we first detail how our participants begin their prompt journey, employing different structures driven by their underlying motivations. Second, we surface the factors that influence participants’ evaluation approaches and criteria, including visual design and image content. Third, we detail how participants refine their prompts by reflecting on their evaluations and iterating through these cycles until satisfied or giving up. Lastly, we discuss two themes of challenges in participants’ prompt journey. While we present these steps independently, we observed that the users’ experience with crafting prompts was a recurring, iterative process.
4.1 Users’ Prompt Journey: Prompt Structures
When initiating prompts, our participants employed a variety of techniques to control the overall structure of prompts, which we categorized into five structures: Descriptive Sentences, Templates, Overview + Detail, Chunks, and Word Sequences. A summary of these structures, along with examples, is provided in Table 1.
4.1.1 Descriptive Sentences.
The least granular approach to structuring prompts used by our participants (n=6) involved treating prompts as formal writing and using regular descriptive sentences. P12 described their approach as just thinking of prompts "as a sentence." This strategy was favored by P3, P7, and P9 mainly for its familiarity, as it resembles everyday writing or communicating with others. P12 also mentioned, “it’s probably just me being lazy more than anything else. It’s simpler for me [...] I’m not a full-time creative, so I don’t have 8 hours a day to sit here and just experiment.” This suggests that the simplicity and ease of this approach make it a convenient and straightforward choice for structuring prompts.
4.1.2 Templates.
Three participants adopted the template structure for crafting their prompts, employing fixed keywords or phrases as a foundational framework and customizing it with their specific content. These templates acted as versatile structures that participants could adapt to different words. For instance, P2 followed their self-devised template for crafting "a full-scale art piece," using the format "[x] against [y] background or in [z] environment." Here, the "[x]", "[y]", and "[z]" placeholders were filled with specific elements to shape the visual composition of the art piece. P5, by contrast, used a template inspired by a prompt they saw on a Midjourney channel as a means of exploration, stating they "have been playing around with this concept of ’something made out of something.’" This led them to combinations like "cactus made out of candy". Finally, P19, an architect, developed a personalized template for exploring architectural variations: “[architectural piece] designed by [designer] surrounded by [environment],” facilitating thematic exploration within their domain. These diverse approaches demonstrate the versatility of the template strategy, catering to individual goals and motivations in image generation, from P2’s targeted goal to P5’s exploratory method and P19’s thematic exploration.
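To make the template strategy concrete, the minimal Python sketch below shows how fixed frames like P2’s and P19’s can be re-filled with new content. The fill values are hypothetical examples of ours; participants worked with plain prompt text by hand, not code.

```python
# Sketch of the template structure (our illustration, not the
# participants' tooling); placeholders mirror P2's and P19's templates.
p2_template = "{x} against {y} background"  # P2: "[x] against [y] background"
p19_template = "{piece} designed by {designer} surrounded by {environment}"  # P19

# Hypothetical fill values, chosen only for illustration.
print(p2_template.format(x="a glass chess set", y="a deep red"))
print(p19_template.format(piece="a concert hall",
                          designer="a Bauhaus architect",
                          environment="a pine forest"))
```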
4.1.3 Overview + Detail.
One structuring strategy frequently used by our participants is the overview + detail structure, which divides the prompt into two main parts. The first part, often shorter and referred to as the head (or overview), presents the main concept or content, while the second part, often called the tail (or detail), specifies how the main concept should be illustrated. Overview + detail is more granular than descriptive sentences and more flexible than a template, making it particularly well-suited for many of our participants (n=8). P14 described their methodical use of the structure: "you can prompt in a way where you’re kind of giving it a description and then there’s also the tail end of that where you can start putting things like certain camera settings [...] I tend to look at those as two sections."
Three participants highlighted the benefit of this structure for its adaptability and reproducibility. P2 noted, "that’s exactly how I’m replicating the same thing [style], no matter what subject I’m pushing," indicating how the detailed section of the prompt helped in honing a unique visual aesthetic. Similarly, P4 emphasized the reusability of the detail section, stating, "after you have what it is, [...] then you can get into the personality you’re trying to achieve," and further explaining that the stylistic elements "tend to be the things I always hold true to, like I repeat time and time again." Furthermore, P14 described their practice of maintaining a list of stylistic words for easy integration into the detail section: "I can just kind of copy and paste from different descriptions for different settings at the tail end of it."
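This reuse of a fixed detail section can be sketched as follows. The Python below is our illustration with hypothetical style strings; participants like P14 kept such lists as plain text to copy and paste, not as code.

```python
# Sketch of the overview + detail structure with reusable "tails",
# echoing P14's practice of pasting in saved stylistic words.
STYLE_TAILS = {
    "film":  "35mm photo, shallow depth of field, golden hour lighting",
    "anime": "cel-shaded, vibrant colors, clean line art",
}

def build_prompt(overview: str, style: str) -> str:
    # Head states the main concept; tail specifies how to render it.
    return f"{overview}, {STYLE_TAILS[style]}"

print(build_prompt("a lighthouse on a cliff at dusk", "film"))
# a lighthouse on a cliff at dusk, 35mm photo, shallow depth of field, ...
```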
P13, P14, and P16 adopted this method after observing others’ prompts. P14, in particular, was inspired by discussions in the Midjourney Discord: "going through the discord chats hearing about how people were kind of talking about this stuff and what they work with is kind of where I picked up my technique."
4.1.4 Chunks.
In our analysis of the various prompt structures used by participants, the chunk structure emerged as the one offering the highest level of control. This approach, as described by P9, allows for a multifaceted prompt construction, where different aspects such as subject, scene, colors, and style can be individually addressed: “Sometimes I’ll do multi-prompt where one part of the prompt will be like the subject and the next would be maybe the scene and then the next would maybe be the colors and maybe the style I wanted in.”
The distinct advantage of the chunk structure lies in its ability to provide a hard stop using double colons (::) and to assign an individual weight to each segment of the prompt, as mentioned by P6 and P10. These weights, assigned to each chunk, act as dials that control the AI model’s attention and interpretation. Positive weights increase the emphasis placed on a particular section, while negative weights decrease it. As P6 explained, “a double colon works as a hard stop and [...] that allows me to weigh every single phrase individually.” P4 elaborated on this, noting that each chunk acts as “the next layer of information,” thereby allowing for more precise direction and nuanced control over the AI’s interpretation process.
For example, in a prompt like "a beautiful anime cat:: with glasses::5 it has long dark brown hair::-2", "a beautiful anime cat" is treated with the default weight, meaning it has a neutral influence on the final outcome. However, the prompt instructs Midjourney to prioritize adding glasses, indicated by the positive weight of 5. Conversely, the inclusion of "long dark brown hair" becomes less likely due to the negative weight of -2. This demonstrates how strategic weighting allows for precise control over the AI model’s focus and ultimately influences the final artistic outcome.
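To make this weighting scheme concrete, the sketch below (our own Python approximation of the syntax participants described, not Midjourney’s actual parser) splits the example prompt above into (chunk, weight) pairs:

```python
import re

def parse_multiprompt(prompt, default_weight=1.0):
    """Split a Midjourney-style multi-prompt on '::' separators.

    A number immediately after '::' is the weight of the chunk that
    precedes the separator (e.g. 'with glasses::5'); chunks without an
    explicit weight keep the default.
    """
    chunks = []
    for piece in prompt.split("::"):
        # A leading number belongs to the previous chunk as its weight.
        m = re.match(r"\s*(-?\d+(?:\.\d+)?)\s*(.*)", piece, re.S)
        if m and chunks:
            chunks[-1] = (chunks[-1][0], float(m.group(1)))
            piece = m.group(2)
        if piece.strip():
            chunks.append((piece.strip(), default_weight))
    return chunks

print(parse_multiprompt(
    "a beautiful anime cat:: with glasses::5 it has long dark brown hair::-2"))
# [('a beautiful anime cat', 1.0), ('with glasses', 5.0),
#  ('it has long dark brown hair', -2.0)]
```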
4.1.5 Word Sequences.
The most granular strategy used by our participants was word sequencing: a list of comma-separated terms without a specific structural format (e.g., "futuristic clothing, rainy, foggy, cyberpunk, Neo, Tokyo”). This structure was adopted by P3, who was unsure about how to structure prompts effectively, stating, "I’m not sure of the structure of the prompts, how to do them properly. I just throw everything in." However, even within this seemingly simple structure, there remains a level of uncertainty about how much detail is required. For example, P8, after seeing extensive keyword lists in others’ prompts, skeptically noted, "some people will post up in their prompts, 1000 different renderers, like here we go [showing a long prompt], like does any of this really matter?” In contrast, P15 deliberately chose this approach based on their observation of Midjourney’s responses to more structured sentences: “Sometimes I realize even putting full sentences can mess up the AI. So, it [my prompt] is marionette, puppet, puppeteer, realistic, detailed."
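Structurally, a word sequence is simply a comma-joined list of terms, as this one-line Python sketch (ours, mirroring P15’s example) illustrates:

```python
# A word-sequence prompt is just comma-joined keywords (P15's example).
keywords = ["marionette", "puppet", "puppeteer", "realistic", "detailed"]
prompt = ", ".join(keywords)
print(prompt)  # marionette, puppet, puppeteer, realistic, detailed
```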
The choice of the Word Sequences structure was also informed by some participants’ (n=3) specific goals. This is particularly evident in composite image creation, where the focus is on targeting specific content elements, reducing the need to define relationships between them. P1 described their process, stating, “usually I put different keywords [...] and it’s not perfect, but I go into Photoshop and I clean it up.” Referring to an example composite image they created, P1 explained selectively piecing together elements, "this head with this picture", from separate images into a new composition, as opposed to constructing the full scene solely via prompting.
Moreover, this structure facilitated collaborative prompting, also known as ’prompt jamming’ (P6), where multiple users contribute to building the prompt. P6 shared an instance where a “series of words was put together by a collective of people.” This ease of contribution, requiring no understanding of complex structures, allowed users to build on existing ideas and prompts, as P9 described “people were just adding words, somebody started with all these 70-millimeter shutter speeds and, we just started adding words to it.”
4.2 Users’ Prompt Journey: Image Evaluation
Evaluating AI-generated images emerged as a standard practice among our participants, though we found minimal background influence on this process. Only four (P10, P12, P14, P17) of the study’s 19 participants reported prior backgrounds that influenced their image evaluation. For example, P10 mentioned, "being an artist myself, I know when something doesn’t look composed right or if it looks off or imbalanced," emphasizing the influence of artistic training on composition and balance. Similarly, P12’s background in photography led them to pay specific attention to depth of field, stating, "sometimes I’m going for a shallow depth of field, but most of the time I’m not."
While participants’ motivations and definitions of a successfully generated image varied, we identified two factors that influenced how participants approached image evaluation: the specificity of their objectives and the representational (i.e., "refers to art that represents something with visual references to the real world" [52]) or non-representational (i.e., "Work that does not depict anything from the real world" [52]) nature of their intended content. Importantly, these factors were not static. Participants’ goals could range from open-ended exploration to achieving specific outcomes, and their desired level of representation could shift between representational and non-representational based on their immediate needs and creative ideas. This adaptability highlights the dynamic nature of their engagement with Midjourney, where evaluation criteria were not fixed and evolved in tandem with their artistic intentions. Furthermore, we highlight the criteria, in terms of subject matter and visual design considerations, that participants often mentioned when evaluating generated images. In presenting these criteria, we view them as tools that participants employed to effectively navigate their evaluation approach. We note that the scope of our study includes a broad exploration of these criteria without linking them directly to specific objectives or content types.
4.2.1 Goal Specification: Exploration vs. composites vs. targeted goals.
Participants’ goals, ranging from open-ended exploration to specific end products, influenced their expectations during image evaluation. Participants who mentioned exploration, like P9, P10, P11, P13, P14, and P16, displayed lower expectations during evaluation since outputs were not treated as final products. These participants valued surprise from Midjourney, with P13 describing it as "happy accidents." Similarly, P11 found enjoyment in the AI’s unpredictability: "I think it makes it more fun though because then you can see what it just comes up with." Given the open-ended nature of their engagement, these participants emphasized unexpectedness over fidelity to a predetermined image.
Other participants, like P1, P6, P9, P13, and P17, utilized outputs as material for their composite works, selecting favorite individual elements from multiple images for a new, cohesive piece. This method allowed them to focus on gathering inspiring details for reassembly, rather than aiming for a complete image from the start. P1 and P17 exemplified this approach in their process, with P1 remarking, "I can still use [it] as a landscape. I plan on using it," and P17 focusing on enhancing their designs: "these beautiful little adornments that can go on top of the dresses." This approach underscores these participants’ focus on component selection and integration; as P1 noted, the ability to alter the composition later made it a secondary concern in their creative process.
Finally, some participants like P2, P3, P5, P8, P12, and P15 had very targeted outcomes in mind for either content or stylistic goals that they wanted Midjourney outputs to match. With a precise mental picture driving their prompts, expectations were higher for accuracy to predetermined criteria. For example, P5 detailed a precise vision: “You know, that’s something that if I could say something like stands on the bow of a ship, which is, you know, to the right of the frame and, you know, the character is looking back towards the ship or something. That’s something that I would like.” Such specific expectations often led to disappointment when details were missing, as participants were aiming for exact realizations of their envisioned scenes. Similarly, P2 and P12 focused on stylistic consistency, with P2 stating, "so that is my number one goal is it needs to fit within my color scheme," and P12 aiming to "have something I want essentially in my style."
4.2.2 Content Type: Representational vs. non-representational.
All participants had examples of representational content. When creating representational content, participants had specific expectations for realism and detailed accuracy. For example, P8, who worked on images of construction vehicles, expressed frustration with an output: "when I look at it, it doesn’t make any sense at all. It’s like Midjourney sort of looks at it from three different perspectives at the same time." P8’s criteria, focusing on "color, composition, and realism," highlighted a desire for outputs that closely resembled real-life objects, with realism being the ability "to get anything that actually looked like a real vehicle." Similarly, when working with fantastical themes, as P3 did with a Viking horse, the emphasis remained on recognizable features. P3’s selection criteria, "but out of all of these horses, only this one really kind of looked like a horse," underscore the expectation that even mythical or imaginative subjects should maintain a degree of recognizable features. This illustrates that, in representational art, participants valued outputs that not only resonated with their vision but also convincingly mirrored physical characteristics or believably portrayed imagined concepts.
Conversely, in their exploration of non-representational art using Midjourney, six participants mentioned exploring abstract concepts, particularly emotions, with a focus on stylistic elements rather than concrete representations. This approach, shaped by the abstract nature of the concepts, often resulted in a more forgiving evaluation. P15, for example, used song lyrics and poetry as prompts, stating: "I’ll put that into the system to see what the AI visualizes as that feeling." P15’s openness to subjective interpretations of abstract concepts, as in "what is it interpret when I put things like virtually painful or exhausting or depression even?", allowed for a wide range of acceptable outcomes. P12’s experience of creating an abstract piece that resembled a cityscape from random inputs showcases how the evaluation in non-representational art often focuses on creative interpretation rather than precise depiction. As explained by P12: "as you can see the result is actually kind of nice, right? You know it’s abstract. Looks like it might be a cityscape, right? These might be like buildings or something."
4.2.3 Evaluation Criteria.
In assessing their AI-generated images, participants employed various criteria, often articulating what they liked or disliked about specific outputs. This process involved mapping their prompts to particular attributes in the images, forming a set of evaluation criteria. Broadly, these criteria fell into two categories: the subject matter, concerning "what is in the image," and visual design, focusing on "how it is presented." When evaluating the subject matter within a generated image, participants mentioned: the desired level of realism (n = 11), details within the image (n = 5), and the accuracy of the content (n = 3). When evaluating visual design, the characteristics participants mentioned were: color (n = 11), composition (n = 7), depth (n = 6), texture (n = 4), symmetry (n = 3), sharpness (n = 3), movement (n = 3), feeling right (n = 3), and coherence (n = 2).
4.3 Users’ Prompt Journey: Prompt Refinement Processes
During iteration, after evaluating the output image, participants performed a diverse set of actions to modify and refine their prompts if they felt the current prompt was generating undesired results. Prompt refinement processes move participants closer to their ideal output image and are often performed in steps, incrementally modifying the prompt and evaluating the resulting images. A detailed list of these prompt refinement processes can be found in Table 2.
One refinement strategy used by our participants is adding words to describe the content in more detail compared to the original prompt. During image evaluations, participants paid attention to visual qualities and subject matter in their outputs. Consequently, when specific visual elements or content were lacking in the image, participants elaborated on their prompt. P18 shared an example of adding ’dark’ and ’bowl’ to refine their prompt: "I was trying to be more specific where the wood was darker. So, I said a dark brown wooden flat bowl plate." These additions reflect different prompting structure strategies (section 4.1), from adding chunks to including adjectives within a sentence.
Additionally, some participants (n=7) mentioned stepping back and omitting words as part of their refinement process. As P2 demonstrated one example iteration for us, they realized adding to the prompt was not working, so they changed their strategy: "shrinking it back down as you notice, the more I am adding, it is not helping. It is not making it better. So, I am going to go back down to simplicity." Some participants also used this strategy iteratively to understand the prompt and image relationship. For example, P17 explains that they removed a word to determine whether it contributed to the output image: "so then I said, let me take out the style of the artist and let me see if it [Midjourney] just recognizes ’Googie’ [an architectural style] in general."
Some participants (n=11) mentioned that word order, repetition, and exaggeration were effective methods to emphasize specific aspects of their prompts or accentuate a theme. Consequently, when the generated image missed or lacked something, they refined their prompt by re-ordering, repeating, or exaggerating words. By rearranging the phrases in the prompt, P2, who was experimenting throughout the interview, came closer to their desired visual qualities in a spider portrait: “having the descriptors at the front [of the prompt] and spider at the end seems to be making a difference.” While P2’s approach stemmed from their own trial and error, P12 adopted an exaggeration strategy observed from other users: "exaggerating, and that’s what I find in Midjourney [...] if you really wanted to make sure that the figure ’Bob Ross’ is going to be tall, really tall, and very, very tall or like, keep overstating that and I’ve even seen some people say that technique." In addition to these linguistic strategies, a few participants adjusted the weights as part of their refinement process. Adjusting weights can have effects similar to re-ordering or exaggerating, allowing users to subtly influence the model to focus on specific aspects of a sub-prompt (see section 4.1.4).
The degree of understanding of how parameters influence output images varied among participants. Some participants, like P1, demonstrated a clear grasp, explaining how adjusting the aspect ratio impacts their generated image: "if you put a --9:16, that gives you much better portraits... But if I do it in the square format, there’s not enough [room]." Their statements illustrated an awareness of parameter effects. However, other participants, like P8, acknowledged using certain parameters without fully grasping their impact on the generated image: "you know, --16 by 9, --quality 25, whatever. I don’t know much difference [it makes] there or if it even really does anything."
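As an illustration of such parameters, the helper below appends flags to a prompt. The ‘--ar’ (aspect ratio) and ‘--quality’ flags do exist in Midjourney, though their accepted values vary across versions; the function itself is our sketch for exposition, not any official API.

```python
# Sketch of appending Midjourney-style parameters to a prompt (our
# illustration; flag behavior and accepted values vary by version).
def with_params(prompt: str, aspect_ratio: str = None, quality: float = None) -> str:
    parts = [prompt]
    if aspect_ratio:
        parts.append(f"--ar {aspect_ratio}")   # e.g. 9:16 for portraits, as P1 noted
    if quality is not None:
        parts.append(f"--quality {quality}")   # render quality, cf. P8's "quality 25"
    return " ".join(parts)

print(with_params("portrait of a sea captain", aspect_ratio="9:16", quality=1))
# portrait of a sea captain --ar 9:16 --quality 1
```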
Finally, refinement sometimes (n=16) involved leveraging Midjourney’s capability to generate diverse outputs from the same input. This technique, known as re-rolling or making variations, involves participants rerunning the same prompt, capitalizing on Midjourney’s inherent variability. P2 explained this approach, stating, "you can still put in the same prompt and it’s [Midjourney] gonna come up with something totally different." This approach was particularly useful at different refinement stages, with some, like P5, starting to explore variations after achieving a satisfactory base result: "I guess the first thing that I look for is, you know, the accuracy from my perspective of what I was after and then versioning from there." Novice users, in particular, tended to depend more on this method. For example, P3, a novice, utilized re-rolling without modifying their prompt, in hopes of achieving a more favorable outcome akin to "roll[ing] the dice and hop[ing] for a better number." As P3 explained, they used this method because they were unsure how else to refine: "so, I wouldn’t know what to add or if I needed to add pluses or take away the commas to make that work." This highlights how some participants, like P3, often relied on Midjourney’s stochastic nature to explore possibilities, especially when uncertain about how to adjust prompts effectively.
4.4 Users’ Prompt Journey: Challenges
Building upon our initial inquiry, our second research question seeks to examine the challenges users experienced during their prompt journey. We surfaced two themes of challenges influencing the users’ prompt journey: challenges in aligning user intentions and generated outputs, and challenges in mastering prompt crafting knowledge.
4.4.1 Challenges in Aligning User Intentions and Generated Outputs.
One of the challenges participants faced throughout their prompt journey was the misalignment between their intended goals and Midjourney’s interpretation of their prompts. Participants who had targeted goals (see section 4.2.1) and a well-defined vision for their desired outcomes frequently encountered frustration when Midjourney failed to recognize critical elements or aspects of their prompt. This challenge was illustrated by P14’s experience, where their explicit mention of "flying car" was not translated into the AI-generated result: "the fact that I said flying did not really get translated into this." Similarly, P11 faced an unintended blend of elements: "Astronaut was supposed to be with fish, and they blended the two together, which was not what I was looking for."
Some participants (n=5) expected the AI to interpret words and concepts as if it had human-like understanding, leading to surprise and confusion when the AI’s interpretations significantly diverged from human-like comprehension. This mismatch was illustrated by P3’s example: "I did this one yesterday. headless horsemen and see he has a head." P5’s struggle to create a "faceless woman" further exemplified this challenge. Despite trying various descriptors such as “faceless,” “no face,” “blurry face,” and a detailed description excluding facial features, P5 found that Midjourney could not discern these nuances without explicit commands like ’--no’: “I have no idea why that prompt doesn’t work because it seems like the simplest, most boiled down way to say that.” Moreover, this mismatch in interpretations extended to how different words might encode distinct visual representations. P11’s experience with different outcomes for synonyms like ’raven’ and ’crow’ highlighted how the "AI’s interpretation" of similar words could result in distinct visual representations.
The consequences of this misalignment extended beyond mere frustration. In some cases, particularly when aiming for targeted goals, participants felt compelled to abandon their prompt iteration entirely, as Midjourney did not generate the desired results. P16 expressed their inability to achieve specific images, highlighting a desire for more interactive feedback from Midjourney: “I couldn’t get her face wrapped in golden threads. I’m not sure why. Maybe the thing to do would be to have a feature to say it didn’t work.” P5 and P6 experienced such dead ends, finding themselves in a cycle of prompt refinements without seeing effective changes: “I have no idea why that prompt does not work” (P5). Consequently, some, like P18, resorted to abandoning their initial prompts after numerous unsuccessful attempts: “If I try several times, maybe like ten times, and if the outcome is something that I do not like, I just get rid of it... I just make a new prompt using different words or different concepts.” Despite these challenges, the unpredictability of the AI model became its own reward for some, who either stopped their efforts when an output was unexpectedly pleasing or were motivated to explore new creative avenues. For instance, P14 shared an experience where Midjourney led to a new creative direction: "the headdresses that kind of took me in a direction where ohh that’d be kind of cool if it had more of like, a Native American folklore aspect to it."
4.4.2 Challenges in Mastering Prompt Crafting Knowledge.
Our participants were exposed to an overwhelming volume of information to assimilate; many (n=10) mentioned maintaining separate documents to keep track of it. P5, for example, attempted to copy the entire FAQ before ultimately abandoning the effort: "I found that I was just copying and pasting the entire FAQ. So then I stopped that.” The challenge was compounded by the information’s lack of actionable insights. For example, P5 struggled to apply the ‘stylized’ parameter, despite it being mentioned in the FAQ, highlighting the gap between theoretical knowledge and practical application: “If I couldn’t connect that to what’s happening [in the image], then I didn’t know what to do with that information.”
Given these challenges, most participants (n=15), like P16, turned to peers’ work as an easily accessible learning tool: “Just other people’s work is pretty much all. I didn’t feel like I had the time or energy [to explore other resources].” As P4 explained: "I learned a lot from their phrasing, from their sequence of terms, use of commas, use of capitalization." Yet, observing and reusing peers’ prompts raised ethical questions for some of our participants (n=5), as P8 described: “I’m going into their prompts and it feels sometimes a little weird like I’m gonna steal your [prompt] and I’m gonna make my stuff with it...and it feels kind of odd… It sounds like learning. So I guess it’s fine.”
While many participants found observing others’ prompts useful, the diversity in strategies led to confusion about how to structure prompts effectively, as P15 expressed: “I’ve seen, some people do an input plus another input and I don’t understand it yet.” This confusion was further compounded by the difficulty in determining the appropriate level of detail and granularity for prompts, with P3 comparing it to: “it’s like cooking without a cookbook.” P16 echoed similar sentiments, expressing confusion over subtle differences in prompt details, like “insanely detailed” vs “very detailed.”
As some participants, like P4, became more adept at navigating Midjourney, they shifted from relying on communal prompts to engaging in more independent explorations: “Now that I understand the structure of them better, I’m more interested in just exploring them for myself... I don’t look at other people’s prompts as much as I used to.” Yet, for some (n=7), like P13, prompt writing remained a collaborative and social activity: "I’m interested in that where like people are feeding off of each other’s [prompts]."