DOI: 10.1145/3613904.3642867

VIVID: Human-AI Collaborative Authoring of Vicarious Dialogues from Lecture Videos

Published: 11 May 2024

Editorial Notes

The authors have requested minor, non-substantive changes to the VoR and, in accordance with ACM policies, a Corrected VoR was published on July 26, 2024. For reference purposes the VoR may still be accessed via the Supplemental Material section on this page.

Abstract

Lengthy monologue-style online lectures cause learners to lose engagement easily. Designing lectures in a “vicarious dialogue” format can foster learners’ cognitive activities more than a monologue format. However, designing online lectures in a dialogue style catered to the diverse needs of learners is laborious for instructors. We conducted a design workshop with eight educational experts and seven instructors to present key guidelines and the potential use of large language models (LLMs) to transform a monologue lecture script into pedagogically meaningful dialogue. Applying these design guidelines, we created VIVID, which allows instructors to collaborate with LLMs to design, evaluate, and modify pedagogical dialogues. In a within-subjects study with instructors (N=12), we show that VIVID helped instructors select and revise dialogues efficiently, thereby supporting the authoring of quality dialogues. Our findings demonstrate the potential of LLMs to assist instructors with creating high-quality educational dialogues across various learning stages.

1 Introduction

Online lectures are widely used for conveying knowledge in various learning contexts. Notably, instructors often use them as educational resources in teaching contexts like flipped learning [36] or as supplementary materials [8, 24], usually in the format of knowledge-transfer-oriented online lectures. However, these lectures typically take the form of a lengthy monologue. This format can cause learners to feel disengaged or quickly lose interest [4], potentially resulting in persistent negative emotions that detrimentally affect learning outcomes [28]. To address the limitations of this lecture format, studies have explored the application of conversational agents (CA) to video-based learning [55, 67, 68, 76]. Many studies used CAs to mimic human tutoring behaviors such as scaffolding [30], and these direct interactions with CAs have improved the learning experience of online learners.
Although these studies imply the importance of CAs’ scaffolding mechanisms in online video lecture settings, they have mostly supported learners who prefer to interact directly with an instructor and peer learners [65, 66]. Yet, for vicarious learners, who prefer to learn from others and actively process the interactions of others, interactions that can be vicariously processed are more beneficial to their learning [65, 66]. To enhance vicarious learners’ experience, systems with multiple CAs that simulate interactions between an instructor and a direct learner [68] have been introduced based on vicarious learning theory [49]. Vicarious learning theory explains the benefits of learning that arise when vicarious learners observe tutoring between an instructor and a direct learner who interacts directly with the instructor in a video lecture [2, 11, 12, 49]. These studies found that vicarious learners preferred dialogic lecture videos incorporating CAs over monologue-style lecture videos, which had a positive effect on students’ engagement [68]. Therefore, introducing vicarious dialogue into monologue-style lectures is a promising way to address the limitations of conventional online lectures and satisfy vicarious learners.
However, the current approach has not yet addressed how to create high-quality dialogues that cater to vicarious learners through adaptation or expansion of the original lecture contents. It is important to consider the quality of the learning content, such as the level of detail provided by the lecturer, because it can significantly affect a vicarious learner’s cognitive load and engagement. Thus, rather than simply enhancing lectures, we converted the original lecture script into a format that can reduce the cognitive load for vicarious learners and identified a pedagogically meaningful format for high-quality dialogue. To do this, we integrated LLMs into the conversion process, since LLMs have been discussed as a feasible way to design vicarious dialogues while reducing the extra effort for instructors [68]. In short, this work aims to alleviate the manual effort of instructors authoring vicarious dialogue and establish a scalable pipeline for designing educationally high-quality dialogues from lectures.
To achieve this goal, we developed five guidelines for transforming monologic lectures into vicarious dialogues that can benefit online learners: Dynamic, Academically Productive, Cognitively Adaptive, Purposeful, and Immersive. As an initial step in crafting these guidelines, we conducted an iterative inductive literature analysis to define what constitutes a pedagogically meaningful dialogue. However, most existing literature focuses on insights derived from classrooms or intelligent tutoring systems, not video lectures. Furthermore, there is limited research on transforming the content of video lectures into high-quality educational dialogues. Therefore, we conducted a design workshop with eight educational experts and seven secondary school teachers to tailor the guidelines to a STEM video learning setting.
To facilitate the efficient authoring of video-based vicarious dialogues based on our guidelines, we propose VIVID (VIdeo to VIcarious Dialogue), a system that allows instructors to design, evaluate, and modify vicarious interactions within video lectures. To power this system, we propose a collaborative design process between the LLM and instructors to generate high-quality vicarious dialogues efficiently. This process consists of three stages, guided by the guidelines developed in the workshop: (1) Initial Generation: after an instructor chooses which part of a lecture to convert, the LLM configures a direct learner’s understanding level for each concept in the selected section and generates initial dialogues; (2) Comparison and Selection: instructors compare and select from multiple generated dialogues; and (3) Refinement: instructors collaborate with the LLM to refine the final dialogue, which replaces a section of the video lecture.
To determine whether VIVID helps instructors transform monologue lectures into high-quality dialogue lectures, we conducted a within-subjects study with 12 instructors. VIVID helped instructors simulate a direct learner effectively through co-design. Furthermore, instructors found VIVID significantly better than the Baseline at monitoring essential considerations when designing dialogue (p = 0.04, Cohen’s d = 0.8). To evaluate the pedagogical quality of dialogues authored through VIVID, we also conducted a human evaluation with six secondary instructors on four criteria: whether the dialogue is Dynamic, Academically Productive, Immersive, and Correct. We found that the dialogues made with VIVID were of significantly better quality on most criteria than the dialogues generated with the Baseline.
The contributions of this work are as follows:
Design guidelines, derived through a design workshop, for creating vicarious educational dialogues from lecture videos.
VIVID, a system in which instructors collaborate with an LLM to author vicarious dialogues from monologue-style lecture videos.
Findings from a user study with 12 instructors showing how VIVID can assist instructors in dialogue authoring (Section 6.2), and a technical evaluation with six instructors that demonstrates the higher quality of dialogues created by instructors using VIVID compared to the Baseline (Section 6.4).

2 Related Work

We reviewed previous research on simulating vicarious learning in online learning contexts and approaches for generating diverse educational dialogues at scale.

2.1 Simulating Vicarious Learning in an Online Learning Environment

Vicarious learning [11, 19] in an online environment typically occurs when observing the interaction between other learners and an instructor on platforms like Zoom or when witnessing peer discussions on QA platforms. Such situations of vicarious learning can stimulate learners’ cognitive activity and enhance their level of engagement.
Thus, research has employed conversational agents (CAs) [25, 31, 37, 60, 72] to simulate interactions between a virtual tutor and tutee to support vicarious learners in video-based learning. For instance, Nugraha et al. [55] explored how adding a CA in the role of a tutee to Massive Open Online Course (MOOC) videos could enhance vicarious learners’ learning experiences. Similarly, Tanprasert et al. [67] implemented vicarious interaction in MOOCs as if participating in a Zoom class: they added scripted vicarious dialogues between virtual learners and an instructor to a lecture video in a chat format. These studies showed that learners preferred dialogue-like lecture videos with CAs that mimic vicarious interactions over monologue lecture videos. This type of interpersonal interaction positively impacted vicarious learners’ engagement.
However, it is important to note that these studies employed manually crafted dialogues assumed to be of equal quality, even though the quality of dialogues can significantly influence learner engagement and outcomes [57]. Furthermore, there is limited research on designing high-quality educational dialogues to facilitate vicarious learning in video-based learning contexts. Consequently, we aim to fill this research gap by developing guidelines for creating high-quality educational dialogues that can promote effective vicarious learning experiences [68].

2.2 Generating Diverse Educational Dialogues for Vicarious Learners at Scale

Large Language Models (LLMs) are becoming increasingly useful for educators [48, 73]. One promising area of research involves utilizing them to create a wide range of educational materials [18, 61]. For example, Wang et al.  [74] found that large pretrained language models (PLMs) can automate the generation of educational assessment questions. Other approaches introduce question generation models that automatically produce questions from educational content such as textbooks [74, 75].
However, these approaches focus primarily on scaling the generation of specific question types and provide solutions mainly at the model level, without considering the needs of instructors and learners. In contrast, Promptiverse [38] proposes a novel approach aimed at reducing the workload for instructors while delivering useful and diverse prompts to learners. Furthermore, ReadingQuizMaker [45] introduces a system that enables instructors to conveniently generate high-quality questions. Both systems allow instructors to create prompts or questions at scale, but neither considers the learners’ level when generating them. Furthermore, they mainly focus on enhancing the diversity of single prompts or quizzes. Consequently, applying these approaches to generating diverse educational dialogues, which involve dynamic interactions between tutees and a tutor, may present challenges.
Various uses of LLMs in generating learning materials, such as code explanations [40] and learning objectives [62], have been evaluated against general criteria such as “easy to understand” or “accuracy”, without thoroughly considering quality for the specific task. However, ensuring quality requires specific, measurable criteria tailored to each task. Moreover, integrating LLMs into educational practice requires balancing their use with the role of instructors, since relying solely on an automatic LLM pipeline may result in low quality. Thus, we aim to establish criteria for assessing educational dialogues and propose an LLM-based pipeline that can generate high-quality dialogues at scale while considering vicarious learners. Further, based on this pipeline, we design an interactive system that allows collaboration between the LLM and instructors in authoring dialogues.

3 Design Workshop

To develop guidelines for designing high-quality vicarious dialogues, we employed a two-step approach. In the first step, we conducted an iterative inductive literature analysis to define what a pedagogically meaningful dialogue should look like. Despite the increasing amount of research on video learning, there has been little research on how to design beneficial vicarious dialogues based on lecture videos and how to support instructors in doing this. To address these issues, in the second step we conducted a design workshop to develop new guidelines for designing vicarious dialogues in the context of video-based learning.

3.1 Utterance Patterns and Teaching Strategies

Two of the authors conducted an iterative inductive analysis of the literature to define what constitutes a pedagogically meaningful dialogue. To identify relevant literature, we conducted a query-based search following the PRISMA process [51] on Google Scholar and the ACM Digital Library, and 50 final papers were selected for meta-analysis. Our review was based on three search queries related to the main keywords (the detailed analysis method is in the Supplemental Material):
Vicarious Learning: "vicarious learning" + ("learning gain" OR "tutorial dialogue" OR "monologue")
Classroom Interaction: "classroom interaction" + "science" + "dialogic" + "teacher questioning" + ("secondary school" OR "undergraduate")
Human Tutoring: "human tutoring" + "tutorial dialogue" + ("strategy" OR "move") +("scaffolding" OR "feedback")
Based on our literature analysis, we created initial guidelines for designing vicarious dialogue in video lectures. The vicarious dialogue should be perceived as a natural conversation occurring during a lecture and should be effective for vicarious learners. Thus, we derived two main factors for designing vicarious dialogues: (1) the most commonly observed utterance categories in real tutoring (Table 1, Table 2) and (2) effective teaching strategies for vicarious learners.

3.1.1 Key utterance categories that are commonly observed in 1-to-1 tutoring and classroom.

Several studies have collected data from actual one-on-one tutoring or classroom session recordings and performed qualitative coding at the statement level to classify representative types of utterances made by tutors and tutees. We categorized the tutor’s utterances into nine types (Table 1) and the learner’s utterances into five types (Table 2) to utilize for designing vicarious dialogues that simulate a natural tutoring scenario.

3.1.2 Three teaching strategies that can positively affect vicarious learners.

Research indicates that vicarious learners are notably affected by the direct learner’s discourse following the instructor’s statements, as vicarious learners tend to mimic the direct learner’s actions [11]. The three most influential dialogue patterns in vicarious learning include:
Integrate a direct learner’s cognitive conflict. A tutoring video that contains a cognitive conflict situation, where the instructor corrects errors made by the direct learner, can improve the attention and interest of vicarious learners [11, 19, 54]. Thus, we propose designing dialogues as if the instructor encourages the direct learner to reach confusion and addresses misconceptions productively [29, 39, 70].
Integrate a direct learner’s deep-level reasoning questions. We suggest incorporating deep-level reasoning questions  [16, 17, 21, 22, 27] that address comparisons, inferences, and causal relationships among concepts into the direct learner’s utterances. According to previous research in Intelligent Tutoring System (ITS) [16, 17, 21, 22, 27], the vicarious learners’ learning was significantly improved when the direct learner posed deep questions. Therefore, if a direct learner asks a deep-level reasoning question during a lecture, it can encourage vicarious learners to engage in critical thinking.
Integrate a direct learner’s substantial and relevant follow-up responses. A direct learner should provide answers or self-explanations based on the learning contents followed by an instructor’s scaffolding or lecturing [11, 12, 16, 23].
Table 1:
Self-monitoring: Utterance related to self-monitoring of one’s teaching style [13].
Lecturing: Utterance explaining declarative knowledge, which includes facts and conceptual principles [6, 9, 10, 13, 34, 44].
Demonstrating: Utterance related to solving specific problems in a way that allows students to model the instructor’s problem-solving approach [10, 44].
Questioning: Utterance containing questions to encourage students to recall knowledge or think productively (e.g., deep-level reasoning/short-answer questions) [3, 5, 6, 9, 13, 14, 32, 34, 41, 42, 46, 52, 56, 59, 69, 71].
Off-topic: Introduction or unrelated utterances, such as small talk not related to learning [10, 42, 44, 69].
Summarizing: Utterance that summarizes what has been done so far or restates a student’s questions/statements/comments [3, 10, 34, 42, 44, 46, 56].
Answering: Utterance in response to student questions [3, 13, 42, 44, 69].
Scaffolding: Utterance involving assistance or hints to help students reach answers on their own [3, 5, 6, 7, 9, 10, 13, 14, 20, 32, 34, 42, 44, 46, 50, 52, 53, 56].
Diagnosing: Utterance used to diagnose student understanding or progress [6, 13, 44].
Table 1: Nine categories of tutor utterances and their corresponding definitions.
Table 2:
Questioning: Utterance related to posing cognitively deep questions or simple questions to the instructor [13, 44].
Answering: Utterance related to providing responses or completing scaffolding in response to an instructor’s question [13, 42, 44].
Reflecting: Utterance related to assessing one’s understanding level in response to an instructor’s question or voluntarily [13, 44].
Explanation: Utterance related to speaking spontaneously, as if articulating one’s thoughts simultaneously, without necessarily being prompted by the instructor’s scaffolding [13, 44].
Off-topic: Introduction or unrelated utterances, such as small talk not related to learning [42, 44].
Table 2: Five categories of tutee utterances and their definitions.

3.2 Workshop Overview

To verify and improve the literature-based guidelines for vicarious interactions in video learning environments, we conducted design workshops with eight educational experts (7 female, 1 male) and seven secondary school teachers (5 female, 2 male). We aimed to (1) derive design guidelines for effectively converting monologue-style lecture videos into dialogue-style videos and (2) discover design opportunities for a system that can facilitate easy authoring of dialogue-style lectures with an LLM. We mainly targeted STEM lectures in our workshop: STEM lectures can impose more intrinsic cognitive load, require more critical thinking than other subjects, and be prone to disengagement because they mostly consist of abstract concepts and complex formulas [63, 64]. Thus, we decided to present STEM lectures in a dialogue format, as it might help with processing their dense knowledge.

4 Findings from Design Workshop

We identified the two most commonly mentioned issues by participants and formulated five design recommendations for creating high-quality vicarious dialogues. Additionally, we propose how LLM can be integrated into the educational dialogue authoring process.

4.1 Challenges in Converting Video Lectures to Dialogue

Two challenges were observed when instructors converted video lectures to dialogue.
Challenge 1: Designing the overall structure of dialogues. We observed that participants faced difficulties in designing the overall structure of the dialogue when creating it from scratch. Participants mostly struggled first with which part of the lecture should be converted to dialogue. P3 mentioned that it was “difficult to figure out which parts of a monologue should be transformed into direct learner’s questions” and P4 said it was “hard to decide when and how much dialogue to create”. This poses a cold start problem when designing dialogues with the vicarious learner’s learning improvement in mind. Furthermore, participants struggled to determine the appropriate format for the dialogue, as they were unsure how the dialogue format would affect learning outcomes. P5 said that “while it was easy to convert the lecture into a simple question-and-answer format, I’m not sure if these would be meaningful dialogues for vicarious learners”. P15 also mentioned, “If it ends up looking too similar to the original lecture format, converting the material to a dialogue format might not be necessary”, asserting the need to define what kind of dialogue format would be helpful for vicarious learners in an online learning environment.
Challenge 2: Anticipating the direct learner’s utterances based on their level of understanding. Both instructors and experts struggled with designing a direct learner’s utterances. This is evident from comments: “It is hard to add direct learners’ misconceptions to dialogues effectively” (P15) and “It was difficult to consider individual responses of the direct learners” (P7).

4.2 Design Recommendations to Consider While Designing Dialogue for Vicarious Learners

Based on the challenges above, we propose five dialogue design recommendations. Furthermore, we suggest four teaching strategies (Table 7 in Appendix), drawn from the pre-defined literature-based guidelines (Section 3.1.2), that workshop participants validated as likely to be effective even in a video-based learning context.
DR1. Dynamic: Include various interaction patterns to reflect the dialogic dynamics between the tutor and tutee. A vicarious dialogue should be structured with fast turn-taking and various utterance patterns (Table 1, Table 2) that capture the dynamism of an actual tutoring scenario. Moreover, P14 mentioned that "fast turn-taking is required to hold the attention of vicarious learners in online education, as it is more difficult to retain focus on digital learning platforms than in physical classrooms". Furthermore, instructors and experts often divided the tutor’s lengthy utterances into smaller sub-dialogues between the tutor and the direct learner, highlighting the quick turn-taking in vicarious dialogues.
DR2. Academically productive: Encourage the metacognitive and constructive utterances of the direct learner to make a dialogue academically productive. Direct learners’ utterances should be pedagogically meaningful to enhance vicarious learners’ learning and engagement. Most workshop participants consistently emphasized the influence of direct learners on vicarious learners throughout the dialogue design process. Notably, they stressed the importance of direct learners displaying "interactive engagement" in dialogues, as vicarious learners are highly likely to empathize with the direct learner’s learning process. The term "interactive engagement" refers to the active engagement of direct learners both cognitively and metacognitively.
Direct learner’s cognitive engagement: P15 highlighted the importance of the tutor in a vicarious dialogue encouraging active engagement by facilitating connections between direct learners’ existing knowledge and the new material, citing Ausubel’s meaningful learning theory [33]. In addition, P14 mentioned that “When the instructor links the learning contents with the learner’s personal experiences, transfer learning occurs more easily”.
Direct learner’s metacognitive engagement: P15 and P9 proposed incorporating self-assessment and explanations of understanding from the direct learner into vicarious dialogues: “When a direct learner self-assesses their level of understanding or performs self-summarization, a vicarious learner could potentially check their comprehension”. In addition, P15 suggested that a tutor continuously promote the direct learner’s metacognition. This guidance aligns with the findings that in an ITS [1, 47], the constructive actions of a direct learner, such as answering based on what they learned from the instructor’s scaffolding and asking deep-level reasoning questions [35, 47], significantly influenced the learning outcomes and participation of vicarious learners.
DR3. Cognitively adaptive: Adapt the teaching strategies to the level of understanding of the vicarious learner, learning objectives, and lecture contents. Previous literature suggests that strategies requiring higher cognitive engagement, like inducing cognitive conflicts and posing deep-level reasoning questions, benefit vicarious learners [16, 17, 21, 27]. However, applying cognitively demanding strategies, like cognitive conflict in Table 7 in Appendix, may not always suit all learning materials or learners when converting lecture videos into dialogues. P15 noted that the choice of cognitive strategy may vary depending on the granularity of the learning content being transformed into a dialogue. In addition, he emphasized the importance of aligning cognitive strategies with learning objectives and the level of vicarious learners, stating that “Frequent placement of lighter, easily answerable questions and minimal use of cognitive strategies on important content could lower the cognitive load on vicarious learners”.
DR4. Purposeful: Define a learning objective for the vicarious learner and ensure that the learning objective is achieved through that dialogue. To create meaningful dialogue for vicarious learners, we recommend aligning the dialogue’s goal with the vicarious learner’s learning objective and illustrating the achievement of this objective through interactions between a direct learner and a tutor. P15 and P8 emphasized the importance of defining clear learning objectives for vicarious learners as an initial step in dialogue creation. Additionally, P8 highlighted that learning objectives should be intimately tied to the difficulties vicarious learners face.
DR5. Immersive: Utilize realistic teaching scenarios and match the direct learner’s cognitive level with the vicarious learner’s level. We suggest considering two factors that can immerse vicarious learners in their vicarious interaction.
Incorporate common teaching scenarios: Some participants suggested using real classroom scenarios for vicarious learner engagement. For example, P11 proposed scenarios in which the direct learner is given an incorrect problem and asked to explain what is wrong and a situation where another learner responds correctly to the tutor’s question when a student gives wrong answers. P14 also suggested a scenario where a tutor makes the direct learner apply what they have learned in different examples.
Match cognitive levels: Instructors and experts highlighted aligning the cognitive levels of direct and vicarious learners in lecture videos to benefit the vicarious learners: “Vicarious learners often lose interest when confronted with familiar material but are more likely to engage when unfamiliar or essential information is presented.” (P12). Therefore, addressing vicarious learners’ unfamiliar or challenging parts through direct learners’ dialogue could be an effective way to design meaningful and high-quality dialogue.

4.3 Enhancing the Educational Dialogue Design Process with LLMs

After establishing guidelines, we explored how instructors and experts used LLM-generated dialogues and developed evaluation criteria (Table 5) for evaluating their pedagogical quality based on how workshop participants assess the dialogues (Table 3). We also explored strategies for integrating LLMs into the educational dialogue design process.
Table 3:
Dynamic: Are various interaction patterns (Table 1, Table 2) incorporated to reflect the dynamics of real classroom dialogue?
Academic Productivity: Is the teacher effectively eliciting the learner’s metacognitive and constructive utterances to ensure the discourse is academically productive?
Cognitive Adaptability: Are the cognitive strategies used in the dialogue adaptively applied based on the vicarious learner’s level, learning objectives, and the lecture contents?
Purposefulness: Is the learning objective of vicarious learners achieved through the dialogue between the direct learner and teacher?
Immersion: Does the dialogue represent realistic teaching scenarios and establish a direct learner’s level comparable to that of a vicarious learner, thereby improving vicarious learner engagement?
Usefulness: Is the dialogue satisfactory and useful, considering personal experience with students, what an instructor wants to emphasize, and the instructor’s usage context, such as the level of vicarious learners being targeted?
Correctness: Are domain-specific words used accurately, and is the conversation content based on facts?
Table 3: Criteria when instructors evaluated the pedagogical quality of LLM-generated dialogues in our design workshop.

4.3.1 Utilization of LLM-Generated Dialogues.

We propose two ways in which the LLM could enhance the dialogue design process for vicarious learners. Firstly, it can provide pre-generated dialogues, stimulating instructors’ ideation. P2 commented that using the LLM felt like it provided helpful guidelines, making it more effective than starting from scratch. Secondly, it can assist in modifying dialogues at different levels, refining sub-dialogues and crafting direct learners’ responses. Participants proposed presenting expected responses at different levels (P12) and automating the process of generating questions from the direct learner’s perspective (P2).
Despite the LLM’s advantages, the dialogue authoring process still requires active instructor involvement. In our observation, we have noted that instructors have their own set of criteria when designing high-quality dialogues. These criteria are based on their teaching experiences and can vary depending on the instructor’s emphasis on specific aspects where they believe vicarious learners may face challenges. Guided by these personalized criteria, instructors designed and revised their dialogues.
Some instructors found the generated dialogues satisfactory because they aligned with their intended teaching points or teaching style. P13 chose the dialogue, stating “When teaching math, using fewer variables is better. So, I initially emphasized reducing the number of characters and utilizing known information. The dialogue aligns well with my problem-solving approach that focuses on minimizing variables”. Some instructors didn’t use the dialogues because the content didn’t meet their quality criteria. For example, P11 made revisions to emphasize a specific point, stating, “The tutee’s question: ‘So, is x-2 the square root of 6?’ is crucial in the problem-solving process. It would be helpful if the tutor followed up with a question like, ‘What is the number that becomes 6 when squared?’ to elaborate on this point”.

4.3.2 Criteria for Evaluating the Educational Dialogues.

Instructors evaluated the quality of LLM-generated dialogue based on seven criteria (Table 3). Five of these criteria aligned with the key factors to consider when designing educational dialogues (Section 4.2), while the other two criteria, Usefulness and Correctness, pertain to evaluating dialogues generated by the LLM.

4.4 Design Goals

Based on LLM’s strengths and limitations in designing educational dialogue and criteria that instructors emphasized the most when evaluating the quality of dialogues (Table 3), we propose four design goals (DG):
DG1. Enable instructors to easily simulate direct learners.
DG2. Assist instructors in designing dialogues by referencing utterances generated at various levels of granularity.
DG3. Assist instructors in creating dialogues that reflect the user’s dialogue usage context and personal experience with students.
DG4. Ensure that instructors consistently monitor important considerations when designing vicarious dialogues.

5 VIVID: A System for Authoring Vicarious Dialogues from Monologue-styled Lecture Videos with LLM Assistance

Based on our design goals from the workshop, we developed VIVID, an LLM-based system to assist instructors in crafting vicarious dialogues from their monologue-styled lecture videos. While LLMs hold potential benefits for the dialogue design process, as detailed in Section 4.3.1, they may not be practically usable in real educational settings if Correctness and Usefulness (Table 3) are not ensured. Thus, VIVID provides a collaborative authoring process between the LLM and instructors, facilitating the generation of high-quality and correct vicarious dialogues. Based on our four design goals and the dialogue design process observed in the workshop, this collaborative authoring process consists of three stages: (1) Initial Generation, (2) Comparison and Selection, and (3) Refinement.
Figure 1:
Figure 1: VIVID’s key components. Initial Generation: (A1) User uploads a lecture video; (A2) User trims a video section to convert; (B1) User uses the highlighting feature by selecting a part of the video transcript where vicarious learners may face difficulty understanding; (B2) User writes down the learning context and the scenario of dialogue they want depicted in the final dialogue. Comparison and Selection: (C1) VIVID shows a rubric table of learners’ understanding level regarding key concepts stated in the transcript; (C2) VIVID presents generated dialogues in the form of dialogue cards comprising core information from each dialogue.
To motivate VIVID’s design, we describe a usage scenario where an instructor collaborates with the LLM to author a dialogue through VIVID. Sophia, a high school biology teacher, requires her students to watch recorded lectures before class. Sophia wants to make sure that students easily understand the parts of the lectures with the most common misconceptions. In this context, she uses VIVID to transform the sections of her recorded lecture where misconceptions frequently occur into dialogues so that her students gain a better understanding. Thus, she uploads her lecture video to VIVID (A1, Figure 1) and selects the sections she wants to transform into dialogues (A2).
Initial Generation. She then highlights areas where her students might develop misconceptions or key examples she wants to emphasize in the dialogue (B1, Figure 1). Sophia aims to design the dialogue scenario as if it is occurring in a high school biology class, where a teacher addresses the direct learner’s misconceptions in the dialogue (B2). Upon highlighting, VIVID generates four dialogues reflecting the dialogue scenario.
Comparison and Selection. VIVID shows the generated dialogues with an ‘understanding level rubric’ (C1, Figure 1), which shows four levels of learners’ understanding for each key concept in the selected part, and ‘dialogue cards’ (C2), which contain key information about each dialogue. Sophia compares each dialogue, considering the knowledge levels of the direct learner for each concept illustrated in the dialogue cards (C2). She then chooses to modify ‘Dialogue 2’ because it highlights the misconceptions she wants to include.
Figure 2:
Figure 2: VIVID’s key components of the Refinement phase: (D1) User can edit each utterance’s content directly or using basic editing tools; (D2) User can use the laboratory feature by selecting consecutive utterances and clicking the laboratory button (D4-1); (D3) VIVID suggests four variations of sub-dialogues as a result; the user can replace the original utterances with a variation by clicking the apply button (D4-2).
Refinement. Sophia modifies ‘Dialogue 2’ by adding questions to the tutor’s utterances to address the direct learner’s misconceptions. She clicks the Generate button (D1-A, Figure 2) to add a new utterance. However, she is unsure what answers the direct learner could provide for these newly added questions. To view different examples of how the learner might respond, she first selects the sub-dialogue containing the learner’s utterances she wants to see more variations of (D2). Afterward, she clicks the Laboratory button (D4-1), and VIVID generates four variations of the chosen utterances.
After reviewing the results, she wants to replace the existing utterances with new ones that better represent the learner’s misconceptions. She clicks the Apply button (D4-2) to replace the previous utterances with new ones. This allows Sophia to create a dialogue where misconceptions are effectively addressed in the final dialogue.
Figure 3:
Figure 3: Overview of the prompting pipeline for the Initial Generation phase. Each step corresponds to the following subsections: (1) Create a rubric for highlighted areas, indicating the learner’s understanding level for each concept; (2) Determine the direct learner’s understanding level using the highlighted parts and the rubric; (3) Create an answer sheet consisting of the learner’s expected answers to the tutor’s questions and questions showing where the learner struggles; (4) Generate dialogues based on the guidelines.

5.1 Initial Generation

VIVID initially creates various dialogues so that instructors can choose the one that aligns best with their intention for converting a monologue into dialogue, as we found that LLM-generated dialogues can be utilized as prototypes in the educational dialogue design process (Section 4.3.1). Notably, the LLM-based pipeline of the Initial Generation stage is designed to generate dialogues that satisfy the characteristics most emphasized by workshop participants: Dynamic, Academically Productive, and Immersive (DR1, DR2, and DR5 in Section 4.2). Furthermore, when generating dialogues, VIVID reflects instructors’ needs in the pipeline, letting instructors easily simulate direct learners with knowledge levels similar to their target vicarious learners (DG1 in Section 4.4). Thus, the Initial Generation stage consists of four steps to generate dialogues that finely adjust the direct learner’s knowledge state based on the instructor’s needs. We determined our final prompts (further details are in the Supplemental Material) by evaluating the quality of various dialogues against our evaluation criteria (Table 5).

5.1.1 Step 1. Create a rubric for highlighted areas, indicating the learner’s understanding level for each concept.

DR5 (Immersive) in Section 4.2 suggests that the dialogue should align the cognitive level of direct learners with that of vicarious learners. The highlighting feature allows instructors to highlight sections in the script that vicarious learners might find challenging, reflecting the instructor’s intention to design the dialogue for a specific level of vicarious learners. Therefore, VIVID leverages the highlighted sections to make assumptions about the level of vicarious learners and uses them to model the direct learner (DR5 in Section 4.2).
Before configuring the direct learner’s understanding state, we extract the core concepts of the selected area in the transcript and divide the direct learner’s possible understanding state of each concept into four levels. These levels are based on the cognitive domain of Bloom’s taxonomy [26], which instructors have long used to design, assess, and evaluate students’ learning [43]. VIVID then generates four understanding levels for each key concept with the LLM and presents them in a rubric format (C1, Figure 1). The understanding level here refers to the understanding state expected of direct learners when they learn new concepts from the instructor during the dialogue.
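To make this step concrete, below is a minimal sketch of how rubric generation might be implemented against the OpenAI chat completions API. The prompt wording, function name, and JSON schema are our illustrative assumptions, not VIVID’s actual prompts (which are in the Supplemental Material).

```python
# Hypothetical sketch of Step 1: generating a four-level understanding rubric.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC_PROMPT = """Extract the core concepts from the lecture transcript section
below. For each concept, define four understanding levels (level 1 = minimal
grasp, level 4 = full mastery), following the cognitive domain of Bloom's
taxonomy. Respond with JSON of the form:
{{"concepts": [{{"name": "...", "levels": {{"1": "...", "2": "...", "3": "...", "4": "..."}}}}]}}

Transcript section:
{transcript}
"""

def generate_rubric(transcript: str) -> dict:
    # The paper reports temperature 0.65 for rubric generation, chosen to
    # keep the rubric consistent across runs.
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0.65,
        messages=[{"role": "user",
                   "content": RUBRIC_PROMPT.format(transcript=transcript)}],
    )
    # Assumes the model follows the JSON instruction; production code would
    # validate the output and retry on malformed JSON.
    return json.loads(response.choices[0].message.content)
```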

5.1.2 Step 2. Determine the direct learner’s understanding level using the highlighted parts and the rubric.

The highlighted parts represent the concepts that the direct learner may not fully comprehend after the tutor’s explanation in the dialogue. We set the direct learner’s understanding level based on the highlighted concepts, using ‘level 1’, ‘level 2’, or ‘level 3’ in the generated rubric to indicate the direct learner’s knowledge deficits. For unhighlighted areas, the direct learner is set to the highest understanding level, ‘level 4’.
Figure 4:
Figure 4: Example of a dialogue generated without regard to the prerequisite relationships between key concepts. Concept A is a prerequisite for Concept B. During the conversation, the direct learner did not understand Concept A initially but grasped it through question-and-answer, and later answered questions about Concept B correctly.
The process of determining a direct learner’s understanding level does not consider prerequisite relationships between concepts, so that the generated dialogue can reflect varied levels of comprehension of each concept, as shown in Figure 4. For example, consider a case where Concept A is a prerequisite for Concept B. Even if the LLM sets Concept A at ‘level 1’ and Concept B at ‘level 4’, a scenario can be designed where the learner studies Concept A with the teacher to fill the knowledge gap (level 1) and then responds well to Concept B (level 4).
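A minimal sketch of this level-assignment logic, assuming the rubric format from the previous sketch; since the text does not specify how a deficit level between 1 and 3 is chosen, the sketch uses a fixed placeholder.

```python
def assign_understanding_levels(rubric: dict, highlighted: set) -> dict:
    """Map each rubric concept to the direct learner's understanding level.

    Highlighted concepts receive a deficit level (1-3); all other concepts
    are set to level 4 (full understanding). Level 2 below is an arbitrary
    placeholder for whatever selection policy the real pipeline uses.
    """
    levels = {}
    for concept in rubric["concepts"]:
        name = concept["name"]
        levels[name] = 2 if name in highlighted else 4
    return levels

# Example: the learner will show a knowledge gap only for "refraction".
print(assign_understanding_levels(
    {"concepts": [{"name": "refraction"}, {"name": "wavelength"}]},
    {"refraction"},
))
# -> {'refraction': 2, 'wavelength': 4}
```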

5.1.3 Step 3. Create an answer sheet consisting of the learner’s expected answers to the tutor’s questions and questions showing where the learner struggles.

We designed our prompt to create the questions and responses expected from the direct learner in a specific knowledge-deficit state. The expected answer sheet was designed in a descriptive format to reflect the learner’s nuanced understanding. We prompted the LLM to manipulate the expected answers to the instructor’s questions according to the learner’s knowledge level for each concept. We also designed a prompt to generate questions showing where direct learners struggle with the concepts set to a low level.
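A sketch of how such an answer-sheet prompt might look, reusing the per-concept levels from Step 2; the wording is an assumption, not the paper’s actual prompt.

```python
# Hypothetical sketch of Step 3: generating the expected answer sheet.
from openai import OpenAI

client = OpenAI()

ANSWER_SHEET_PROMPT = """A student is learning from the lecture section below
and has these per-concept understanding levels (1 = minimal, 4 = full):
{levels}

1. For each concept, write the descriptive answer this student would likely
   give to a tutor's question about it, reflecting their level.
2. For each concept at level 3 or below, write a question the student might
   ask that reveals where they struggle.

Lecture section:
{transcript}
"""

def generate_answer_sheet(transcript: str, levels: dict) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": ANSWER_SHEET_PROMPT.format(
                       levels=levels, transcript=transcript)}],
    )
    return response.choices[0].message.content
```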

5.1.4 Step 4. Generate dialogues.

The final dialogues are generated through prompts based on the following three elements, as shown in Figure 3: (1) the direct learner’s knowledge state information, adjusted through Steps 1 to 3, to achieve Immersive (DR5); (2) the key utterance categories of a tutor and a tutee in Table 1 and Table 2 to achieve Dynamic (DR1); and (3) the key teaching strategies described in Table 7 to achieve Academically Productive (DR2).
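The assembly of these three elements into a generation prompt might look like the following sketch. The template wording is assumed, and the `n=4` request mirrors the four candidate dialogues VIVID presents for comparison; the category and strategy inputs would be filled from Tables 1, 2, and 7.

```python
# Hypothetical sketch of Step 4: generating the candidate dialogues.
from openai import OpenAI

client = OpenAI()

DIALOGUE_PROMPT = """Convert the lecture section below into a tutoring dialogue
between a tutor and one direct learner.

Learner state (per-concept understanding levels, expected answers, questions):
{learner_state}

Use the following tutor/tutee utterance categories for varied, fast turn-taking:
{utterance_categories}

Apply these teaching strategies where appropriate:
{teaching_strategies}

Lecture section:
{transcript}
"""

def generate_dialogues(transcript, learner_state, categories, strategies, n=4):
    # n=4 matches the four dialogues shown in the Comparison and Selection stage.
    response = client.chat.completions.create(
        model="gpt-4",
        n=n,
        messages=[{"role": "user", "content": DIALOGUE_PROMPT.format(
            transcript=transcript, learner_state=learner_state,
            utterance_categories=categories, teaching_strategies=strategies)}],
    )
    return [choice.message.content for choice in response.choices]
```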
Figure 5:
Figure 5: Initial Generation pipeline. (a) Understanding level: example of the direct learner’s understanding level determined using the highlighted parts and the rubric; (b) Answer Sheet and Questions: example of the answer sheet consisting of the learner’s expected answers to the tutor’s questions and expected questions from the direct learner; (c) Generated Dialogue: example of the final dialogue based on our guideline-based prompt. The green box shows how the concept set to level 1 is reflected in the final dialogue.

5.2 Comparison and Selection

In the Comparison and Selection stage, VIVID provides instructors with the understanding level rubric (C1) and dialogue cards (C2) (Figure 1) to enable monitoring and selection based on the criteria that were important during the Initial Generation stage (DG4 in Section 4.4). Each dialogue card (C2) contains the primary information of the dialogue, such as the direct learner’s understanding level of each concept, key teaching strategies, and key dialogue patterns. The understanding level rubric presents a four-level understanding state for each key concept appearing in the selected part of the transcript.

5.3 Refinement

5.3.1 Basic tools for instructor’s direct refinement.

In the workshop, we observed that instructors were proficient in using existing dialogue content, like breaking down lengthy tutor utterances into smaller segments or incorporating script contents into dialogue. To facilitate this kind of authoring, VIVID provides four basic functions: add (D1-a), duplicate (D1-b), delete utterance (D1-c), and change speaker (D1-d). As visible in (D1), each utterance box in the final dialogue is clickable and can be moved with drag-and-drop (Figure  2). Additionally, we aimed to enhance the Correctness of the dialogue through direct refinement.

5.3.2 LLM-based refinement tool: Laboratory.

In addition to the basic functions, VIVID offers the Laboratory tool (D4-1), which provides alternatives (D3) for the selected sub-dialogues (D2) through the LLM (Figure 2). It is designed to address instructors’ challenges in developing direct learners’ utterances while considering their understanding level (Challenge 2 in Section 4.1) and to achieve DG3 (Section 4.4). To do this, we designed the prompt used in the Laboratory tool to maintain four key elements while leaving the original dialogue patterns free to vary (details in the Supplemental Material): (1) the learner’s level in the dialogue selected in the Comparison and Selection phase, (2) the dialogue context, (3) the main learning contents, and (4) the number of turns. We diversified the dialogue patterns by reflecting the utterance categories of Table 1 and Table 2 in our prompt. When the instructor clicks the Apply button (D4-2), the selected sub-dialogue (D2) is replaced with the new sub-dialogue.
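A hypothetical sketch of such a prompt, holding the four fixed elements constant while inviting varied interaction patterns; the actual Laboratory prompt is in the Supplemental Material, so this wording is purely illustrative.

```python
# Hypothetical sketch of the Laboratory tool's variation prompt.
from openai import OpenAI

client = OpenAI()

LABORATORY_PROMPT = """Rewrite the sub-dialogue below in four different ways.
Keep fixed: (1) the direct learner's understanding level: {learner_level};
(2) the surrounding dialogue context: {context}; (3) the main learning
contents: {contents}; (4) the number of turns: {n_turns}.
Vary only the interaction pattern, drawing on these tutor and tutee
utterance categories: {utterance_categories}.

Sub-dialogue:
{sub_dialogue}
"""

def laboratory_variations(**fields) -> str:
    # fields supplies learner_level, context, contents, n_turns,
    # utterance_categories, and sub_dialogue for the template above.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": LABORATORY_PROMPT.format(**fields)}],
    )
    return response.choices[0].message.content
```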

5.4 Implementation

VIVID is implemented using React, connected to a Flask-based back-end server that utilizes the GPT API. Whisper [58], an automatic speech recognition model by OpenAI, auto-generated the script of the section that the instructor chose from the lecture video (B1 in Figure 1). To address limitations of speech-to-text models, such as errors caused by noise or language, and to obtain a more precise dialogue conversion, VIVID allows instructors to modify the transcription output directly during the Initial Generation stage.
Subsequently, the system harnessed the API of the latest trained GPT-4, OpenAI’s advanced language model, to generate the rubric, the learner’s knowledge level, the predicted answer sheet, and the final dialogue. Considering the importance of model accuracy in an educational context, we conducted prompt engineering experiments using GPT-3.5 and GPT-4 and chose GPT-4 due to its superior generation quality. We set a temperature of 0.65 for rubric generation, determined empirically through trial and error to maintain consistency, and used the default temperature for all other features.
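For illustration, the transcription step with the open-source Whisper package might look like the sketch below; the model size and integration details are our assumptions, as the paper does not specify them.

```python
# Minimal sketch of the transcription step using the open-source Whisper
# package (pip install openai-whisper); model size is our assumption.
import whisper

model = whisper.load_model("base")

def transcribe_section(section_path: str) -> str:
    # Transcribes the trimmed lecture section; the instructor can then
    # correct the transcript before dialogue generation begins.
    result = model.transcribe(section_path, language="ko")  # study lectures were in Korean
    return result["text"]
```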

6 Evaluation

To evaluate the performance of VIVID in designing high-quality educational dialogues, we conducted a two-fold evaluation — user study and technical evaluation. In this section, we provide the details of each evaluation and results, respectively.
Table 4:
ID | Gender | Age | Career | Subject taught
P1 | F | 50s | 30 years | Math
P2 | M | 20s | 2 years | Science
P3 | F | 20s | 1 year | Science
P4 | M | 40s | 15 years | Math
P5 | M | 30s | 7 years | Engineering
P6 | M | 20s | 2 years | Math
P7 | M | 20s | 4 years | Engineering
P8 | M | 20s | 5 years | Math
P9 | F | 20s | 4 years | Math
P10 | F | 20s | 2 years | Science
P11 | M | 20s | Graduated teacher’s college | Math
P12 | F | 20s | 1 year | Math & Science
Table 4: User study participants’ demographics, careers, and the subjects they taught in the classroom.

6.1 User Study

VIVID is designed to autonomously generate Dynamic, Academically Productive, and Immersive dialogues between a tutor and a direct learner and support instructors in efficiently modifying them. To validate the efficacy of VIVID, we conducted a within-subjects experiment with 12 participants, comparing it with the baseline system that lacks VIVID’s core features.

6.1.1 Study Setup .

Participants were asked to transform a part of a lecture video, chosen by the authors, into a dialogue using the system under each condition. Participants experienced both conditions with different videos in a counterbalanced order to prevent bias and ensure validity. We analyzed user behavior logs, post-survey responses, and interview data to understand how our system supported the authoring process.
Baseline Condition The following describes how the Baseline system differs from the VIVID system with regard to the four design goals. In the Initial Generation phase, the Baseline utilized a simple prompt (the detailed prompt is in the Supplemental Material) to create a dialogue that did not reflect the learner’s understanding. Thus, the entire process of adjusting the direct learner’s knowledge through the highlighting feature (Section 5.1.1) was excluded. During the Comparison and Selection stage, the summarized card function and understanding level rubric were excluded from the Baseline, so participants compared and selected one out of four dialogues for revision without any background information about the generated dialogues. In the Refinement phase, the laboratory function, which offers multiple contextual alternatives for the sub-dialogue selected by the instructor, was removed.
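For contrast, the Baseline’s Initial Generation might reduce to a single unconditioned prompt along these lines; the actual Baseline prompt is in the Supplemental Material, so this wording is purely illustrative.

```python
# Illustrative contrast: a simple Baseline prompt with no learner modeling,
# no rubric, and no answer sheet. Wording is our assumption.
BASELINE_PROMPT = """Convert the lecture transcript section below into a
dialogue between a tutor and a student.

Transcript section:
{transcript}
"""
```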
Lecture Selection The clarity of the lecture video can have an impact on the quality of the resulting dialogue. Other factors, such as the length of the video, the difficulty of the content, and the subject matter, can also influence the dialogue creation process. Therefore, when selecting lecture videos, we carefully considered the lecturer’s explanation style and balanced the educational content and level of difficulty across all conditions. All videos were aimed at secondary school students, and we chose lecture content with similar prerequisite levels and granularity. Each video was in Korean and was approximately 10 minutes in length.
As we targeted STEM subjects, we selected two science and two mathematics lectures: the topics for the science lectures were the generation of waves and the refraction of waves, and the topics for the mathematics lectures were the exponential function and the logarithmic function. The mathematics lectures are presented in the format of writing-board screencasts with voice-over [15]. The science lectures are presented in the same format but based on slides. Each video follows a monologue-style lecture, where the instructor teaches without direct learners. The audio recording quality of all videos is at a level where the lectures can be watched without any issues.
Participants We recruited 12 participants via social media platforms, including the local community for instructors. The participants were required to 1) teach STEM subjects, 2) have experience in designing online lectures or using them in their classes, and 3) be either school teachers or part-time instructors. We recruited participants for VIVID without considering teachers’ experience levels, as VIVID is designed to support teachers regardless of their experience. All sessions were carried out via Zoom, and participants were compensated at a rate of 45,000 won per hour (equivalent to 34 USD).

6.1.2 Study Procedure.

The study consisted of three tasks, followed by a post-task survey and interview.
Task 1. Eliciting ambiguous intent for the direct learner design. Participants were asked to convert a challenging section of a lecture into a dialogue that would help vicarious learners better understand the topic. In the VIVID condition, the instructors had to select the specific contents that might be difficult for vicarious learners and convert them into dialogue using the highlighting feature. Specific guidelines on how the highlighting feature would affect the dialogue generation pipeline were not provided. In the Baseline condition, on the other hand, instructors were only asked to choose where to convert, without the highlighting feature. They then wrote about the teaching scenarios they wanted to depict in a dialogue.
Task 2. Comparing and selecting a dialogue to revise. Participants in the VIVID condition referred to the dialogue cards and the rubric to select one of the four dialogues generated in the Initial Generation stage for revision. In the Baseline condition, however, instructors had to choose a dialogue that had been designed without considering the direct learner, and they could not consult rubrics or information about the direct learner when choosing a dialogue.
Task 3. Revising a chosen dialogue. Participants in both conditions could refine the selected dialogue, employing the system’s basic refinement functions. In the VIVID condition, participants could use the laboratory feature (Section 5.3.2) to refine their dialogue.
Post-task survey and interview After completing the tasks with both conditions, participants were asked to fill out a 7-point Likert Scale questionnaire that consists of nine questions to evaluate whether each feature of the system under each condition well reflected the design goals in Section 4.4 for creating quality educational dialogue and whether it produced quality dialogue (Figure 6). We conducted a semi-structured interview to understand participants’ experiences with each system, the generated conversation, and the dialogue authoring experiences.
Figure 6:
Figure 6: Post-task survey results on nine questions regarding task experiences. Each question was evaluated on a 7-point Likert scale. Treatment corresponds to VIVID.

6.2 User Study Results

Despite the overall high utility of the Baseline (Figure 6), nine out of 12 participants found VIVID to be better for designing vicarious dialogues due to its unique features such as rubric, dialogue card, and laboratory features. Notably, instructors considered VIVID to be significantly more helpful than the Baseline in monitoring important factors in dialogue design, as shown in Q9 of Figure 6. However, apart from this, no other significant differences in usefulness were observed.

6.2.1 VIVID helped participants monitor essential considerations when designing conversations.

Participants rated VIVID (M = 6.1, SD = 0.9) as significantly more useful than the Baseline (M = 5.2, SD = 1.3, p = 0.04, Wilcoxon signed-rank test) in helping them persistently monitor key considerations in dialogue design (Q9 in Figure 6). Furthermore, while instructors felt that VIVID (M = 5.5, SD = 1.31) was more useful than the Baseline (M = 4.75, SD = 1.13) for considering specific teaching scenarios when designing dialogues (Q7), the difference was not statistically significant (p = 0.07, Wilcoxon signed-rank test). In terms of satisfaction with dialogue quality (Q8), there was minimal difference between VIVID (M = 5.7, SD = 0.94) and the Baseline (M = 5.6, SD = 0.95). Although VIVID played a significant role in managing the educational dialogue design process, both conditions resulted in similar satisfaction levels due to manual refinement.
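For readers who want to reproduce this style of analysis, the test is a standard paired Wilcoxon signed-rank test; the sketch below uses placeholder ratings, not the study data.

```python
# Sketch of the reported analysis: paired Wilcoxon signed-rank test over
# 7-point Likert ratings from 12 participants. The vectors below are
# placeholders, not the study's actual data.
from scipy.stats import wilcoxon

vivid_q9    = [6, 7, 5, 6, 7, 6, 5, 7, 6, 6, 7, 5]
baseline_q9 = [5, 6, 4, 5, 6, 5, 3, 6, 5, 5, 6, 4]

stat, p = wilcoxon(vivid_q9, baseline_q9)
print(f"W = {stat}, p = {p:.3f}")  # the paper reports p = 0.04 for Q9
```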

6.2.2 VIVID helped instructors simulate a direct learner with diverse levels of understanding.

Although the difference in Q2 (Figure 6), which evaluates how helpful the initially generated dialogue was in considering learners of various knowledge levels, was not significant, VIVID (M = 5.3, SD = 1.6) had a higher average than the Baseline (M = 4.6, SD = 1.56). In addition, some instructors highlighted that VIVID was better than the Baseline at selecting a suitable dialogue by considering the direct learner’s knowledge level for each dialogue. P1 mentioned, “VIVID was more conducive to constructing a lesson script optimized for the target learner as it clearly indicates the learning stage compared to the Baseline.” Furthermore, P4 stated, “VIVID was preferable as it allows selection and refinement according to the learner’s level by showing the rubric, so it was helpful for selecting dialogues with an appropriate difficulty level.” Notably, P5 and P11 mentioned that the understanding level rubric provided with the dialogue cards allowed them to consider the direct learner’s level more specifically when choosing a dialogue.

6.2.3 VIVID’s laboratory feature helped instructors better predict the direct learner’s responses and improve the dialogue’s pedagogical quality.

Eight of the eleven instructors who used the laboratory feature were satisfied with it; one instructor did not use the feature. Some instructors highlighted how this feature positively impacted dialogue quality. We observed that the laboratory feature helped instructors explore the design space of dialogues while considering possible responses from direct learners. P1 said, “Especially regarding the utterances of direct learners, it was difficult for the participants to imagine what questions the learner would ask, but through this feature, I was able to consider various learning situations and learner’s responses that I hadn’t thought of before.” P5 also mentioned, “I could consider answers and questions that the direct learner might have from a wider range of perspectives”.
Table 5:
Criteria | Statement (7-point Likert Scale) | Measuring question (Pairwise Comparison)
Dynamic | SD1. The dialogue demonstrates clear and fast turn-taking. | QD1. Which one demonstrates clearer and faster turn-taking?
Dynamic | SD2. The dialogue utilizes diverse interaction patterns between a tutor and tutees. | QD2. Which one utilizes more diverse interaction patterns between a tutor and a tutee?
Academic Productivity | SAP1. The dialogue encourages the learner’s cognitive engagement (e.g., asking about what they’ve learned, asking various types of questions, and inquiring about a student’s experiences). | QAP1. Which one encourages the learner’s cognitive engagement more?
Academic Productivity | SAP2. The dialogue prompts a student’s metacognitive thinking. | QAP2. Which one prompts a learner’s metacognitive thinking more?
Immersion | SI1. The dialogue appears to describe a specific and natural learning situation. | QI1. Which one describes a more specific and natural learning situation?
Immersion | SI2. The dialogue reveals and addresses a learner’s knowledge deficits clearly. | QI2. Which one reveals and addresses a learner’s knowledge deficits more clearly?
Table 5: Measuring questions used in our expert evaluation of the Initial Generation pipeline and statements used in our human evaluation of the end-to-end pipeline of VIVID to measure the educational quality of designed dialogue.

6.3 Technical Evaluation

To evaluate whether VIVID supports authoring dialogues that meet the design requirements for educational dialogues (Section 4.2), we conducted a technical evaluation focusing on three primary parts: (1) the Initial Generation prompting pipeline, (2) the end-to-end dialogue authoring pipeline, and (3) the Correctness of the final dialogues. For the human evaluation of the two pipelines’ outputs, we invited four instructors who had participated in our user study to evaluate the pedagogical quality of the dialogues using the metrics shown in Table 5.

6.3.1 Initial Generation prompting pipeline evaluation.

We created a test dataset to examine how dialogues are generated through the Initial Generation prompting pipeline, as this pipeline plays the most crucial role in generating quality dialogues. Notably, we aimed to investigate whether language and subject affect the quality of the pipeline’s output, to test the system’s generalizability across subjects and languages.
To do this, we constructed our test dataset from two lecture videos. As our target domain is STEM subjects, we selected a science lecture (on properties of periodic waves) and a mathematics lecture (on linear equations). In addition, to compare across languages, we selected Khan Academy videos with transcripts available in both Korean and English. For each subject, we selected one segment of approximately 2-3 minutes for dialogue generation. We generated 32 dialogues: 16 Baseline evaluation dialogues (8 in Korean and 8 in English) and 16 VIVID evaluation dialogues (8 in Korean and 8 in English). The detailed test dataset generation process and dialogue examples are in the Appendix.
Two evaluators assessed the Korean dialogues, while the other two, who are proficient in reading and listening in English, assessed the English dialogues, using the evaluation metrics in the Measuring questions column of Table 5. Each evaluator conducted a pairwise comparison on a set of 32 pairs of dialogues and was asked to choose between the dialogue generated by VIVID and the one generated by the Baseline. We then calculated the percentage of times each condition was preferred to provide a comprehensive comparison of the two systems.
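As an illustration of this aggregation step, the sketch below computes preference percentages from recorded pairwise choices; the data layout and example values are our assumptions, not the authors’ analysis code.

```python
# Minimal sketch: aggregate pairwise-comparison choices into preference
# percentages per condition. The choice records are hypothetical placeholders.
from collections import Counter

# One label per evaluated pair: which dialogue the evaluator preferred.
choices = ["VIVID", "VIVID", "Baseline", "VIVID", "VIVID", "Baseline"]

counts = Counter(choices)
total = len(choices)
for condition in ("VIVID", "Baseline"):
    pct = 100 * counts[condition] / total
    print(f"{condition}: {pct:.1f}% preferred")
```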

6.3.2 End-to-end dialogue authoring pipeline evaluation.

We assessed the final dialogues in two ways. Firstly, we compared Likert scores to determine which condition produced more Dynamic, Academically Productive, and Immersive dialogues. Secondly, we compared the percentage of incorrect responses in each dialogue to evaluate changes in correctness before and after the instructors’ refinement and between the two conditions.
Dynamic, Academic Productivity, and Immersion evaluation of dialogues authored with VIVID. We collected expert evaluations of 20 dialogues designed during the user study, ten from the Baseline and ten from VIVID. Each dialogue was evaluated by three or four evaluators, as evaluators did not assess the dialogues they had designed themselves. The evaluators used the metrics in the Statement column of Table 5, consisting of six 7-point Likert-scale questions.
Correctness evaluation of authored dialogues in both conditions. During the Refinement stage, instructors were allowed to make direct modifications. To validate our approach, we conducted an evaluation study with four instructors using 48 dialogues from our user study: 24 generated before the instructors’ direct modifications and 24 created after. Two instructors evaluated the same dataset, each covering their respective teaching subjects. We compared the two sets of dialogues to identify how our approach improved Correctness.
Based on definitions of typical hallucination [77, 78], we classified three types of incorrectness that may occur in educational dialogues, assessed on a turn-by-turn basis: (1) incorrect transformation, where original numbers or explanations in the transcript were transformed incorrectly; (2) question-answer inconsistency, where the answer deviates from the student’s question (e.g., the student asks about a logarithmic function but the teacher answers about an exponential function); and (3) cross-turn inconsistency, observed across multiple turns (e.g., inconsistency in the student’s knowledge level).
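The per-dialogue incorrectness rate used in Section 6.4.2 follows directly from this turn-level annotation. The sketch below illustrates that bookkeeping under assumed field names; it is not the annotation tooling used in the study.

```python
# Minimal sketch of the turn-by-turn correctness bookkeeping described above.
# Field names and bucketing threshold mirror the text; the types are ours.
from dataclasses import dataclass

@dataclass
class TurnAnnotation:
    transformed_incorrectly: bool  # type (1): numbers/explanations transformed wrong
    answer_off_question: bool      # type (2): answer deviates from the student's question
    cross_turn_inconsistent: bool  # type (3): inconsistency across multiple turns

def incorrectness_rate(turns: list[TurnAnnotation]) -> float:
    """Percentage of turns flagged with at least one error type."""
    flagged = sum(
        t.transformed_incorrectly or t.answer_off_question or t.cross_turn_inconsistent
        for t in turns
    )
    return 100 * flagged / len(turns)

def in_low_error_bucket(turns: list[TurnAnnotation]) -> bool:
    """A dialogue falls in the 0-10% bucket used in Figure 8 if its rate is <= 10."""
    return incorrectness_rate(turns) <= 10
```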
Figure 7:
Figure 7: Results of authored dialogues’ quality. ****, ***, **, *, and ns indicate p ≤ 0.0001, 0.0001 < p ≤ 0.001, 0.001 < p ≤ 0.01, 0.01 < p ≤ 0.05, and p > 0.05, respectively.

6.4 Technical Evaluation Results

The technical evaluation showed that instructors designed significantly higher-quality educational dialogues using VIVID compared to the Baseline on all criteria except SD1 (Figure 7). We also found that the Initial Generation stage produces better educational dialogues than the Baseline, with the exception of QD1 (Figure 9). However, as reported in Section 6.2, instructors did not rate the usefulness of individual system features as significantly higher; we discuss possible reasons for this gap between perceived usefulness and dialogue quality in Section 7.1.

6.4.1 Dynamic, Academic Productivity, and Immersion evaluation of dialogues authored with VIVID.

Technical evaluation results showed that instructors created significantly better educational dialogues with VIVID than with the Baseline. As shown in Figure 7, the dialogues designed through the entire VIVID pipeline were rated significantly higher in quality on all aspects except SD1. The largest differences appeared in SD2 (VIVID: M = 5.11, SD = 1.45; Baseline: M = 3.4, SD = 1.73), SAP1 (VIVID: M = 5.47, SD = 1.13; Baseline: M = 3.7, SD = 1.9), and SAP2 (VIVID: M = 5.64, SD = 1.4; Baseline: M = 3.3, SD = 2.0; p ≤ 0.0001, Wilcoxon signed-rank test). In other words, dialogues authored with the end-to-end VIVID pipeline better described the metacognitive and cognitive activities of direct learners and consisted of more diverse patterns than the Baseline’s. Differences in SI2 (p ≤ 0.001) and SI1 (p ≤ 0.01) were also significant, implying that dialogues authored with VIVID described a more natural learning situation and the direct learner’s knowledge deficits better than the Baseline.

6.4.2 Correctness evaluation of authored dialogues in both conditions.

We analyzed the percentage of turns with errors in each dialogue. As shown in Figure 8, after modification, the proportion of dialogues with a 0-10% incorrectness rate increased from 71% (17 of 24) to 92% (22 of 24). Before modification, VIVID generated more incorrect dialogues than the Baseline because VIVID had to account for more details about the direct learner’s understanding state during generation. After modification, the percentage of VIVID dialogues in the 0-10% range increased from 67% (8) to 92% (11), while the Baseline’s increased from 75% (9) to 92% (11) (Figure 10). These results indicate that VIVID’s correctness improved more than the Baseline’s, and that the instructors’ refinement produced highly correct final dialogues in both conditions. Additionally, we calculated the percentage of incorrect dialogues attributable to transcript errors (e.g., the absence of essential conditions like ‘x < 0’, or an incorrect concept definition). Before the instructors made corrections, 25% of all dialogues contained errors caused by incorrect transcripts; even after refinement, around 17% persisted, particularly due to missing essential conditions in math dialogues.
Figure 8:
Figure 8: Human evaluation results on Correctness. The figure illustrates how the instructor’s refinement has affected the correctness.
Figure 9:
Figure 9: Human evaluation results on the six questions listed in Table 5. Four instructors evaluated dialogue sets, and each instructor conducted a pairwise comparison on a set of 32 pairs of dialogues. Except for QD1, VIVID outperformed the Baseline in the other five metrics.
Figure 10:
Figure 10: Human evaluation results on Correctness by condition. The figure shows larger changes for VIVID and the high correctness of final dialogues in both conditions. The questionnaire used in this evaluation is from Table 5.

6.4.3 Initial Generation pipeline evaluation.

The dialogues generated by VIVID’s Initial Generation pipeline were rated higher in quality than the corresponding Baseline dialogues on all metrics listed in Table 5 except QD1. As in the end-to-end pipeline evaluation (Section 6.4.1), the largest difference (Baseline: 5.5%, VIVID: 86.7%) was on QAP2 (Table 5), indicating that VIVID generates initial dialogues that effectively reflect a direct learner’s metacognitive activity (Figure 9). Figure 9 also shows that QD2, QAP1, QI1, and QI2 each had over a 50-percentage-point preference difference in favor of VIVID, whereas QD1 favored the Baseline (Baseline: 71.01%, VIVID: 14.1%). We discuss the issue of poor quality on QD1 in Section 7.1.
Figure 11:
Figure 11: Results of dialogue quality by subject (Science, Math). The questions used are listed in Table 5. Except for QD1, VIVID outperformed the Baseline in the other five metrics.
Dialogue quality difference by subject (Science, Math). As shown in Figure 11, the largest difference between the Baseline (2%) and VIVID (86%) in science dialogues was on the QD2 criterion, which concerns diverse interaction patterns. This suggests that VIVID effectively utilizes diverse dialogue patterns between the tutor and the direct learner, regardless of the language used.
For math videos, the evaluation metric with the largest difference between the Baseline (5%) and VIVID (84%) was QAP2 (Table 5). This indicates that VIVID is particularly effective at designing dialogues that encourage metacognitive speech from a direct learner, regardless of the language used. QAP1 (Baseline: 13%, VIVID: 84%) and QI1 (Baseline: 8%, VIVID: 80%) showed smaller, but still substantial, differences. In both subjects, evaluators preferred the dialogues generated by the Baseline on QD1 (Table 5), the turn-taking criterion most affected by verbosity.
Dialogue quality difference by language (English, Korean). As shown in Figure 12, QAP2 (Table 5) had the largest difference between the Baseline (3%) and VIVID (91%) in English. This suggests that, regardless of subject, VIVID effectively created dialogues in which the instructor elicits metacognitive activities from the learner when converting English lectures into English dialogues. However, the Baseline outperformed VIVID on QD1 both in English (Baseline: 41%, VIVID: 14%) and in Korean (Baseline: 50%, VIVID: 4%), consistent with the by-subject results.
When converting Korean lectures into Korean dialogues, VIVID showed the largest contrast on the QI2 criterion (Baseline: 6%, VIVID: 92%), while QAP1 had the smallest difference (Baseline: 31%, VIVID: 51%). This indicates that, regardless of subject, VIVID effectively represented the learner’s knowledge gaps directly and clearly and depicted the process of addressing these difficulties in the dialogue.
Figure 12:
Figure 12: Results of generated dialogue quality by language (Korean, English). The questions used are listed in Table 5. Except for QD1, VIVID outperformed the Baseline in the other five metrics.

7 Discussion

In this section, we discuss how to improve explainability, controllability, and verbosity for better utility; VIVID’s potential beyond lecture videos; its customizability for learners; and its generalizability.

7.1 Human-AI Interaction Design for VIVID

The technical evaluation showed that dialogues created with VIVID were significantly better than the Baseline’s on five of six criteria (Figure 7), the exception being fast turn-taking. Similarly, the quality of VIVID-generated dialogues in the Initial Generation phase was rated significantly higher. However, instructors did not perceive VIVID as significantly more efficient or useful than the Baseline (Figure 6).
Despite the positive results for VIVID, the perceived usefulness of individual system features was relatively low among instructors (Figure 6). We attribute this to two factors: (1) the low explainability and informativeness of the dialogue design conveyed by the highlighting feature and dialogue cards, and (2) the low controllability of the laboratory feature. Accordingly, we suggest three improvements:
Enhancing Explainability: The highlighting feature and dialogue cards in VIVID need to offer greater explainability to instructors. P7 noted that knowing each feature’s exact functionality in advance could have led to more frequent and appropriate usage, and potentially higher satisfaction with the system. Notably, there is a need to investigate what kinds of information instructors require to effectively discern diversity among learners and pedagogical dialogue patterns: we observed substantial differences between instructors in their ability to recognize differences in direct learners’ understanding levels and in how those differences are reflected in the dialogue structure.
Providing Fine-grained Controllability: Enhancing controllability and supporting granular modifications in the laboratory feature could improve instructors’ workflow. In our user study, instructors exhibited varying expectations of the modified versions offered by the laboratory and tended to rate usability lower when those expectations were not met. An improved laboratory feature could help instructors determine and express what they expect in revised versions of the dialogue. For instance, letting instructors select elements, such as alternative examples, questions, or versions with added prior knowledge, with interactive guidance could increase the feature’s perceived usefulness.
Improving Verbosity: One unexpected downside was that the generated dialogues were perceived as verbose, likely due to LLMs’ tendency to produce long text. This issue could be addressed by revising the prompting pipeline to limit the length of generated utterances and dialogues, as sketched below; we leave this as future work.
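As a hypothetical illustration of such a revision, one could append explicit length constraints to the generation prompt; the template and limits below are our sketch, not VIVID’s actual prompt.

```python
# Hypothetical sketch of a verbosity constraint added to the dialogue-generation
# prompt. The wording and limits are illustrative, not VIVID's actual prompt.
MAX_WORDS_PER_UTTERANCE = 40
MAX_TURNS = 12

def build_length_constrained_prompt(base_prompt: str) -> str:
    """Append an explicit length constraint to an existing generation prompt."""
    constraint = (
        f"Keep each utterance under {MAX_WORDS_PER_UTTERANCE} words and the "
        f"whole dialogue under {MAX_TURNS} turns. Prefer short, fast "
        "turn-taking over long explanations."
    )
    return base_prompt + "\n\n" + constraint
```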

7.2 Potential Applications beyond the Video Lecture Context

In educational contexts, dialogues can serve multiple roles beyond the mere transmission of factual knowledge. In our user study, several instructors highlighted the adaptability of our dialogue design pipeline, suggesting potential applications across diverse learning contexts, instructional materials, and learning stages. For instance, P7 proposed using the pipeline to create dialogues for learners’ review, or to diagnose learners’ misconceptions by presenting a dialogue in which the direct learner voices those misconceptions.
Furthermore, VIVID and its process of transforming lectures into a dyadic format may serve as a valuable active learning tool. Our dialogue design pipeline can be used to formulate questions in dialogue format and to provide interactive guidance for students’ self-directed learning with digital textbooks or in flipped learning settings. Learners can gain a better understanding of complex concepts by analyzing educational content and exploring effective teaching strategies.

7.3 Customizable VIVID for learners

VIVID supports instructors in transforming their lecture videos into educational dialogues in text format. Yet it is important to consider how these dialogues can be seamlessly incorporated into the video learning environment (VLE) to enrich learners’ experiences and optimize learning outcomes. Text-format dialogue can be integrated into the VLE by delivering it in voice and text modes together, exploiting the VLE’s multi-modality. For instance, the dialogue can be converted into human-like speech and played alongside the corresponding lecture clip, replacing the original explanations. Furthermore, vicarious learners can simultaneously explore multimodal dialogue, incorporating the lecture’s formulas, within a chat-like interface, as sketched below.
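One possible packaging of such multimodal delivery is sketched below, pairing each dialogue turn with synthesized speech and the timestamp of the lecture segment it replaces; the types and the synthesize stub are assumptions, as the paper does not prescribe a TTS engine or player.

```python
# Hypothetical sketch of packaging a text dialogue for multimodal playback in
# a video learning environment. Types and the synthesize() stub are assumptions.
from dataclasses import dataclass

@dataclass
class DialogueTurn:
    speaker: str  # "instructor" or "direct learner"
    text: str     # may include formulas for the chat-like view

@dataclass
class PlayableTurn:
    turn: DialogueTurn
    audio_path: str          # synthesized speech for this utterance
    segment_start_s: float   # start of the lecture segment this dialogue replaces

def synthesize(text: str, voice: str) -> str:
    """Stub for a text-to-speech engine; returns a path to an audio file."""
    raise NotImplementedError("plug in a TTS engine here")

def render(dialogue: list[DialogueTurn], segment_start_s: float) -> list[PlayableTurn]:
    """Attach audio and timing to each turn so a player can replace the clip."""
    return [
        PlayableTurn(t, synthesize(t.text, voice=t.speaker), segment_start_s)
        for t in dialogue
    ]
```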
While VIVID, designed for instructors, relies on data about the vicarious learner group’s levels as considered by instructors, it is limited in incorporating teaching strategies, such as transfer learning (DR2 in Section 4.2) and dialogue personalization, that demand each vicarious learner’s data, including prior knowledge, personal background, and current understanding state. We believe VIVID can be extended to collect data from vicarious learners through a multi-modal representation of vicarious dialogue. This would enable customized modeling of direct learners, effective transfer learning, and personalization for vicarious learning. For instance, data for generating personalized dialogue could be collected by asking learners to click on challenging elements, such as formulas or explanations, as they watch a lecture. Future work should therefore extend VIVID to include learners and evaluate dialogues using learner-centered criteria, such as engagement and learning gains.

7.4 Generalizability of VIVID

Even when different lectures cover the same concept, variables such as material modality, delivery style, and language affect how a learner perceives and understands new knowledge. We found that instructors tend to adjust the dialogue to fit their own teaching style when the style in the lecture differs from their preference. To let instructors use lecture videos of any teaching style and match them with their intended outcomes, a dialogue conversion solution should include a preprocessing step for the script before the Initial Generation phase. To design dialogues from lectures with varying teaching styles, VIVID would need to preprocess the lecture material to isolate core concepts, understand the instructor’s intention, and transform the knowledge into a personalized format matching the user’s preferred teaching style; a sketch of this idea follows.
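The sketch below illustrates one way such a preprocessing prompt could look ahead of the Initial Generation phase; the wording and names are hypothetical, not part of VIVID’s pipeline.

```python
# Hypothetical sketch: a preprocessing prompt that normalizes a lecture script
# before the Initial Generation phase. Wording is illustrative only.
PREPROCESS_PROMPT = """\
You are preparing a lecture transcript for dialogue conversion.
1. List the core concepts the lecture teaches.
2. Summarize the instructor's pedagogical intention in one sentence.
3. Rewrite the transcript in a {style} teaching style, keeping all
   formulas and worked examples intact.

Transcript:
{transcript}
"""

def build_preprocess_prompt(transcript: str, style: str) -> str:
    """Fill the template with a transcript and the user's preferred style."""
    return PREPROCESS_PROMPT.format(style=style, transcript=transcript)
```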
Moreover, it is important to determine which lecture segments and lengths are suitable for a dialogue style. As P3 noted, certain content or subjects may be more suitable for dialogue formats, helping learners better understand relatively complex concepts or examples. Furthermore, our technical evaluation showed that dialogue generation improved to varying degrees depending on the subject matter. Thus, the advantages of the dialogue format can be enhanced by understanding these subject-dependent effects and reflecting them in dialogue design.

7.5 Limitations and Future Work

We acknowledge several limitations of our current study. First, the direct learner’s knowledge progression in the dialogue was not always monotonic in VIVID: because VIVID did not consider prerequisite relationships when creating diverse dialogues (Section 5.1.2), some dialogues depicted direct learners initially understanding a concept but later appearing to lack that understanding. The knowledge state setting pipeline therefore needs to be redesigned to maintain consistent knowledge levels and prevent reverse progression. Second, our experiments involved instructors designing dialogues for only a single segment within a lecture, yet the generated dialogues are influenced by factors such as the length of the selected segment, the type of content, and the subject. Exploring VIVID’s use cases more deeply will require experiments under a more diverse set of conditions.

8 Conclusion

We present design recommendations derived from an extensive literature review and insights gathered during a design workshop, aimed at facilitating the creation of high-quality educational dialogues. To put these guidelines into practice, we developed VIVID, a web application that assists instructors in authoring pedagogical dialogues from their monologue-style lecture videos. Through our technical evaluation and user study, we found that instructors can effectively consider the important factors in dialogue design, generating Dynamic, Academically Productive, Immersive, and Correct dialogues. We hope VIVID helps create more engaging lecture videos, providing a personalized learning experience for online students.

A Workshop Details

A.1 Subjects and Lesson Content Used In Workshop

Table 6:
Group | Subject | Main lecture content
1 (P1, P2) | Math | Composite functions and inverse functions
2 (P3, P4) | Science | Phases of the moon and the reasons behind these lunar phases
3 (P5, P6) | Math | Concept of unit vectors and their alignment with a given vector
4 (P7, P8) | Science | Einstein’s General Theory of Relativity, covering concepts such as the warping of spacetime by massive celestial bodies, gravitational lensing, time dilation due to gravity, and phenomena associated with black holes
5 (P9, P10) | Math | Concepts of radical (square root) functions and rational (fractional) functions
6 (P11, P12) | Math | Process of transforming a quadratic equation into a perfect square trinomial and then using square roots to find the solutions
7 (P13, P14) | Math | Method of expressing a third line passing through the intersection of two given lines and determining the equation of a line, even with an unknown slope, passing through a specified point
8 (P15) | Math | Classification of integers based on the remainders when divided by a positive integer
Table 6: Subjects and lesson contents of the lectures addressed by each group in the workshop.
Table 6 lists the subjects and lesson contents used in our workshop.

A.2 Teaching Strategies for Designing Pedagogically Effective Dialogue.

Table 7:
Initiative | Key strategy effective for vicarious learners | Description | Example dialogue between a tutor and a tutee
Tutor | Cognitive conflict | A teaching strategy that examines the learner’s prior knowledge, creates a mismatch situation that causes conflict, and then helps the learner see that his or her understanding is incorrect. | Tutor: “Absolutely, that’s the usual method. But let me throw a curveball. What if I told you that solving them using a different approach might lead to a different solution?” / Tutee: “Really? I thought there was only one way to solve equations.” / Tutor: “That’s what we’re here to explore! Let’s try this. Instead of isolating x right away, …”
Tutor | Metacognitive prompting | Orients learners towards higher-level strategies (e.g., goal-setting, planning, monitoring, evaluation, reflection). It includes an instructor’s utterances that encourage the learner to express their current level of understanding or articulate their thought process. | Tutor: “Got it. How about we take a slightly different approach this time? Before you jump into solving, let’s start by identifying what the problem is asking. Can you read the question and tell me what this question is requesting?” / Tutee: “Sure. It’s asking me to solve for the sum of ‘x’ and ‘y’ in the equation.” / Tutor: “That’s a nice interpretation, but let’s take a closer look.”
Tutor | Cognitive prompting | Engages learners in lower-level strategies (e.g., organization, rehearsal, elaboration). It includes the instructor’s utterances that prompt the learner to talk about what they are learning or that draw out the learner’s prior knowledge and personal experiences. | Tutor: “As you work through an equation, think about the basic operations you’ve learned. Can you explain how these operations are helping you manipulate this equation?” / Tutee: “Sure. When there’s addition on one side, I subtract to balance it out. And if it’s multiplication, I divide to get ‘x’ by itself.”
Tutee | Spontaneous deep-level reasoning question | Refers to starting a conversation where the learner spontaneously asks deep-level reasoning questions that help them better understand and engage in critical thinking. | Tutee: “How can a manufacturer increase the speed of the computer? What can they do to make it faster?” / Tutor: “Well, one thing manufacturers do is increase the clock speed of the computer.”
Table 7: Four teaching strategies for pedagogically effective dialogue: Cognitive conflict, Metacognitive prompting, Cognitive prompting, and Spontaneous deep-level reasoning question.
Table 7 summarizes the teaching strategies for designing pedagogically effective dialogue.

B Dialogue Generation Examples

B.1 Evaluation Dataset Generation Process

Figure 13:
Figure 13: Evaluation Dataset Generation Process.
Figure 13 illustrates the evaluation dataset generation process.

B.2 Transcript Example

Figure 14:
Figure 14: A transcript used to generate dialogue data for the technical evaluation study (Figures 15, 16, 17, 18). This English transcript is from a physics lecture used in the technical evaluation study; the lecture provides both Korean and English scripts, enabling dialogue generation in both languages. The part highlighted in green was marked by the authors as potentially difficult for vicarious learners to understand when generating dialogue data.
The green section indicates the area highlighted by the authors as potentially challenging for vicarious learners to understand; VIVID creates a direct learner who lacks knowledge of this green area.

B.3 Example 1

Figure 15:
Figure 15: An example from the technical evaluation dataset. This dialogue is one of the 16 examples generated by VIVID to evaluate the Initial Generation phase (Section 5.1) in the technical evaluation study (Section 6.3). It is based on the English script of the first physics lecture.
Figure 16:
Figure 16: An example from the technical evaluation dataset. This dialogue is one of the 16 examples generated by the Baseline to evaluate the Initial Generation phase (Section 5.1) in the technical evaluation study (Section 6.3). It is based on the English script of the first physics lecture.
Figure 15 and Figure 16 are two dialogue examples generated from the English script of the physics lecture.

B.4 Example 2

Figure 17 and Figure 18 are two dialogue examples generated from the Korean script of the physics lecture; we translated them into English, as the output was in Korean.
Figure 17:
Figure 17: An example from the technical evaluation dataset. This dialogue is one of the 16 examples generated by VIVID to evaluate the Initial Generation phase (Section 5.1) in the technical evaluation study (Section 6.3). It is based on the Korean script of the first physics lecture; the example shown is a translation of a dialogue generated in Korean.
Figure 18:
Figure 18: An example from the technical evaluation dataset. This dialogue is one of the 16 examples generated by the Baseline to evaluate the Initial Generation phase (Section 5.1) in the technical evaluation study (Section 6.3). It is based on the Korean script of the first physics lecture; the example shown is a translation of a dialogue generated in Korean.


References

[1]
John R Anderson, C Franklin Boyle, and Brian J Reiser. 1985. Intelligent tutoring systems. Science 228, 4698 (1985), 456–462.
[2]
Albert Bandura and Richard H Walters. 1977. Social learning theory. Vol. 1. Englewood cliffs Prentice Hall.
[3]
Garima Bansal. 2018. Teacher discursive moves: Conceptualising a schema of dialogic discourse in science classrooms. International Journal of Science Education 40, 15 (2018), 1891–1912.
[4]
Mobina Beheshti, Ata Taspolat, Omer Sami Kaya, and Hamza Fatih Sapanca. 2018. Characteristics of instructional videos. World Journal on Educational Technology: Current Issues 10, 1 (2018), 61–69.
[5]
Kristy Boyer, Robert Phillips, Michael Wallis, Mladen Vouk, and James Lester. 2008. Learner characteristics and feedback in tutorial dialogue. In Proceedings of the Third Workshop on Innovative Use of NLP for Building Educational Applications. 53–61.
[6]
Kristy Elizabeth Boyer, Eunyoung Ha, Michael D Wallis, Robert Phillips, Mladen A Vouk, and James C Lester. 2009. Discovering Tutorial Dialogue Strategies with Hidden Markov Models. In AIED. 141–148.
[7]
Kristy Elizabeth Boyer, Robert Phillips, Michael Wallis, Mladen Vouk, and James Lester. 2008. Balancing cognitive and motivational scaffolding in tutorial dialogue. In Intelligent Tutoring Systems: 9th International Conference, ITS 2008, Montreal, Canada, June 23-27, 2008 Proceedings 9. Springer, 239–249.
[8]
H David Brecht. 2012. Learning from online video lectures. Journal of Information Technology Education. Innovations in Practice 11 (2012), 227.
[9]
Karin Brodie. 2011. Working with learners’ mathematical thinking: Towards a language of description for changing pedagogy. Teaching and Teacher Education 27, 1 (2011), 174–186.
[10]
Whitney L Cade, Jessica L Copeland, Natalie K Person, and Sidney K D’Mello. 2008. Dialogue modes in expert tutoring. In Intelligent Tutoring Systems: 9th International Conference, ITS 2008, Montreal, Canada, June 23-27, 2008 Proceedings 9. Springer, 470–479.
[11]
Michelene Chi, Seokmin Kang, and David Yaghmourian. 2016. Why Students Learn More From Dialogue- Than Monologue-Videos: Analyses of Peer Interactions. Journal of the Learning Sciences 26 (06 2016). https://doi.org/10.1080/10508406.2016.1204546
[12]
Michelene Chi, Marguerite Roy, and Robert Hausmann. 2008. Observing Tutorial Dialogues Collaboratively: Insights About Human Tutoring Effectiveness From Vicarious Learning. Cognitive science 32 (03 2008), 301–41. https://doi.org/10.1080/03640210701863396
[13]
Michelene TH Chi, Stephanie A Siler, Heisawn Jeong, Takashi Yamauchi, and Robert G Hausmann. 2001. Learning from human tutoring. Cognitive science 25, 4 (2001), 471–533.
[14]
Christine Chin. 2006. Classroom interaction in science: Teacher questioning and feedback to students’ responses. International journal of science education 28, 11 (2006), 1315–1346.
[15]
Konstantinos Chorianopoulos. 2018. A taxonomy of asynchronous instructional video styles. The International Review of Research in Open and Distributed Learning 19, 1 (2 2018). https://doi.org/10.19173/irrodl.v19i1.2920
[16]
Scotty D Craig, Barry Gholson, Joshua K Brittingham, Joah L Williams, and Keith T Shubeck. 2012. Promoting vicarious learning of physics using deep questions with explanations. Computers & Education 58, 4 (2012), 1042–1048.
[17]
Scotty D Craig, Jeremiah Sullins, Amy Witherspoon, and Barry Gholson. 2006. The deep-level-reasoning-question effect: The role of dialogue and deep-level-reasoning questions during vicarious learning. Cognition and Instruction 24, 4 (2006), 565–591.
[18]
Paul Denny, Hassan Khosravi, Arto Hellas, Juho Leinonen, and Sami Sarsa. 2023. Human vs Machine: Comparison of Student-generated and AI-generated Educational Content. arXiv preprint arXiv:2306.10509 (2023).
[19]
Lu Ding, Katelyn Cooper, Michelle Stephens, Michelene Chi, and Sara Brownell. 2021. Learning from error episodes in dialogue-videos: The influence of prior knowledge. Australasian Journal of Educational Technology (03 2021), 20–32. https://doi.org/10.14742/ajet.6239
[20]
Sidney D’Mello, Blair Lehman, and Natalie Person. 2010. Expert tutors feedback is immediate, direct, and discriminating. In Twenty-Third International FLAIRS Conference.
[21]
David M Driscoll, Scotty D Craig, Barry Gholson, Matthew Ventura, Xiangen Hu, and Arthur C Graesser. 2003. Vicarious learning: Effects of overhearing dialog and monologue-like discourse in a virtual tutoring session. Journal of Educational Computing Research 29, 4 (2003), 431–450.
[22]
David M Driscoll, Scotty D Craig, Barry Gholson, Matthew Ventura, Xiangen Hu, and Arthur C Graesser. 2003. Vicarious learning: Effects of overhearing dialog and monologue-like discourse in a virtual tutoring session. Journal of Educational Computing Research 29, 4 (2003), 431–450.
[23]
Ilana Dubovi and Victor R Lee. 2019. Instructional support for learning with agent-based simulations: A tale of vicarious and guided exploration learning approaches. Computers & Education 142 (2019), 103644.
[24]
Matthew S Ellman and Michael L Schwartz. 2016. Article commentary: Online learning tools as supplements for basic and clinical science education. Journal of medical education and curricular development 3 (2016), JMECD–S18933.
[25]
Ethan Fast, Binbin Chen, Julia Mendelsohn, Jonathan Bassen, and Michael S Bernstein. 2018. Iris: A conversational agent for complex tasks. In Proceedings of the 2018 CHI conference on human factors in computing systems. 1–12.
[26]
Mary Forehand. 2010. Bloom’s taxonomy. Emerging perspectives on learning, teaching, and technology 41, 4 (2010), 47–56.
[27]
Barry Gholson, Amy Witherspoon, Brent Morgan, Joshua K Brittingham, Robert Coles, Arthur C Graesser, Jeremiah Sullins, and Scotty D Craig. 2009. Exploring the deep-level reasoning questions effect during vicarious learning among eighth to eleventh graders in the domains of computer literacy and Newtonian physics. Instructional Science 37 (2009), 487–493.
[28]
Arthur C Graesser and Sidney D’Mello. 2012. Emotions during the learning of difficult material. In Psychology of learning and motivation. Vol. 57. Elsevier, 183–225.
[29]
Arthur C Graesser, Shulan Lu, George Tanner Jackson, Heather Hite Mitchell, Mathew Ventura, Andrew Olney, and Max M Louwerse. 2004. AutoTutor: A tutor with dialogue in natural language. Behavior Research Methods, Instruments, & Computers 36 (2004), 180–192.
[30]
Arthur C Graesser, Katja Wiemer-Hastings, Peter Wiemer-Hastings, and Roger Kreuz. 1999. AutoTutor: A simulation of a human tutor. Cognitive Systems Research 1, 1 (1999), 35–51.
[31]
Joshua Grossman, Zhiyuan Lin, Hao Sheng, Johnny Tian-Zheng Wei, Joseph J Williams, and Sharad Goel. 2019. MathBot: Transforming online resources for learning math into conversational interactions. AAAI 2019 Story-Enabled Intelligence (2019).
[32]
Gregory Hume, Joel Michael, Allen Rovick, and Martha Evens. 1996. Hinting as a tactic in one-on-one tutoring. The Journal of the Learning Sciences 5, 1 (1996), 23–47.
[33]
Stanley D Ivie. 1998. Ausubel’s learning theory: An approach to teaching higher order thinking skills. The High School Journal 82, 1 (1998), 35–42.
[34]
Petra Kranzfelder, Jennifer L Bankers-Fulbright, Marcos E García-Ojeda, Marin Melloy, Sagal Mohammed, and Abdi-Rizak M Warfa. 2019. The Classroom Discourse Observation Protocol (CDOP): A quantitative method for characterizing teacher discourse moves in undergraduate STEM learning environments. PloS one 14, 7 (2019), e0219019.
[35]
James A Kulik and JD Fletcher. 2016. Effectiveness of intelligent tutoring systems: a meta-analytic review. Review of educational research 86, 1 (2016), 42–78.
[36]
Jihyun Lee, Cheolil Lim, and Hyeonsu Kim. 2017. Development of an instructional design model for flipped learning in higher education. Educational Technology Research and Development 65 (2017), 427–453.
[37]
Ken Jen Lee, Apoorva Chauhan, Joslin Goh, Elizabeth Nilsen, and Edith Law. 2021. Curiosity notebook: the design of a research platform for learning by teaching. Proceedings of the ACM on Human-Computer Interaction 5, CSCW2 (2021), 1–26.
[38]
Yoonjoo Lee, John Joon Young Chung, Tae Soo Kim, Jean Y Song, and Juho Kim. 2022. Promptiverse: Scalable generation of scaffolding prompts through human-AI hybrid knowledge graph annotation. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–18.
[39]
Blair Lehman, Sidney D’Mello, and Art Graesser. 2012. Confusion and complex learning during interactions with computer learning environments. The Internet and Higher Education 15, 3 (2012), 184–194.
[40]
Juho Leinonen, Paul Denny, Stephen MacNeil, Sami Sarsa, Seth Bernstein, Joanne Kim, Andrew Tran, and Arto Hellas. 2023. Comparing code explanations created by students and large language models. arXiv preprint arXiv:2304.03938 (2023).
[41]
Brett LM Levy, Ebony Elizabeth Thomas, Kathryn Drago, and Lesley A Rex. 2013. Examining studies of inquiry-based learning in three fields of education: Sparking generative conversation. Journal of teacher education 64, 5 (2013), 387–408.
[42]
Diane Litman and Kate Forbes-Riley. 2006. Correlations between dialogue acts and learning in spoken tutoring dialogues. Natural Language Engineering 12, 2 (2006), 161–176.
[43]
Thomas Lord and Sandhya Baviskar. 2007. Moving students from information recitation to information understanding-Exploiting Bloom’s Taxonomy in creating science questions. Journal of College Science Teaching 36, 5 (2007), 40.
[44]
Xin Lu, Barbara Di Eugenio, Trina C Kershaw, Stellan Ohlsson, and Andrew Corrigan-Halpern. 2007. Expert vs. non-expert tutoring: Dialogue moves, interaction patterns and multi-utterance turns. In Computational Linguistics and Intelligent Text Processing: 8th International Conference, CICLing 2007, Mexico City, Mexico, February 18-24, 2007. Proceedings 8. Springer, 456–467.
[45]
Xinyi Lu, Simin Fan, Jessica Houghton, Lu Wang, and Xu Wang. 2023. ReadingQuizMaker: A Human-NLP Collaborative System that Supports Instructors to Design High-Quality Reading Quiz Questions. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–18.
[46]
Sue Lyle. 2008. Dialogic teaching: Discussing theoretical contexts and reviewing evidence from classroom practice. Language and education 22, 3 (2008), 222–240.
[47]
Wenting Ma, Olusola O Adesope, John C Nesbit, and Qing Liu. 2014. Intelligent tutoring systems and learning outcomes: A meta-analysis. Journal of Educational Psychology 106, 4 (2014), 901.
[48]
Roberto Martinez-Maldonado, Andrew Clayphan, Kalina Yacef, and Judy Kay. 2014. MTFeedback: providing notifications to enhance teacher awareness of small group work in the classroom. IEEE Transactions on Learning Technologies 8, 2 (2014), 187–200.
[49]
J Terry Mayes. 2015. Still to learn from vicarious learning. E-learning and digital media 12, 3-4 (2015), 361–371.
[50]
Antonija Mitrovic, Stellan Ohlsson, and Devon K Barrow. 2013. The effect of positive feedback in a constraint-based intelligent tutoring system. Computers & Education 60, 1 (2013), 264–272.
[51]
David Moher, Alessandro Liberati, Jennifer Tetzlaff, and Douglas Altman. 2009. Preferred Reporting Items for Systematic Reviews and Meta-Analyses: the PRISMA statement. Br Med J 8 (07 2009), 336–341. https://doi.org/10.1371/journal.pmed.1000097
[52]
Heli Muhonen, Eija Pakarinen, Anna-Maija Poikkeus, Marja-Kristiina Lerkkanen, and Helena Rasku-Puttonen. 2018. Quality of educational dialogue and association with students’ academic performance. Learning and Instruction 55 (2018), 67–79.
[53]
Heli Muhonen, Helena Rasku-Puttonen, Eija Pakarinen, Anna-Maija Poikkeus, and Marja-Kristiina Lerkkanen. 2016. Scaffolding through dialogic teaching in early school classrooms. Teaching and teacher education 55 (2016), 143–154.
[54]
Kasia Muldner, Keith Dybvig, Rachel Lam, and Michelene Chi. 2011. Learning by observing tutorial dialogue versus monologue collaboratively or alone. Proceedings of the 33rd Annual Conference of the Cognitive Science Society.
[55]
Ari Nugraha, Tomoyuki Harada, Izhar Almizan Wahono, and Tomoo Inoue. 2020. A Tool to Add a Tutee Agent in a Monologue Lecture Video Improves Students’ Watching Experience. In Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems. 1–7.
[56]
Kian Keong Aloysius Ong, Christina Eugene Hart, and Poh Keong Chen. 2016. Promoting higher-order thinking through teacher questioning: a case study of a Singapore science classroom. New Waves-Educational Research and Development Journal 19, 1 (2016), 1–19.
[57]
José Paladines and Jaime Ramirez. 2020. A systematic literature review of intelligent tutoring systems with dialogue in natural language. IEEE Access 8 (2020), 164246–164267.
[58]
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning. PMLR, 28492–28518.
[59]
Caroline P Rose, Dumisizwe Bhembe, Stephanie Siler, Ramesh Srivastava, and Kurt VanLehn. 2003. The role of why questions in effective human tutoring. In Proceedings of AIED, Vol. 3.
[60]
Sherry Ruan, Liwei Jiang, Justin Xu, Bryce Joe-Kun Tham, Zhengneng Qiu, Yeshuang Zhu, Elizabeth L Murnane, Emma Brunskill, and James A Landay. 2019. Quizbot: A dialogue-based adaptive learning system for factual knowledge. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–13.
[61]
Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. 2022. Automatic generation of programming exercises and code explanations using large language models. In Proceedings of the 2022 ACM Conference on International Computing Education Research-Volume 1. 27–43.
[62]
Pragnya Sridhar, Aidan Doyle, Arav Agarwal, Christopher Bogart, Jaromir Savelka, and Majd Sakr. 2023. Harnessing llms in curricular design: Using gpt-4 to support authoring of learning objectives. arXiv preprint arXiv:2306.17459 (2023).
[63]
MAJ Stanislaw Paul. 2021. World Class STEM - Benchmarking and Delivering based on Evidence Based Cognitive Science. In 2021 IEEE International Conference on Engineering, Technology & Education (TALE). 1139–1144. https://doi.org/10.1109/TALE52509.2021.9678899
[64]
Suhirman Suhirman and Saiful Prayogi. 2023. Overcoming challenges in STEM education: A literature review that leads to effective pedagogy in STEM learning. Jurnal Penelitian Pendidikan IPA 9, 8 (2023), 432–443.
[65]
Leah A Sutton. 2001. The principle of vicarious interaction in computer-mediated communications. International Journal of Educational Telecommunications 7, 3 (2001), 223–242.
[66]
Karen Swan. 2003. Learning effectiveness online: What the research tells us. Elements of quality online education, practice and direction 4, 1 (2003), 13–47.
[67]
Thitaree Tanprasert, Sidney S Fels, Luanne Sinnamon, and Dongwook Yoon. 2022. Authoring Virtual Peer Interactions for Lecture Videos. In CHI Conference on Human Factors in Computing Systems Extended Abstracts. 1–7.
[68]
Thitaree Tanprasert, Sidney S Fels, Luanne Sinnamon, and Dongwook Yoon. 2023. Scripted Vicarious Dialogues: Educational Video Augmentation Method for Increasing Isolated Students’ Engagement. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–25.
[69]
Peter Teo. 2016. Exploring the dialogic space in teaching: A study of teacher talk in the pre-university classroom in Singapore. Teaching and Teacher Education 56 (2016), 47–60.
[70]
Kurt VanLehn, Stephanie Siler, Charles Murray, Takashi Yamauchi, and William B Baggett. 2003. Why do only some events cause learning during human tutoring? Cognition and Instruction 21, 3 (2003), 209–249.
[71]
Maria Vrikki, Lisa Wheatley, Christine Howe, Sara Hennessy, and Neil Mercer. 2019. Dialogic practices in primary school classrooms. Language and Education 33, 1 (2019), 85–100.
[72]
Thiemo Wambsganss, Tobias Kueng, Matthias Soellner, and Jan Marco Leimeister. 2021. ArgueTutor: An adaptive dialog-based learning system for argumentation skills. In Proceedings of the 2021 CHI conference on human factors in computing systems. 1–13.
[73]
Xu Wang, Carolyn Rose, and Ken Koedinger. 2021. Seeing beyond expert blind spots: Online learning design for scale and quality. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. 1–14.
[74]
Zichao Wang, Andrew S Lan, Weili Nie, Andrew E Waters, Phillip J Grimaldi, and Richard G Baraniuk. 2018. QG-net: a data-driven question generation model for educational content. In Proceedings of the fifth annual ACM conference on learning at scale. 1–10.
[75]
Angelica Willis, Glenn Davis, Sherry Ruan, Lakshmi Manoharan, James Landay, and Emma Brunskill. 2019. Key phrase extraction for generating educational question-answer pairs. In Proceedings of the Sixth (2019) ACM Conference on Learning@ Scale. 1–10.
[76]
Rainer Winkler, Sebastian Hobert, Antti Salovaara, Matthias Söllner, and Jan Marco Leimeister. 2020. Sara, the lecturer: Improving learning in online education with a scaffolding-based conversational agent. In Proceedings of the 2020 CHI conference on human factors in computing systems. 1–14.
[77]
Hongbin Ye, Tong Liu, Aijia Zhang, Wei Hua, and Weiqiang Jia. 2023. Cognitive mirage: A review of hallucinations in large language models. arXiv preprint arXiv:2309.06794 (2023).
[78]
Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, 2023. Siren’s song in the AI ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219 (2023).

Author Tags

  1. Dialogic lecture
  2. Instructor assist tool
  3. LLM-based authoring tool
  4. Vicarious learning
  5. Video-based learning