5.2 Results
Overall, participants rated using AVscript to edit videos as requiring significantly less effort (μ=4.58, σ=1.51 vs. μ=2.17, σ=1.11; Z=2.96, p<0.01), frustration (μ=3.58, σ=2.11 vs. μ=2.08, σ=1.31; Z=2.39, p<0.05), mental demand (μ=3.17, σ=1.80 vs. μ=2.00, σ=0.95; Z=2.03, p<0.05), and temporal demand (μ=4.50, σ=1.93 vs. μ=2.58, σ=1.16; Z=2.54, p<0.05) than their own editing tools, with which they were experienced (Figure 6). Ratings of perceived performance and physical demand were not significantly lower for AVscript. All significance testing used the Wilcoxon signed-rank test. All participants stated they would like to use AVscript in the future for reviewing and editing their videos.
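For context on the analysis, below is a minimal sketch of a paired Wilcoxon signed-rank comparison using SciPy; the participant ratings in it are placeholder values for illustration only, not the study data.

```python
# Minimal sketch of the paired comparison reported above, using SciPy's
# Wilcoxon signed-rank test. The ratings are placeholders, not the study data.
from scipy.stats import wilcoxon

# Hypothetical 7-point ratings from the same 12 participants in each condition.
ratings_condition_a = [5, 4, 6, 5, 3, 5, 6, 4, 5, 4, 3, 5]
ratings_condition_b = [2, 3, 1, 2, 2, 3, 1, 2, 3, 2, 2, 3]

stat, p = wilcoxon(ratings_condition_a, ratings_condition_b)
print(f"W = {stat}, p = {p:.3f}")  # the paper reports normal-approximation Z values rather than W
```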
We report the statistics of the videos edited by participants in Table
1. While 30 minutes were given for each editing session, six participants using AVscript finished the task early. Due to the limited time, ten participants using their own editing tools did not edit the later part of the footage (P4, P8, P9, P11-17). The Video Timeline column in Table
1 shows the edited time segments over the timeline of each video. As participants using their own tools often did not reach the second half of the video, the output videos in the baseline condition contained notable errors in the latter half, such as dark scenes left in (V1), long pauses (V2), and repetitive actions (V2). However, across both conditions, short edits to the video timeline often introduced jump cuts [37, 53] in the final output video.
Figure
7 summarizes how creators used AVscript by visualizing operation sequences relevant to navigation and editing. Overall, participants frequently jumped between different parts of the video using the headings, transcript lines, and words in the
audio-visual script (Figure
7, light blue “Text Jump” cells). Four participants (P10, P14, P16, P17) used the search feature once (Figure
7, blue “Search Jump” cells). Participants frequently deleted speech, pauses, and visual errors in the video (Figure
7, yellow, orange, and red “deletion” cells). Because AVscript’s audio-visual script is aligned with the video timeline and contains descriptions of pauses and errors (e.g., duration, error type), five participants (P4, P8, P9, P16, P17) often deleted problematic segments based on the text descriptions alone, without playing the video. In addition to deleting clips, participants tried to recover from pauses and visual errors by trimming or changing the speed; five participants trimmed pause segments (P8, P10, P15, P16, P17), and one participant changed the playback speed (P1).
5.2.1 Reviewing Videos and Identifying Errors to Edit.
Participants rated AVscript as significantly more helpful for reviewing their video footage to identify errors compared to their existing editing tools (μ=4.25, σ=2.22 vs. μ=2.00, σ=1.04; Z=2.17, p<0.01). When reflecting on their final video, participants expressed that they were more confident with their final result (μ=4.67, σ=1.37 vs. μ=3.00, σ=1.41; Z=2.34, p<0.01), and needed less assistance reviewing it (μ=5.00, σ=1.54 vs. μ=2.75, σ=1.66; Z=2.31, p<0.01) compared to their typical process.
Text-based vs. Timeline-based Video Review. Using AVscript, participants primarily reviewed the video by reading the text of the audio-visual script and outline, whereas with their own video editing tools they primarily reviewed by playing the video. For example, of the seven participants who reviewed the entire video before editing it with AVscript, five read the entire audio-visual script using a screen reader or braille display without playing the video (P4, P9, P10, P16, P17), and three read the outline to gain an overview of the video (P10, P11, P13); P10 did both. On the other hand, when using their baseline tools, all participants played the video from the beginning to identify points to edit. Participants expressed that reading the text of the audio-visual script or outline allowed them to skim the footage faster than playing the video. P16, who reviewed the 11-minute video with AVscript in 3 minutes, remarked, “I’ve been using NVDA [screen reader] for so long that I can understand a very fast TTS [Text-To-Speech]. Because I read 1,075 words per minute, reading the script instead of playing video saves so much time for me.”
Gaining an Overview of Visual Content and Errors. In addition to providing options for faster review, participants reported that they used AVscript’s high-level descriptions of visual scenes and errors to (1) form a mental picture of the visual content (e.g., connecting background sounds with the scene descriptions, or imagining what the scene contained), (2) plan what edits they would like to make later (e.g., get an overview of the parts of the video that needed editing), and (3) mentally bookmark their progress as they edited (e.g., using a scene title to remember where they had left off). P16 remarked that the descriptions were particularly helpful for silences: “Even in silence, I know what is going on in this video! Reading these scene labels, I can construct mental imagery of what the footage looks like.” P10 and P14 also used the inspect feature in concert with the high-level descriptions of visual content and errors (e.g., to understand the content of a pause, or to access objects at the beginning of a scene).
Identifying Opportunities to Edit Video Footage. Participants used AVscript’s detected visual errors to make decisions about visual edits, while they edited audio errors (e.g., pauses and repeated words) with both systems. Using AVscript, all participants reviewed the visual errors in the video, and 11 of the 12 participants edited a visual error (e.g., by deleting, speeding up, or trimming it). When evaluating a visual error, participants read the error along with the speech associated with it to decide whether or not to delete it. For example, when a visual error overlapped with an important sentence whose meaning would be harmed by deletion, participants left the footage intact. On the other hand, if they could make a natural edit (e.g., cutting out an unnecessary sentence, or trimming the length of the error), they would cut it out. To edit the errors detected by AVscript, 11 participants deleted the entire error segment, whereas one participant changed the playback speed of the segment and left part of it in. P13 stated, “If I just get rid of the error, it might result in a jumpcut or leave a too small gap between the sentences which is unnatural.” Participants expressed that being informed of the visual errors made them more confident in their edits, but P4, P9, P11, and P12 noted they would like severity information about each error to inform quality vs. content trade-offs. P12 noted, “It says bad lighting, but what I want to know is how bad so that I can make a decision whether to keep it, fix it, or remove it.”
With both systems, participants edited out irrelevant footage and audio errors (e.g., pauses and repeated words). With AVscript, participants made edits at the word level or line level (a sentence, a long phrase, or a pause), and sometimes removed multiple lines at once when they decided to drop a large chunk of a scene they did not find interesting. Using their own editing tools, all participants made edits to remove filler words or pauses in the speech, and some participants similarly deleted uninteresting content.
5.2.2 Navigating and Applying Edits.
While participants found AVscript beneficial for high-level navigation and editing operations (e.g., by scenes, lines, words, or long pauses) and for non-linear navigation, the current version lacked the fine-grained navigation and editing that their typical video editors provide for detailed audio edits (e.g., short pauses). As participants found AVscript more helpful for some navigation tasks than others, they did not rate it as significantly more helpful than their existing tools for navigation (μ=2.50, σ=2.11 vs. μ=1.30, σ=0.78; Z=1.63, p>0.05) or applying edits (μ=2.58, σ=1.73 vs. μ=1.83, σ=1.19; Z=0.99, p>0.05).
Coarse-Grained Navigation. Using AVscript’s audio-visual script, all participants efficiently navigated the video content by moving the cursor in the transcript both line-by-line (up/down arrow keys) and word-by-word (alt/option + right/left arrow keys). P12 and P16 also jumped to the next scene in the audio-visual script by pressing the ‘H’ key in the screen reader’s browse mode (used to navigate to the next heading element). As they edited the video, seven participants also used the outline pane to quickly navigate to a scene or an error suggestion. In contrast, using their typical video editors’ timelines, all participants navigated by skipping ahead by a fixed time or frame interval (e.g., skip ahead 5 seconds) rather than by content (e.g., sentence, word, pause, error, or scene). Participants then needed to iterate multiple times to find the relevant cut point, as described by P11: “To delete one word, I have to navigate so many times to precisely set the start and end of what I want to cut out. So I often create a small loop around the target just for editing.” Four participants also scrubbed backward or forward to navigate to a nearby word or pause (P8, P10, P11, P14) despite its drawbacks: “The scrubbing audio makes no sense to me, but it can still be used to detect pauses” (P11). P10 and P11 also used the tab key in Reaper to jump to the next audio peak to locate the end of long silences.
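To make the contrast between content-based and timeline-based navigation concrete, here is a minimal, hypothetical sketch (not AVscript’s implementation; the Word and TranscriptCursor names are ours) of a cursor over a time-aligned transcript versus a fixed-interval skip.

```python
# Hypothetical sketch: content-based navigation over a time-aligned transcript
# (as in an audio-visual script) vs. fixed-interval timeline skipping.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float                  # seconds into the video
    is_scene_heading: bool = False

class TranscriptCursor:
    def __init__(self, words: list[Word]):
        self.words = words
        self.i = 0

    def next_word(self) -> float:
        # Word-by-word navigation (alt/option + arrow keys).
        self.i = min(self.i + 1, len(self.words) - 1)
        return self.words[self.i].start

    def next_scene(self) -> float:
        # Jump to the next scene heading ('H' key in screen-reader browse mode).
        for j in range(self.i + 1, len(self.words)):
            if self.words[j].is_scene_heading:
                self.i = j
                break
        return self.words[self.i].start

def skip_fixed(current_time: float, step: float = 5.0) -> float:
    # Timeline-based navigation: jump a fixed interval regardless of content.
    return current_time + step
```

The point of the contrast is that the transcript cursor always lands on a meaningful unit (a word or a scene), whereas a fixed skip lands wherever the interval happens to end, which is why participants had to iterate to find cut points.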
Fine-Grained Navigation. While AVscript makes editing words or pauses convenient, participants asked that future versions also include frame- and interval-level navigation to facilitate fine-grained adjustments to the cursor placement, especially when speech is not present. In addition, as the system only displays pauses longer than 3 seconds to keep the audio-visual script easy for screen reader users to skim, participants expressed that they wanted a mode for fine-grained edits that would also display shorter pauses.
Non-Linear Navigation. Participants also used AVscript to efficiently navigate the video non-linearly: P4 and P9 used the outline to jump to an error they wanted to edit, then moved their cursor back a few lines and played from there to decide where to make the edit, considering the audio content and the visual error together. Four participants used the search pane to find and skip to a specific part of the script (P10, P14, P16, P17). P10 exclaimed: “This search feature is revolutionary! I can search not just for text, but an object or even pauses so easily.” Yet, participants who never used the search feature to navigate the video speculated that searching for visual content would be more useful for their own videos. P9 noted, “I didn’t know what to search for as I didn’t film this video. If I use it (AVscript) for my own video, I will definitely find it useful.”
Applying Edits. The ability to apply edits with AVscript was limited to high-level edits of the video footage itself. With their own editing tools, participants additionally applied effects to improve the audio or visual quality of the footage, including applying a high-pass filter to remove background noise and heavy breaths (P13), inserting music and adjusting its volume so that the original audio stays louder than the music (P17), and adding an intro and credits by inserting a black image with white text (P17). After making a cut in the video, P15 and P17 used a transition effect to avoid an abrupt jump in the audio or visuals. When making edits, two participants often referred to a help menu or a self-created list of hotkeys and commands to remember which keys to use (P4, P8). Participants who did not use a built-in video player in their editing tool read timestamps from a separate video player and then entered them into the command line (P4, using FFmpeg) or into the tool’s input field (P16, using VideoReDo). Both P4 and P16 noted the inconvenience of switching between two different interfaces. P16 said, “Because the video player and VideoReDo use different time formats, I cannot directly copy and paste. When I manually read and type them, I sometimes make typos and this makes very confusing results.” P4 also mentioned, “While the script-based editing is very accessible, I have to run the command after each edit to check the results. If it’s a long video, I have to wait for a long time for the video to be processed.”
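To illustrate the timestamp-driven command-line workflow P4 describes, the sketch below extracts a single segment with FFmpeg after copying the start and end timestamps from a separate player; the timestamps and file names are placeholders, and P4’s actual commands may have differed.

```python
# Sketch of a timestamp-driven FFmpeg cut (placeholder values, not P4's exact commands).
import subprocess

start, end = "00:01:05.0", "00:01:20.0"   # timestamps read from a separate video player

subprocess.run([
    "ffmpeg", "-i", "input.mp4",
    "-ss", start, "-to", end,             # keep only this segment
    "-c", "copy",                         # stream copy: fast, but cuts snap to keyframes
    "clip.mp4",
], check=True)
```

Removing an unwanted stretch instead means extracting the clips around it and concatenating them, so each edit still requires re-running FFmpeg and replaying the output to verify it, which matches the per-edit waiting P4 describes for long videos.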