
Studying Natural User Interfaces for Smart Video Annotation towards Ubiquitous Environments

Rui Rodrigues, Sustain.RD, ESTSetubal Polytechnic Inst. Setubal, Portugal and FCT / Nova Lincs, NOVA University of Lisbon, Portugal, rui.rodrigues@estsetubal.ips.pt
Rui Neves Madeira, Sustain.RD, ESTSetubal Polytechnic Inst. Setubal, Portugal and FCT / Nova Lincs, NOVA University of Lisbon, Portugal, rui.madeira@estsetubal.ips.pt
Nuno Correia, FCT / Nova Lincs, NOVA University of Lisbon, Portugal, nmc@fct.unl.pt

Creativity and inspiration for problem-solving are critical skills in a group-based learning environment. Communication practices have continuously evolved over the years, with increasing use of multimedia elements, such as video, to achieve greater audience impact. Annotations are a valuable approach for remembering, reflecting, reasoning, and sharing thoughts on the learning process. However, it is hard to control playback flow and add notes during video presentations, such as in a classroom context. Teachers often need to move around the classroom to interact with the students, which leads to situations where they are physically far from the computer. Therefore, we developed a multimodal web video annotation tool that combines a voice interaction module with manual annotation capabilities for smarter, more natural interactions towards ubiquitous environments. We observed current video annotation practices and created a new set of principles to guide our research. Natural language enables users to express their intended actions while interacting with the web video player for annotation purposes. By studying and integrating new artificial intelligence techniques, we developed a customized set of natural language expressions that map user speech to specific software operations. Finally, the paper presents positive results gathered from a user study conducted to evaluate our solution.

CCS Concepts: • Human-centered computing → Natural language interfaces;

KEYWORDS: Multimodal Interfaces, Video Annotation, Speech Interfaces, Natural Language Processing, AI-Based Tools, HCI in Ubiquitous Environments

ACM Reference Format:
Rui Rodrigues, Rui Neves Madeira and Nuno Correia. 2021. Studying Natural User Interfaces for Smart Video Annotation towards Ubiquitous Environments. In 20th International Conference on Mobile and Ubiquitous Multimedia (MUM 2021), December 5–8, 2021, Leuven, Belgium. ACM, New York, NY, USA, 16 Pages. https://doi.org/10.1145/3490632.3490672

1 INTRODUCTION

Multimedia resources are important elements for augmenting our communication capacity and achieving better audience attention. Video technologies present great potential to support and positively impact learning and education [21, 33]. In traditional open education and MOOCs (Massive Open Online Courses), video-based learning content is one of the most critical materials for distance and online learners. Additionally, videos are an essential teaching resource in the context of a flipped classroom, within a blended learning model that combines online and physical classes.

The use of annotations is an important learning and reviewing strategy. Traditionally, after studying a text, students write down their thoughts and highlight important content to invoke their memory or offer hints for the future. This annotation-based approach helps students improve their understanding during the learning process.

Video annotation tools are essential for stimulating user creativity when adding and sharing information. Interfaces for video annotation are often based on traditional WIMP interaction [24, 40], which is not a problem when working alone on a video. However, presenting a video in a classroom context requires the lecturer to stand next to the computer the whole time to control playback or save comments and reflections about the current scene. Usually, though, the lecturer needs some mobility in the classroom, moving between the computer, the projection, and the students to maintain good communication. Additionally, natural behaviour from the lecturer during the talk is essential to keep the audience's interest, and the extra movements required to manage video playback create unnatural pauses that can decrease the audience's focus.

Researchers have wanted to talk to computers almost since the first one was invented [17, 22]. In the last few years, smart assistants with speech interfaces have become very popular; Apple's Siri [45], Amazon's Alexa [2], and Microsoft's Cortana [14] enable users to accomplish complex operations efficiently, including sending text messages, obtaining navigation information, and booking a taxi, among other similar tasks, toward pervasive and ubiquitous environments. Speech interfaces can render interactions much friendlier since they enable users to state their objectives without having to learn a new software interface. However, video annotation is a complex task to accomplish through speech alone. Previous work has shown that visual tasks are best accomplished by combining speech and manual input (e.g., touch, mouse, keyboard) [12, 18, 48].

We previously conducted several test sessions in which users were observed working with an earlier version of our video annotation tool. During those sessions, we identified a mix of actions, symbols, sketches, and text, and we concluded that such communication has the potential to benefit from interaction based on both speech and manual input. We intend to improve user interaction in video annotation tasks while using our proposed interface. Our research builds on a set of design ideas for video annotation collected during previous observations of annotation practices.

We implemented MotionNotes, a web video annotation tool that enables users to express their desired actions using voice recognition and natural language. The tool was developed in the context of dance annotation and analysis, yet other application environments are now taking advantage of its video annotation features. A user can ask the player to pause the video at a specific frame by saying “Pause the video” and then say, “I want to add a text annotation”. MotionNotes adds a textbox on top of the video and collects the user's speech. Finally, the speech is converted to text and displayed on the screen inside the textbox.

The natural language module was developed for the video player and annotation domain and maps user sentences onto software operations. The system records the user's voice with the HTML5 audio API available in all modern web browsers. The audio is then segmented into multiple audio files, processed, and mapped to predefined intents. MotionNotes processes simple instructions in the browser, while more complex commands are sent to and handled on the server side. We followed this strategy because, while developing the MotionNotes speech module, we observed that a two-level interpretation process provides superior results: a more domain-specific method is tried first and, if it does not succeed, a second sub-module with a more extensive vocabulary takes over.

We performed tests with 27 people to understand the real benefit of using a speech module in conjunction with the manual interface in video annotation tasks. Afterward, the tool was also tested in a classroom context, receiving positive feedback. This use case emerged after we informally interviewed several teachers, whose experience told them that our solution could work very well in certain classes. Therefore, after conducting the needs assessment, we set out to investigate whether this was the case and to learn how the tool could work properly in a classroom context towards a ubiquitous computing environment.

This paper presents the following main contributions:

  • A web-based application that combines speech with manual interaction for video playback control and digital annotations.
  • A user study of the manual and speech interfaces, reflecting the challenges and benefits of this approach.
  • The knowledge gathered from a classroom use case in which the video annotation software was used with the speech module.

The paper is structured as follows. We start by analyzing the related work in Section 2. Section 3 describes a preliminary lab experiment, Section 4 presents MotionNotes, and Section 5 details the speech recognition architecture, technology, and features. Afterward, Section 6 presents the user study and the results achieved. Finally, we present future work in Section 7 and conclusions in Section 8.

2 RELATED WORK

The use of digital content in lectures increases the audience's interest and concentration through multimedia elements and design customization applied to the respective topic [47, 50]. Presentation software usually requires the presenter to stand next to the terminal while running the application and interacting with it to control the presentation flow. This problem was mitigated by the introduction of wireless remote controllers, giving presenters more flexibility regarding their position in the presentation room. Researchers have explored other solutions to address this problem, and it is possible to find literature on approaches to control presentations with human gestures [36] and voice commands [11].

Regarding practices more focused on video content to create debate and brainstorming, it is not easy to find previous studies on solutions flexible enough to give presenters freedom of position around a room while controlling playback. Pre-recorded videos as a teaching technique are becoming very popular across education [13]. Videos can optimize face-to-face time, enabling more collaborative activities that provide greater opportunities for students to interact and, consequently, foster creativity. Additionally, video has become more interactive, with software enabling timestamped annotation features where users can add comments, marks, and ideas for the future and share them with colleagues [26, 41]. Annotations are used daily in a wide variety of tasks. A classic scenario is students taking notes in the classroom, which motivated research on how video annotation could impact learning activities [19, 35].

In a presentation room where the presenter needs to move around and interact with the audience, interaction with the video player can benefit from multimodal input. People interact with their surroundings through multiple modalities, and human-computer interaction can draw on these capabilities to provide users with the most natural and productive experience possible for completing tasks [51]. Speech and natural language interfaces have been studied in multiple application fields, including automated web chatbots [37], smartphones [4], home media systems [27], image editing [25], car interfaces [29], and industrial applications [3], among others. However, interaction through natural language interfaces for video players and annotation practices is still rare.

Our system supports multiple command types, including video player operations, multiple annotation types, and speech to text. Previous studies on interfaces with speech recognition and natural language understanding demonstrated good outcomes, especially when the voice was combined with other modalities [34].

Video annotation is a valuable resource in different application areas, which motivated the development of several tools: Elan [54] is one of the best-known annotation tools, with applications in many areas such as language documentation, sign language, and gesture research. Choreographer's Notebook [44] was explicitly designed to be used on Choreography workflow, enabling digital-ink and text annotations. BalOnSe [38] is a web implementation that allows users to annotate classical ballet videos with a hierarchical domain-specific vocabulary and provides an archival system for dance videos. Cabral et al. [5, 6] presented Creation-Tool, a project optimized for tablets and pen-based interaction to create different types of annotations, while Ribeiro et al. [39] have expanded this methodology into 3D visualizations. VideoTraces [49] allows users to capture the video and annotate it by talking or gesturing, and it also enables users to evaluate recorded dance performances and share those evaluations with peers for discussion. Commercial video annotation applications such as Wipster [53], Camtasia [8], Frame.io [15], and Vimeo Pro Review [52] have simplified annotating and sharing videos for users. However, none of them supports speech commands for interacting with the applications.

Researchers have studied how to integrate speech and natural language interfaces into video players. Singh et al. [43] created a Voice-Controlled Media Player capable of recognizing seven different speech commands. Chang et al. [10] developed research that led to voice-based navigation for How-To Videos and presented a set of design recommendations for this kind of interaction. Remap is a multimodal search interface that helps users find video assistance via speech, improving users’ focus on their primary task [16]. Interaction in the Internet of Things field can also be enhanced by using multimodal and speech interfaces, as the Minuet system demonstrated [23]. Still, none of these systems supports annotation practices through voice modules.

3 PRELIMINARY LAB EXPERIMENT

Video annotation can be very time-consuming, since different levels of detail must be added to convey ideas properly. Professional dancers, art directors, teachers, football coaches, and choreographers annotate their work to adjust and improve performance. For suggested modifications to be effective, the instructions must be clear and specific enough for other people, such as actors, dancers, students, and athletes, to understand easily.

In order to evaluate the tool's potential and understand the language used in the video annotation procedure, a testing session was performed over two lab days using a previous version of MotionNotes.

3.1 Participants and Design

The first day was conducted with users who already had experience with other annotation tools, and the second day with amateurs, first-time annotation users. These two sessions were used to collect important information to help build an initial vocabulary structure to support speech recognition. It was also crucial to observe how casual media player users with little annotation experience reacted to our prototype.

We invited 19 participants for these lab days (ten with little to no experience). The most representative age interval was 18 to 24 years old (66.7%), followed by 25-34 (18.5%) and 35-44 (14.8%). The gender representation was nearly even, with 53% female and 47% male participants. Regarding education level, 37% had a master's degree, 32% had a bachelor's degree, 16% had studied up to high school, and the remaining 15% held a PhD degree.

The first task asked them to record and interact with a short demo video (see Figure 1). After several minutes, we collected their observations on where and how they thought annotations were needed. Next, we asked the participants to annotate videos with the prototype by following a to-do list of exercises. Those exercises were created so that the users would come into contact with each type of annotation and software feature. By the end of the test, each participant was able to accomplish all the requested annotation tasks.

Figure 1: MotionNotes preliminary lab experiment environment.

3.2 Results and Discussion

The first impressions from users were satisfactory, although some minor issues emerged, typical of preliminary software versions. These problems were solved with slightly more guidance from the researchers, and the opportunity was used to collect more information and vocabulary in order to build our speech recognition system.

As expected, users took advantage of their prior knowledge of annotation work. For instance, more experienced users produced more complete annotation work, both in total numbers and in the variety of annotation types used to express their ideas.

A common point identified was that users preferred laptops over other devices such as desktop computers, smartphones, or tablets. Users largely chose to combine drawing and text annotations to provide a more comprehensive and clear idea of what to improve in their scenes. Since MotionNotes enables annotation both during live recording and after recording, another common observation was that users found operations much easier to engage in on recorded material, in contrast to the live recording scenario, where the number of annotations was smaller.

As we initially expected, this first test was essential to redirect our ideas and drive the current MotionNotes development forward. Contact with end users always helps redefine development priorities by revealing which features are perceived as more critical and valuable. The latest version is described in the following sections.

4 TOOL DESCRIPTION

The design guidelines from the previous versions were preserved since they had been very well accepted (see Figure 2). The menus are at the top, the multiple input modalities are on the left, and their properties are on the right. MotionNotes uses a canvas layer overlaid on top of the displayed video frame to enable users to select and customize the position of any annotation. For example, a user can sketch on top of the video, and MotionNotes saves the ink strokes in that region together with the corresponding timestamp. Editing and customizing all annotation types is also possible [32, 42]. The speech interface was integrated into this new version of MotionNotes carefully, to avoid disrupting the look and feel of the previous version.
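As an illustration of this overlay mechanism, the sketch below shows one way ink strokes can be captured on a canvas placed over the video and stored with the current playback time; the element identifiers, the annotation record format, and the 0.5-second display window are assumptions for illustration rather than MotionNotes internals.

```javascript
// Minimal sketch of a canvas overlay for timestamped ink annotations
// (illustrative; not the actual MotionNotes implementation).
const video = document.getElementById('player');
const canvas = document.getElementById('overlay');   // positioned absolutely over the video
const ctx = canvas.getContext('2d');
const annotations = [];                               // { time, points: [{ x, y }, ...] }
let currentStroke = null;

canvas.addEventListener('pointerdown', (e) => {
  currentStroke = { time: video.currentTime, points: [{ x: e.offsetX, y: e.offsetY }] };
});

canvas.addEventListener('pointermove', (e) => {
  if (currentStroke) currentStroke.points.push({ x: e.offsetX, y: e.offsetY });
});

canvas.addEventListener('pointerup', () => {
  if (currentStroke) annotations.push(currentStroke);
  currentStroke = null;
});

// During playback, redraw only the strokes attached to the current moment.
video.addEventListener('timeupdate', () => {
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  annotations
    .filter((a) => Math.abs(a.time - video.currentTime) < 0.5)  // 0.5 s window (assumption)
    .forEach((a) => {
      ctx.beginPath();
      a.points.forEach((p, i) => (i ? ctx.lineTo(p.x, p.y) : ctx.moveTo(p.x, p.y)));
      ctx.stroke();
    });
});
```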

Figure 2: Annotated video frame with text annotation on top-left and drawn annotation in the bottom center.

We developed support for multiple commands to execute the most common tasks, e.g., pausing, playing, or stopping the video. Moreover, we took care to support different word sequences for the same action, giving users significant freedom when invoking these commands. The objective was to identify commands both from specific sentences and from more subjective, open-ended ones. One example is asking for the desired item: “MotionNotes, I need a new text annotation in this frame, please”.
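To give an idea of how several phrasings can be routed to the same command, the simplified matcher below associates a few example sentences with each intent; the phrases and intent labels are illustrative assumptions, and MotionNotes itself relies on the machine-learning pipeline described in Section 5 rather than plain keyword matching.

```javascript
// Deliberately simplified keyword matcher: several phrasings map to one intent.
// Phrases and intent names are examples, not the real MotionNotes templates.
const UTTERANCE_TEMPLATES = {
  play_video: ['play', 'play the video', 'can you play this video'],
  pause_video: ['pause', 'pause the video here', 'hold it there'],
  text_annotation: ['add a text annotation', 'i need a new text annotation', 'write this down'],
};

function matchIntent(text) {
  const t = text.toLowerCase();
  for (const [intent, phrases] of Object.entries(UTTERANCE_TEMPLATES)) {
    if (phrases.some((p) => t.includes(p))) return intent;
  }
  return null;   // unmatched: handled by the cloud layer or the help popup
}
```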

Therefore, we identified three critical visual features that would be added to the tool:

  • Command engaged feedback. Users must have a visual signal indicating whether a given speech command was successfully received and interpreted.
  • A speech command list. A considerable number of commands are available and can be activated by multiple sentences. Therefore, a mechanism to help users find and learn them was required.
  • Annotation position. Users should be free to annotate any part of the video frame, so the system must enable the selection of multiple regions. For instance, spatial areas such as the top right and the center can be indicated using speech (a possible mapping is sketched after this list). Touch, mouse, and digital pen remain available for more precise position selection.
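As referenced in the last item above, the sketch below illustrates how spoken region names could be translated into default positions on the video frame; the region names, fractions, and helper name are assumptions for illustration.

```javascript
// Illustrative mapping from a spoken region name to an anchor point on the frame.
const REGIONS = {
  'top left':     { x: 0.15, y: 0.15 },
  'top right':    { x: 0.85, y: 0.15 },
  'center':       { x: 0.50, y: 0.50 },
  'bottom left':  { x: 0.15, y: 0.85 },
  'bottom right': { x: 0.85, y: 0.85 },
};

function regionToPixels(regionName, canvas) {
  const r = REGIONS[regionName.toLowerCase()];
  if (!r) return null;                        // unknown region: fall back to manual placement
  return { x: r.x * canvas.width, y: r.y * canvas.height };
}

// e.g., placing a textbox after "add a text annotation in the top right":
// const pos = regionToPixels('top right', overlayCanvas);
```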

4.1 Working with MotionNotes

After the initial experiments and reflection on their outcomes, an updated version of MotionNotes was developed. The tool enables users to describe their goals using different annotation types and to operate the tool with their voices. In order to better understand the interaction, we follow Matilde while she annotates her video with MotionNotes.

First, Matilde opens MotionNotes and would like to load a video that she recorded in a recent go-kart competition. Matilde says, “I want to import my new video,” and MotionNotes immediately opens a new window. Matilde browses and selects her video. Once done, she says, “Can you play this video, please?” and the video instantly starts playing. She notices that the first corner trajectory could be better and feels the need to highlight this. Matilde then says, “Pause the video here.” and “I need to add a text annotation.” MotionNotes immediately creates a new textbox and asks for the annotation. Next, Matilde starts saying, “Need to widen the corner trajectory to obtain more speed in the next straightaway.” The software automatically transforms the speech into text and puts it in the textbox. Finally, Matilde says, “I need to draw.” MotionNotes activates the draw functionality; she creates a circle around the location on the circuit where the kart should be.

Regarding input audio segmentation, the system starts listening as soon as Matilde turns on the speech module, searching for a keyword, which in our case is “MotionNotes”. After each keyword detection, the following words are matched against the utterances to identify intents and perform the expected actions.

Sometimes Matilde wants to highlight a particular frame of her video but is unsure what to say, for instance, “I think I will need an annotation here”. MotionNotes processes the sentence and does not match it to any available command. In these cases, a popup window appears with the list of available voice commands.
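A minimal sketch of this keyword-gated flow, covering both the engage key and the fallback popup, is shown below; the helper functions (matchIntent from the Section 4 sketch, executeIntent, and showCommandListPopup) are hypothetical names rather than MotionNotes code.

```javascript
// Keyword gating: nothing is interpreted as a command until the wake word is
// heard; an unmatched request opens the list of available voice commands.
const WAKE_WORD = 'motionnotes';
let armed = false;

function onPhraseRecognized(phrase) {
  const text = phrase.toLowerCase().trim();
  if (!armed) {
    if (text.includes(WAKE_WORD)) armed = true;   // engage key detected
    return;
  }
  armed = false;                                  // one command per wake word
  const intent = matchIntent(text);               // simple matcher or the ML layers of Section 5
  if (intent) {
    executeIntent(intent);                        // hypothetical dispatcher
  } else {
    showCommandListPopup();                       // no match: show the command list popup
  }
}
```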

4.2 MotionNotes Details

MotionNotes is a single-page application (SPA) built on a client-server architecture, which enables users to share their videos and annotations. The web browser works as the application client because of its wide availability and compatibility with almost every user device with internet access. Compatibility with the main browsers was also tested: Google Chrome, Mozilla Firefox, Microsoft Edge, and Safari. MotionNotes supports 15 different commands and was developed using JavaScript.

5 SPEECH RECOGNITION

Speech APIs based on machine learning techniques were integrated into our solution in order to extract meaning from user voice interactions. MotionNotes uses the extracted meaning to process and execute operations. A considerable number of external speech APIs were analyzed by our team, resulting in two categories of tools: 1) local speech recognition and 2) cloud-based speech recognition. For MotionNotes, we selected an API from each category, providing quicker performance for simple instructions and superior vocabulary support for more complex ones (see Figure 3).

  • Local speech processing - When users turn on the speech interface, MotionNotes instantly activates the system microphone, and all spoken words are processed by the local browser-based machine learning algorithm. The data previously collected was used to train a local neural network. For the simplest instructions, this technique provided satisfactory results with minimal latency. MotionNotes uses TensorFlow.js [20, 46] and the ML5 sound classification library [31] for the first layer of speech processing.
  • Cloud speech processing - Sometimes the local speech algorithm, the first layer, returns low-confidence results, i.e., classifications with a confidence lower than 50%. In those cases, MotionNotes sends the recorded audio data to a remote speech API service. This approach is less efficient than the local implementation but gives MotionNotes access to a more powerful speech processing system, increasing the success rate. We used the WIT.AI HTTP API [1, 30].

Figure 3: System architecture

5.1 Speech Processing

The voice is collected using the audio input hardware available on the client's device, traditionally a microphone. The user must grant the browser permission to access this resource; we use the HTML5 MediaDevices API. Once the speech module is activated, our machine learning algorithm classifies every recorded sound in real time. A parser function was also developed to analyze all the classification results returned by the algorithm and, when a result has high confidence, the corresponding command is executed.
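The sketch below outlines how such a capture-and-parse step could look: microphone access is requested through the MediaDevices API, and the parser filters classifier output by confidence. The 50% threshold follows the description in this section, while executeIntent and sendAudioToCloud are hypothetical helper names.

```javascript
// Capture and parsing sketch (illustrative, not the MotionNotes source).
async function enableSpeechModule() {
  // Triggers the browser permission prompt; the stream can also be recorded
  // (e.g., with MediaRecorder) so the audio is available for the cloud fallback.
  return navigator.mediaDevices.getUserMedia({ audio: true });
}

function parseClassification(results) {
  // results: [{ label, confidence }, ...] sorted by confidence
  const best = results && results[0];
  if (!best) return;
  if (best.confidence >= 0.5) {
    executeIntent(best.label);   // high confidence: run the player/annotation command
  } else {
    sendAudioToCloud();          // low confidence: hand over to the remote layer (sketched below)
  }
}
```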

The recorded audio is sent to a cloud platform if the parser function receives a classification result with confidence under 50%. The audio is then converted to text and processed against a knowledge base composed of intents, entities, and utterances. This approach works well when the order of words is mixed up or a sentence is poorly formed, such as “this video, play it.” This technique frees users from having to follow rigid speech structures. Our team created both the local and remote datasets; the experience acquired during the preliminary lab experiment (described in Section 3) was fundamental for accomplishing this task.
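A sketch of this hand-off to the cloud layer is shown below, posting the recorded audio to the Wit.ai speech endpoint and reading back the transcription with intents and entities; the token constant, API version, audio format, and single-JSON response shape are assumptions that may need adjusting to the actual recording setup and API version.

```javascript
// Cloud fallback sketch: send the recorded audio to Wit.ai and extract the top intent.
async function sendAudioToCloud(audioBlob) {
  const response = await fetch('https://api.wit.ai/speech?v=20210928', {
    method: 'POST',
    headers: {
      Authorization: 'Bearer ' + WIT_AI_SERVER_TOKEN,   // hypothetical app token constant
      'Content-Type': 'audio/wav',                      // must match the recorded audio format
    },
    body: audioBlob,
  });
  const result = await response.json();   // assumes a single JSON object is returned
  // Typical shape: { text, intents: [{ name, confidence }], entities: { ... } }
  const top = result.intents && result.intents[0];
  return {
    intent: top ? top.name : null,
    confidence: top ? top.confidence : 0,
    entities: result.entities || {},
  };
}
```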

5.2 Local Speech Setup and Training

For the local classification (first layer), a solution capable of working in client web browsers was used. This approach simultaneously prevents high network traffic loads and ensures suitable response times. ML5.js is an open-source, high-level interface for TensorFlow.js, a library that handles GPU-accelerated mathematical operations and memory management for machine learning algorithms. This JavaScript library can work with pre-trained models for tasks such as detecting human movement in video, generating text, styling images, detecting pitch, and classifying sound. This last feature, the ML5 sound classifier, enables audio classification through appropriate pre-trained models, detecting whether a specific noise is made (such as a clapping sound) or a particular word is said. This classifier uses the web browser's WebAudio API and is built on top of TensorFlow.js to run inference entirely in the browser.

Another critical component in making our local algorithm functional was creating a new model specialized for web video players and annotation practices. To accomplish this, Google's Teachable Machine was used [9]. This tool was created to assist in developing custom machine learning classification models without the need for substantial technical expertise. The configuration followed several steps: first, we recorded audio files containing a considerable number of examples for each voice command. These samples were created with several voices, sometimes including background noise, to provide better performance in real-world conditions. This procedure required a long fine-tuning process with several iterations until the model proved sufficiently robust. Finally, the model was exported and integrated with the ML5 sound classifier.
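The following sketch illustrates this integration step, loading an exported Teachable Machine audio model into the ml5 sound classifier and classifying microphone input continuously; the model URL is a placeholder, and parseClassification refers to the parser sketched in Section 5.1.

```javascript
// Wiring a Teachable Machine audio model into the ml5 sound classifier (illustrative).
const TM_MODEL_URL = 'https://teachablemachine.withgoogle.com/models/XXXX/model.json'; // placeholder

let classifier;

function setupLocalClassifier() {
  classifier = ml5.soundClassifier(TM_MODEL_URL, modelReady);
}

function modelReady() {
  // Continuously classify microphone input; results arrive sorted by confidence.
  classifier.classify((error, results) => {
    if (error) { console.error(error); return; }
    parseClassification(results);
  });
}
```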

Currently, the system can identify nine intents locally, without leaving the client's web browser. This intent subset covers the most used commands, such as the engage key, play, pause, stop, and the annotation triggers. In our current implementation, we use 214 utterances connected to the nine possible intents (see Table 1).

Table 1: Local and remote speech training.
Intent | Local Utterances (ML5) | Remote Utterances (WIT.AI)
Engage Key | 62 | —
Import Video | — | 20
Select Video | — | 18
Index Video | — | 24
Close Selection | — | 14
Play Video | 25 | 26
Pause Video | 14 | 12
Stop Video | 12 | 10
Rewind Video | — | 20
Fast Forward | — | 18
Draw Annotation | 25 | 24
Text Annotation | 35 | 36
Voice Annotation | 15 | 14
Marks Annotation | 10 | 10
Help | 16 | 16

5.3 Remote Speech Training and Setup

When users need to work with speech commands, one of the oldest problems is mapping unknown words to known operations, i.e., handling speech input containing words outside our core vocabulary of operations. Once again, we managed this issue with an external API. WIT.AI is a cloud platform owned and provided by Facebook since 2015; it offers developers an easy interface for creating applications and connecting devices. This system is a natural language interface for applications capable of turning sentences into structured data. The platform was primarily developed to help build bots able to interact with humans on messaging platforms through text or speech. Creating a chatbot presents two major challenges: natural language processing (NLP) [7] and natural language understanding (NLU) [28]. NLP concerns splitting sentences into parts, such as tokens and named entities, while NLU focuses on understanding what the sentences mean.

With MotionNotes, we take advantage of these services to facilitate and improve our speech module. We therefore use the WIT.AI API, which receives user text or voice as input and returns intents and entities as output. Intents correspond to user intentions, i.e., what a sentence is meant to accomplish, while entities are variables that capture the details of the user's task. Based on the intents and entities, our application can extract meaning from individual sentences and act accordingly.
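To illustrate how this structured output can drive the application, the sketch below maps a returned intent and its entities onto player and annotation actions; the intent labels, entity key, and helper functions are assumptions about the Wit.ai app configuration rather than its actual definitions.

```javascript
// Acting on Wit.ai output (illustrative): the intent selects the operation,
// and the entities carry its parameters.
function actOnUnderstanding({ intent, entities }) {
  switch (intent) {
    case 'play_video':
      videoPlayer.play();                       // videoPlayer: the HTML video element
      break;
    case 'pause_video':
      videoPlayer.pause();
      break;
    case 'text_annotation': {
      // e.g., "add a text annotation in the top right"
      const region = entities['region:region']?.[0]?.value || 'center';
      createTextAnnotation(regionToPixels(region, overlayCanvas));   // see the Section 4 sketches
      break;
    }
    case 'help':
    default:
      showCommandListPopup();                   // unknown intent: show the available commands
  }
}
```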

After completing the setup process with the proper utterances, entities, and respective training, the API started to return decent results, even for sentences with vocabulary outside the core. MotionNotes uses 262 utterance templates associated with 15 possible intents (see Table 1).

6 USER STUDY

The study was divided into two parts, each addressing a different question. The first user study examines how MotionNotes with the speech module compares to MotionNotes without it, in both user satisfaction and performance. The second user study investigates how MotionNotes behaves in a classroom video presentation scenario, collecting feedback from the audience towards the design of mobile and ubiquitous environments.

The motivation behind the chosen design was the need to confirm, first of all, whether the system is stable enough when working with voice. Therefore, we organized a first user study in which the participants performed activities with the manual interface followed by similar activities with the speech module. During these activities, we collected data, both manually and automatically, about the participants' performance. It is essential to highlight that our intention is not to prove that the speech interface is superior to the manual one, since our goal is not to replace manual methods. Instead, our focus is to extend the user's options for natural interaction, enabling annotation in situations where the user needs physical mobility.

The inspiration for the second study had roots in several factors. First, we selected situations where manual annotation is challenging for the user. Such circumstances, in which people need mobility between the computer, the projection, and the audience to foster good communication, include talks, work meetings, lectures, classroom teaching, and creative brainstorming. After evaluating each of these scenarios, we decided to advance with the classroom case. This decision was supported by informal interviews with several teachers, whose experience suggested that our solution could work very well in certain classes. The social circumstances of the preceding months also contributed, since we had to transform our teaching methods and create materials for online classes. After the first lockdown, we had valuable material, including videos, that we wanted to use to improve our traditional teaching. Therefore, the classroom environment, the videos created to support online classes, and the MotionNotes testing phase converged into an excellent opportunity that worked out very well.

6.1 User Study One

This first user study was conducted under unusual circumstances: the invited testers could not be physically present in our laboratory due to the health practices imposed by the COVID-19 restrictions.

A new testing procedure was created to mitigate and overcome this problem: first, remote web conferencing software enabled the researchers and participants to communicate through video calls and share their screens while performing the required activities. Second, we used cloud document sharing software to share instructions, task files, and evaluation questionnaires. Finally, since MotionNotes was not accessible outside our university network, we requested a production server to deploy our prototype for the testing days.

We invited 27 participants (11 female) from two different public universities. The participants' ages ranged from 18 to 44 years old. The users had different levels of expertise with video players and video annotation, with 52% self-reporting limited experience with this type of software.

The first activity in the test procedure was to demonstrate the tool and train the participants in using it. We shared our screens and presented the main actions that our tool supports, starting with the manual interface and then the speech module. Afterwards, the participants were asked to share their screens and work freely with the tool for several minutes, with our team providing support during that time.

While sharing their screens, the participants were asked to complete 24 tasks divided into two groups (G1 – 18 tasks and G2 – 6 tasks); these tasks were created to be executed using different videos available with the application for test purposes. We assigned similar tasks with the manual interface and the speech module.

In the first group, users were always informed of what the final annotation results should be, and they had to execute the steps with the tool to achieve those results. One example from the G1 tasks is “Please identify the synchronization problem between the two dancers in video one, highlight the fact with a draw annotation (manual interaction)”. Another is “There is a pronunciation problem when the teacher says the word python in video two; please add a sound annotation with the correct pronunciation (speech interaction)”. G1 has a high number of tasks since it was essential to test most of the video annotation features, in contrast with G2, where the objective was to observe users working without restrictions, so fewer tasks were required.

The second group of tasks was intended to be a more open and creative video annotation work. We had short videos containing puzzle games with poor gaming performances. So, we invited the users to identify the mistakes and provide advice on how to improve performances using the tool. There were no restrictions regarding the event to annotate, the interface to use, or the annotation type. Users only had a time limit to complete the annotation work on each video.

Our team watched the shared screens during the task execution, taking notes of user actions and impressions. The participants’ feedback was collected after they completed their tasks, and they were asked to comment on their experiences with MotionNotes and encouraged to provide new suggestions and constructive criticism. Finally, our sessions ended with the participants completing a short questionnaire about their experiences with MotionNotes.

6.1.1 Task evaluation procedure. We defined a task evaluation method using a scale of one to five points for each task. For instance, if a user applied one text annotation to a task that involved more annotations, a score of three was given, while a user who applied all annotations to the same task received the maximum score of five points. Users had a fixed time of about two minutes to accomplish each task. If a task was not finished within that period, the user's score was penalized and we began to provide friendly hints to help them reach the objectives. This means that all users completed each task, and the final score measured each user's degree of success in completing it. The tool was modified to track all user actions, in both the manual and speech interfaces, and to save detailed logs for each triggered operation.

6.1.2 Performance results. Analysis of the collected data showed that 1446 commands were used during the tests, a mean of 54 per session. Of these commands, 51% were triggered by voice, demonstrating that users made use of both interfaces. As expected, users with less experience required approximately 10% more commands to accomplish the objectives. Speech module usage varied from user to user; however, the percentage of tasks completed using the speech interface was higher for participants with previous annotation experience (53.3%). They may have been more comfortable performing the tasks and therefore more willing to explore this different interaction option.

Regarding the to-do task exercises, the success scores were positive (M:3.92; SD:0.65), with better scores observed in task group one (M:4.32; SD:0.42) than in task group two (M:3.23; SD:0.24). The more guided methodology used for the first group of tasks made it easier to achieve the objectives compared to group two, which followed a more open testing procedure. The average success score with the speech module was (M:3.96; SD:0.61) compared to (M:3.89; SD:0.74) for the manual interface without the speech module. The results are almost identical, and the difference has little statistical relevance. One possible explanation is that the integration between the manual interface and the speech module was smooth and careful, resulting in consistent scores under our method for measuring success.

Regarding the voice commands invoked by participants, the number of words identified varies between one and five, with the average number of words per command being (M:1.8; SD:1.0). We examined a sample of 89 inputs intended to trigger the play video action to analyze speech module performance. We found 19 results with at least one incorrect word, which translates to 87% correct transformations; this indicates that our speech module achieved decent accuracy.

6.1.3 Qualitative results. User satisfaction with the application was positive, as highlighted in many comments. One user (U2) said: “Easy to learn, does not take much time to understand all the functionalities.” Another user (U17) was very pleased with the overall experiment: “I had no problems completing the proposed tasks either in the manual interface or voice interface. The two interfaces complement each other very well!”. An additional user (U11) noted that the speech module could be a suitable alternative to the manual interface: “The performance overall is good. There were scenarios where having the speech module helped find features when I did not know where they were on the ‘manual’ interface”.

The following three questions in this study were designed on a 5-point Likert scale to analyze the system's usability and performance. Descriptive statistics and a paired-samples t-test were used to evaluate the scores (see Table 2).

Table 2: Descriptive statistics and t-test of evaluation result for MotionNotes speech module test.
Question | A | B | C | D | E | M | SD | t
1. Consider the software usability
1.1 Classify the interaction with only the manual interface | 6 | 10 | 10 | 1 | 0 | 3.77 | 0.83 | -2.054; p<0.05
1.2 Classify the experience with speech + manual interface | 14 | 9 | 3 | 0 | 1 | 4.29 | 0.94 |
2. Consider the software performance
2.1 Classify the interaction with only the manual interface | 13 | 11 | 3 | 0 | 0 | 4.37 | 0.67 | 0.593; p>0.05
2.2 Classify the experience with the speech interface | 11 | 12 | 4 | 0 | 0 | 4.26 | 0.69 |
3. Consider preferences when working with annotation software
3.1 Preference for manual interface | 15 | 10 | 2 | 0 | 0 | 4.48 | 0.63 | 1.688; p>0.05
3.2 Preference to work with speech interface | 12 | 12 | 3 | 0 | 0 | 4.33 | 0.66 |
Scale: A: strongly useful, B: useful, C: ok, D: not useful, E: strongly not useful.

Regarding software usability with and without the speech module, most users answered that they preferred using the manual and speech interfaces together. The t-test analysis confirmed that the difference between the scores was significant.

The second question was about software performance with and without the speech module, and the feedback scores were very close. The t-test did not reveal significant differences, so we can conclude that participants did not perceive performance differences between the experience with the speech interface and without it.

When asked about performing annotation work with speech or manual interface, the results showed a preference for manual interaction. After debating this question with participants, we concluded that this result is strongly connected to the previous question since the more experienced participants commented that manual keyboard shortcuts are essential for efficiency when performing annotation work.

During the test, we observed that participants switched from the speech interface to the manual interface in the following cases: 1) they needed to accomplish a detailed task, such as customizing, finding, or fine-tuning an object; 2) they wanted to explore the menus and the different options offered by the manual interface. The opposite was observed when participants did not know where a functionality was in the manual interface and tried voice commands to achieve their objectives.

Another situation observed was a user abandoning the speech interface after experiencing a less accurate result. If the user did not find the correct sentence to activate the command, some frustration arose after one or two attempts, and the probability of switching to the manual interface increased. The participants also mentioned that sometimes they inadvertently clicked on the screen, causing undesired actions, such as accidentally adding annotations to incorrect locations. This problem can be solved by providing more feedback to users.

The last question in this first study asked participants to reflect on whether they would recommend this tool to friends or colleagues. The feedback was very positive, with 100% of participants reporting they would recommend MotionNotes.

6.2 User Study Two

The students participating in this second study were volunteers from a computer science course attending two-hour theoretical-practical classes. The testing procedure included a phase in which the class topic was introduced using a more traditional methodology (first hour) and another phase (second hour) in which MotionNotes was used. The order of the phases could change from class to class.

The traditional methodology consisted of showing PowerPoint slides with the topics, writing some examples on the board, and concluding the topic with exercises. The methodology with the video annotation tool consisted of showing videos about the topic, with the teacher using a Bluetooth microphone, watching the videos with the audience, and controlling the playback by voice. The teacher could also add text annotations remotely over the video. Those annotations were mainly questions, observations, and comments from both teacher and students arising from the debate about the topic. Afterwards, the video and the respective annotations were made available to the students at a direct web URL.

The researchers themselves created the videos for each topic presented in the study. The clips were mainly desktop recordings with voice, made with OBS software and initially created to support online classes. In those videos, we present the topics with code examples, resulting in mini projects for each topic. The code was also made available with the annotated videos.

For the testing classes, 13 students were present (four female). The students had similar education levels since they were all completing a bachelor's degree; 16% of the participants were studying and working. More than 80% of the students were between 18 and 24 years old. Regarding participants’ habits, 84% use video content to understand and consolidate class topics, and 67% use annotations to help their learning processes. It is also interesting that 92% of the students share annotations, mainly through instant messaging software such as WhatsApp, Skype, and Discord. Concerning speech-based technology, 62% of the participants already use tools controlled by speech.

The students’ feedback was collected after they completed a couple of classes with this procedure. They were asked to comment on how they felt about the classes using MotionNotes and encouraged to offer new ideas and constructive criticism about the experiment. Finally, we asked the participants to complete a short questionnaire of 18 questions to understand the impact of MotionNotes. In order to evaluate the suitability of MotionNotes in the context of learning with video content, the questions were designed on a 5-point Likert scale covering system suitability and perceived usefulness. Descriptive statistics and a paired-samples t-test were used to analyze the results (see Table 3).

Table 3: Descriptive statistics and t-test of evaluation result for MotionNotes classroom test.
Question | A | B | C | D | E | M | SD | t
1. Consider the course learning materials
1.1 Classify the learning material with video content | 4 | 2 | 5 | 2 | 0 | 3.62 | 1.08 | -0.185; p>0.05
1.2 Classify the traditional learning material | 4 | 1 | 8 | 0 | 0 | 3.69 | 0.91 |
2. Consider the classroom environment
2.1 Classify a class with video support | 5 | 3 | 5 | 0 | 0 | 4.00 | 0.88 | 1.979; p<0.05
2.2 Classify a class without video support | 2 | 3 | 6 | 2 | 0 | 3.38 | 0.92 |
3. Consider the video usage situation
3.1 Watching video learning materials during class | 5 | 4 | 4 | 0 | 0 | 4.08 | 0.83 | 4.185; p<0.05
3.2 Watching video learning materials before class | 0 | 2 | 7 | 2 | 2 | 2.69 | 0.91 |
4. Consider the video playback control during class
4.1 Voice remote commands | 7 | 3 | 3 | 0 | 0 | 4.31 | 0.82 | 2.034; p<0.05
4.2 Traditional mouse and keyboard | 3 | 3 | 5 | 2 | 0 | 3.54 | 1.01 |
5. Consider the act of adding new video annotations during class
5.1 Adding remote annotations with voice commands | 7 | 3 | 3 | 0 | 0 | 4.31 | 0.82 | 1.812; p<0.05
5.2 Traditional mouse and keyboard | 3 | 4 | 4 | 2 | 0 | 3.62 | 1.00 |
Scale: A: strongly useful, B: useful, C: ok, D: not useful, E: strongly not useful.

In question one, the participants were asked to rate their learning materials in two different scenarios: learning materials with video content and without. The scores were very close, which resulted in a t-test with no significant difference.

Question two was about the classroom environment and whether the participants enjoyed classes with video support. A preference for the methodology containing video content was observed, with a significant difference in the t-test.

In question three, students were asked whether they preferred to watch the video learning materials before or during class. There was a clear preference for working with these materials during class (M=4.08), and the t-test also returned a significant difference.

The teacher's freedom of physical position in the classroom is an essential topic in this paper, and questions four and five addressed this issue. The first concerned the capacity to control the playback flow from any position in the classroom, whose value the students recognized according to the means and t-test results. The second concerned the capacity to add text annotations over the video by voice, which also achieved good scores and significant differences.

The last question asked the students to reflect on whether they consider it important to be able to add annotations during brainstorming or debate, in classrooms or professional meetings, when video content is used. All the participants answered positively.

Overall, the feedback was positive, with a couple of comments from the students highlighting user satisfaction. One student said, “Having the teacher in the audience with us debating a new technique demonstrated in the video was a factor which increased my attention and desire to participate”. Another student said, “The text annotations created in key moments are significant for sharing opinions and collaboratively learning”. One additional comment worth mentioning was related to an after-class context “I think it is interesting that we can use these videos with annotations later to prepare for the final exams”.

7 FUTURE WORK

The user feedback from MotionNotes testing was generally positive; some interesting ideas were also obtained for further development and research work.

Although the current version of our speech module only supports English, support for other languages would be a valuable feature to implement in future development. While running tests, we observed that users were sensitive to speech module response times, especially when the voice was sent to our server for processing; in those cases, the tool took slightly longer to engage the requested command (∼1 s). We are aware that this issue depends on the user's Internet connection speed. Nonetheless, work on our server's performance and on the connection level should continue in order to mitigate it. Acceptable outcomes were observed regarding speech module accuracy, but we think there is room for improvement in a future version. Since the WIT.AI platform suggests new vocabulary and utterances after each test session, our team will validate this vocabulary to create new templates and connections between the new utterances, entities, and the respective intents.

The feedback obtained regarding the collaborative experiment in a classroom was also positive. Nonetheless, a few aspects could improve the overall quality: 1) videos created to be presented in the classroom need a higher zoom level than those created to be watched on a personal device; 2) the sound of a regular computer is not enough for larger rooms, so a more powerful sound system is needed; 3) background noise will always reduce the accuracy of voice commands, so noise reduction algorithms could benefit the speech module in a future version; and 4) in this experiment, only the teacher had a microphone to interact with the system, which is appropriate in a classroom context, but a future experiment could integrate a more sophisticated sound input system to enable each participant to add annotations with their voice. All these observations and requirements are important principles for designing more natural user interaction towards smarter environments in mobile and ubiquitous computing.

We chose the browser as our platform for MotionNotes due to its compatibility and availability on practically every workstation and smartphone. Although we did not restrict which device could be used in our test program, all users selected laptops for the testing sessions. It would be interesting to explore this prototype on tablets or smartphones to determine whether similar results can be achieved.

Finally, as future research, we plan to widen the analysis in several directions. First, we intend to extend video annotation usage to other courses, talks, and collaborative creative sessions. Second, we plan to include the effect of assessment strategies in the study to observe their influence on audience engagement. Third, we want to research how gesture input can be integrated to better support video annotation users, providing another option to explore and combine with manual and speech interaction.

8 CONCLUSIONS

The web video annotation tool presented in this paper enables users to execute commands with both speech and manual interaction (e.g., touch, mouse, keyboard). This innovation resulted from several discussions and suggestions obtained during tests and meetings working with a previous version of this tool.

Technical developments enabled the required interaction to be implemented while taking advantage of novel artificial intelligence techniques for speech processing. The MotionNotes speech module architecture successfully processes voice input and facilitates multimodal interaction.

Video annotation tools to support education are becoming common in multiple learning experiences. Using MotionNotes to present course topics with video positively influences students’ participation and concentration in class. The tool enables the speaker to be more effective, since it supports control from a distance using a wireless sound input device. Moreover, this study's results highlight the importance of and need for technology-supported teaching strategies to achieve better educational creativity and productivity.

ACKNOWLEDGMENTS

This work is funded by Fundação para a Ciência e Tecnologia (FCT) through a Ph.D. Studentship grant (2020.09417.BD). It was supported by the project WEAVE, Grant Agreement Number: INEA/CEF/ICT/A2020/ 2288018; and the project CultureMoves, Grant Agreement Number: INEA/CEF/ICT/A2017/ 1568369. It is also supported by NOVA LINCS Research Center, partially funded by project UID/CEC/04516/ 2020 granted by Fundação para a Ciência e Tecnologia (FCT).

REFERENCES

  • Qaffas, A.A. 2019. Improvement of Chatbots Semantics Using Wit.ai and Word Sequence Kernel: Education Chatbot as a Case Study. International Journal of Modern Education and Computer Science. (2019). DOI:https://doi.org/10.5815/ijmecs.2019.03.03.
  • Amazon Alexa Voice Assistant | Alexa Developer Official Site: 2021. https://developer.amazon.com/en-US/alexa. Accessed: 2021-01-24.
  • Augstein, M. et al. 2019. WeldVUI: Establishing Speech-Based Interfaces in Industrial Applications. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2019).
  • Bellegarda, J.R. 2014. Spoken Language Understanding for Natural Interaction: The Siri Experience. Natural Interaction with Robots, Knowbots and Smartphones.
  • Cabral, D. et al. 2011. A creation-tool for contemporary dance using multimodal video annotation. MM’11 - Proceedings of the 2011 ACM Multimedia Conference and Workshops (2011).
  • Cabral, D. and Correia, N. 2017. Video editing with pen-based technology. Multimedia Tools and Applications. (2017). DOI:https://doi.org/10.1007/s11042-016-3329-y.
  • Cambria, E. and White, B. 2014. Jumping NLP curves: A review of natural language processing research. IEEE Computational Intelligence Magazine.
  • Camtasia: 2001. https://www.techsmith.com/video-editor.html. Accessed: 2021-06-02.
  • Carney, M. et al. 2020. Teachable machine: Approachable web-based tool for exploring machine learning classification. Conference on Human Factors in Computing Systems - Proceedings (2020).
  • Chang, M. et al. 2019. How to design voice based navigation for how-to videos. Conference on Human Factors in Computing Systems - Proceedings (2019).
  • Christina, C. et al. 2017. Powerpoint Controller using Speech Recognition.
  • Cohen, P.R. and Oviatt, S. 2017. Multimodal speech and pen interfaces. The Handbook of Multimodal-Multisensor Interfaces: Foundations, User Modeling, and Common Modality Combinations - Volume 1. (2017), 403–447. DOI:https://doi.org/10.1145/3015783.3015795.
  • Colasante, M. and Douglas, K. 2016. Prepare-participate-connect: Active learning with video annotation. Australasian Journal of Educational Technology. 32, 4 (Nov. 2016), 68–91. DOI:https://doi.org/10.14742/ajet.2123.
  • Cortana - Your personal productivity assistant: 2021. https://www.microsoft.com/en-us/cortana. Accessed: 2021-01-24.
  • Frame.io: 2021. https://www.frame.io/. Accessed: 2021-05-25.
  • Fraser, C.A. et al. 2020. ReMap: Lowering the barrier to help-seeking with multimodal search. UIST 2020 - Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology (2020).
  • Furui, S. 2005. 50 Years of Progress in Speech and Speaker Recognition Research. ECTI Transactions on Computer and Information Technology (ECTI-CIT). 1, 2 (Jan. 2005), 64–74. DOI:https://doi.org/10.37936/ECTI-CIT.200512.51834.
  • Gao, T. et al. 2015. Datatone: Managing ambiguity in natural language interfaces for data visualization. UIST 2015 - Proceedings of the 28th Annual ACM Symposium on User Interface Software and Technology (2015).
  • Gašević, D. et al. 2014. Analytics of the effects of video use and instruction to support reflective learning. ACM International Conference Proceeding Series (2014), 123–132.
  • Gerard, C. and Gerard, C. 2021. TensorFlow.js. Practical Machine Learning in JavaScript.
  • Goldman, R. et al. 2014. Video Research in the Learning Sciences. Routledge.
  • Juang, B.H. and Rabiner, L.R. 2004. Automatic Speech Recognition – A Brief History of the Technology Development. Elsevier Encyclopedia of Language and Linguistics. (2004).
  • Kang, R. et al. 2019. Minuet: Multimodal interaction with an internet of things. Proceedings - SUI 2019: ACM Conference on Spatial User Interaction (2019).
  • Kipp, M. 2012. Multimedia Annotation, Querying, and Analysis in Anvil. Multimedia Information Extraction: Advances in Video, Audio, and Imagery Analysis for Search, Data Mining, Surveillance, and Authoring. (Aug. 2012), 351–367. DOI:https://doi.org/10.1002/9781118219546.CH21.
  • Laput, G. et al. 2013. PixelTone: A multimodal interface for image editing. Conference on Human Factors in Computing Systems - Proceedings (2013).
  • Lemon, N. et al. 2013. Video annotation for collaborative connections to learning: Case studies from an Australian higher education context. Cutting-Edge Technologies in Higher Education. 6, PARTF (2013), 181–214. DOI:https://doi.org/10.1108/S2044-9968(2013)000006F010.
  • López, G. et al. 2018. Alexa vs. Siri vs. Cortana vs. Google Assistant: A Comparison of Speech-Based Natural User Interfaces. Advances in Intelligent Systems and Computing (2018).
  • McShane, M. 2017. Natural language understanding (NLU, not NLP) in cognitive systems. AI Magazine. (2017). DOI:https://doi.org/10.1609/aimag.v38i4.2745.
  • Mehler, B. et al. 2016. Multi-modal assessment of on-road demand of voice and manual phone calling and voice navigation entry across two embedded vehicle systems. Ergonomics. (2016). DOI:https://doi.org/10.1080/00140139.2015.1081412.
  • Mitrevski, M. and Mitrevski, M. 2018. Getting Started with Wit.ai. Developing Conversational Interfaces for iOS.
  • ml5js·Friendly Machine Learning For The Web: 2021. https://ml5js.org/. Accessed: 2021-01-25.
  • MotionNotes: 2019. https://motion-notes.di.fct.unl.pt/. Accessed: 2020-05-16.
  • Mouza, C. and Lavigne, N.C. 2013. Introduction to Emerging Technologies for the Classroom: A Learning Sciences Perspective. Emerging Technologies for the Classroom: A Learning Sciences Perspective. (Jan. 2013), 1–12. DOI:https://doi.org/10.1007/978-1-4614-4696-5_1.
  • Oviatt, S. and Cohen, P. 2000. Perceptual user interfaces: multimodal interfaces that process what comes naturally. Communications of the ACM. (2000). DOI:https://doi.org/10.1145/330534.330538.
  • Pardo, A. et al. 2015. Identifying learning strategies associated with active use of video annotation software. ACM International Conference Proceeding Series (Mar. 2015), 255–259.
  • Qattous, H. et al. 2016. Teachme, A Gesture Recognition System with Customization Feature.
  • Radziwill, N. and Benton, M. 2017. Evaluating quality of chatbots and intelligent conversational agents. arXiv.
  • El Raheb, K. et al. 2017. BalOnSe: Temporal aspects of dance movement and its ontological representation. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2017).
  • Ribeiro, C. et al. 2016. 3D annotation in contemporary dance: Enhancing the creation-tool video annotator. ACM International Conference Proceeding Series (2016).
  • Rich, P.J. and Hannafin, M. 2009. Video Annotation Tools Technologies to Scaffold, Structure, and Transform Teacher Reflection. (2009). DOI:https://doi.org/10.1177/0022487108328486.
  • Risko, E.F. et al. 2013. The collaborative lecture annotation system (CLAS): A new TOOL for distributed learning. IEEE Transactions on Learning Technologies. 6, 1 (2013), 4–13. DOI:https://doi.org/10.1109/TLT.2012.15.
  • Rodrigues, R. et al. 2019. Multimodal Web Based Video Annotator with Real-Time Human Pose Estimation. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 11872 LNCS, (2019), 23–30. DOI:https://doi.org/10.1007/978-3-030-33617-2_3.
  • Singh, A.K. et al. 2020. Voice Controlled Media Player: A Use Case to Demonstrate an On-premise Speech Command Recognition System. Communications in Computer and Information Science. 1209 CCIS, (2020), 186–197. DOI:https://doi.org/10.1007/978-981-15-4828-4_16.
  • Singh, V. et al. 2011. The choreographer's notebook-a video annotation system for dancers and choreographers. C and C 2011 - Proceedings of the 8th ACM Conference on Creativity and Cognition (2011).
  • Siri - Apple: 2021. https://www.apple.com/siri/. Accessed: 2021-01-24.
  • Smilkov, D. et al. 2019. Tensorflow.JS: Machine learning for the web and beyond. arXiv.
  • de Sousa, L. et al. 2017. The effect of multimedia use on the teaching and learning of Social Sciences at tertiary level: a case study. Yesterday and Today. 17 (2017), 1–22. DOI:https://doi.org/10.17159/2223-0386/2017/n17a1.
  • Srinivasan, A. and Stasko, J. 2018. Orko: Facilitating Multimodal Interaction for Visual Exploration and Analysis of Networks. IEEE Transactions on Visualization and Computer Graphics. 24, 1 (2018), 511–521. DOI:https://doi.org/10.1109/TVCG.2017.2745219.
  • Stevens, R. et al. 2002. VideoTraces. (2002).
  • Tang, D. et al. 2017. Effectiveness of Audio-Visual Aids in Teaching Lower Secondary Science in a Rural Secondary School. Asia Pacific Journal of Educators and Education. 32, (2017), 91–106. DOI:https://doi.org/10.21315/apjee2017.32.7.
  • Turk, M. 2014. Multimodal interaction: A review. Pattern Recognition Letters.
  • Vimeo: 2021. https://vimeo.com/features/video-collaboration. Accessed: 2021-06-05.
  • Wipster | Review Software: 2021. https://wipster.io/. Accessed: 2021-06-15.
  • Wittenburg, P. et al. 2006. ELAN: A professional framework for multimodality research. Proceedings of the 5th International Conference on Language Resources and Evaluation, LREC 2006 (2006).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

MUM 2021, December 5–8, 2021, Leuven, Belgium

© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8643-2/21/08…$15.00.
DOI: https://doi.org/10.1145/3490632.3490672