
Generalisable Dialogue-based Approach for Active Learning of Activities of Daily Living

Published: 11 September 2023

Abstract

While Human Activity Recognition systems may benefit from Active Learning by allowing users to self-annotate their Activities of Daily Living (ADLs), many proposed methods for collecting such annotations are for short-term data collection campaigns for specific datasets. We present a reusable dialogue-based approach to user interaction for active learning in activity recognition systems, which utilises semantic similarity measures and a dataset of natural language descriptions of common activities (which we make publicly available). Our approach involves system-initiated dialogue, including follow-up questions to reduce ambiguity in user responses where appropriate. We apply this approach to two active learning scenarios: (i) using an existing CASAS dataset, demonstrating long-term usage; and (ii) using an online activity recognition system, which tackles the issue of online segmentation and labelling. We demonstrate our work in context, in which a natural language interface provides knowledge that can help interpret other multi-modal sensor data. We provide results highlighting the potential of our dialogue- and semantic similarity-based approach. We evaluate our work: (i) quantitatively, as an efficient way to seek users’ input for active learning of ADLs; and (ii) qualitatively, through a user study in which users were asked to compare our approach and an established method. Results show the potential of our approach as a hands-free interface for annotation of sensor data as part of an active learning system. We provide insights into the challenges of active learning for activity recognition under real-world conditions and identify potential ways to address them.

1 Introduction

Accurate Human Activity Recognition (HAR) is often essential in enabling intelligent and autonomous services in Ambient Intelligence (AmI) domains such as smart homes and Ambient Assisted Living (AAL) environments. Many approaches have focused on supervised learning, to train models that can then be used to predict the activities of the user. However, such approaches cannot cope with long-term changes in user behaviour, physical environment, or sensors, all of which commonly occur when these systems are applied in real-world settings.
Supervised methods alone are difficult to apply at scale, since gathering and labelling training data can be a costly process. A key challenge is how to collect labels conveniently while maintaining accurate labelling.
These issues may be partly addressed with Human-in-the-Loop (HITL) annotation, where users annotate their own training data through Active Learning (AL) [14]. The appeal of HITL annotation for activity recognition is clear, as the user is usually the most knowledgeable source of information about their own activities. However, the reliability of their annotations may degrade when users are asked to annotate retrospectively, due to the effect of “recall bias.” Existing research on memory-based surveys has shown that individuals may be less able to accurately recall events that are routine or mundane [5, 56]; naturally, this might include activities of daily life.
In active learning approaches, there is the question of how to query the user. To date, there has been a tendency to lean toward graphical interfaces, especially smartphone-based apps as mobile technology has advanced (such as in Reference [36]). There are obvious drawbacks to a graphical interface as a primary mode of labelling, particularly: (i) the user must have access to the interface at all times, e.g., they must carry their phone to be able to provide labels; and (ii) it is not convenient to use while performing an activity (remove the phone from pocket, open app, etc.).
Natural language-based annotation is an alternative, enabling hands-free annotation. However, existing approaches tend not to be integrated with an active learning mechanism and instead seek to have the user log and annotate every activity undertaken, rather than query only a valuable subset, as is the case with “Vocal-Diary” by Hoque et al. [22]. Approaches also tend to be rigid in that they require an exact match between a user utterance and one of a set of possible labels, lacking any tangible dialogue model, as in Reference [52].
Crucially, existing approaches are typically designed to be in service during data collection campaigns, for the gathering of training data. They are not designed for long-term active learning.
In this article, we present our approach to natural language-based data annotation for long-term application in an active learning activity recognition system. This vision is illustrated in Figure 1. We combine a dialogue system with a knowledge base of semantic descriptions of a range of common Activities of Daily Living (ADLs) and a semantic similarity measure comparing vectorised user-provided labels to sample word vectors describing each ADL. This enables conversational annotation of samples chosen for labelling in an active learning process. We envision that these conversations can be carried out by an artificial agent (e.g., a voice assistant) or an embodied agent (e.g., a companion robot) in the home environment, such that the user can provide a label with relative ease. In this case, we assume that activity recognition is being performed by online classifiers using data from a variety of sensors and information sources around the home.
Fig. 1.
Fig. 1. Our vision for dialogue-based annotation of daily activities in the home. Here, the user is interacting with an agent, i.e., a Toyota Human Support Robot (HSR) in the figure, who has been prompted to query the user based on information from the activity recognition system. Shown are three activity recognition models ( \(\theta _1\) , \(\theta _2\) , \(\theta _3\) ) each predicting a possible label (from set of labels \(\mathcal {L}\) ) for the current activity.
There are additional challenges for labelling activities using speech, which are not addressed in existing work, but which we try to address in our approach. First, by using active learning, we are able to examine the issue of how frequently we allow the system to interrupt the user. In contrast to existing speech-based annotation methods [32, 52], we seek to maximise comprehension while offering a more diverse range of labels by allowing the user to label in their own words and have it matched automatically to the relevant ADL label.
Our knowledge base includes possible descriptions of ADLs, daily routines, and answers to potential questions about a user’s activities when posed by a virtual agent. This dataset is a combination of common knowledge answers, which has been expanded with responses from survey participants. We have used this data in both the design and validation of our approach.
Our dialogue system and natural language pipeline are used together to extract an activity label matching one of those in our ADL bank, from interaction with the user. An activity recognition system can initiate a dialogue by sending a request to our dialogue system at any time, ideally including as arguments the most probable ADL labels, which allows for more natural dialogue when querying in certain circumstances, e.g., when a model predicts two options as (almost) equally likely. We also allow the user to manually add ADL training data to the system by starting and stopping sample recording using natural language commands, and subsequently label that sample using the labelling pipeline.
Our approach allows class labels for any given dataset/model to be related to the bank of common ADLs through a simple mapping mechanism described in this article, which enables re-use of the common dialogue system.
This article is an extension of our previous work in Reference [48]. In that paper, we applied our approach to a popular CASAS dataset [15],1 creating a committee of learners (models) and applying a Query by Committee selection strategy (maximum disagreement via Kullback-Leibler divergence [44]) to select samples for labelling. This scenario allowed us to consider active learning with the dialogue system over the long-term. We had evaluated our approach on existing datasets, but we had not explored the challenges related to online active learning in real settings.
We include this prior work here, as one example of our approach being applied in practice. We now also apply our approach to an online activity recognition system in our testbed, the Robotic-assisted Living Testbed (RALT) at Heriot-Watt University,2 where we utilise a hybrid activity recognition approach combining some domain knowledge and a Markov Logic Network (MLN) model. In this case, we deal with the practicalities of having to segment the stream of sensor data and generate queries while the user is still performing an activity.
In this article, we detail the interface of our dialogue system and how to use it. We hope that this will encourage others to re-use aspects of our approach and to create other convenient interfaces for active learning that are likewise re-usable.
For clarity, our original paper included: (i) a “natural language descriptors of ADLs” dataset, which contains a bank of English language descriptors for 33 common ADLs; (ii) a method of extracting and grounding utterances to the bank of ADL descriptors using semantic similarity measures; (iii) a reusable dialogue system that enables the application of (ii) to a given activity recognition active learning scenario; (iv) an example application of the dialogue system to a simulated active learning scenario using an existing dataset, where queries are generated based on disagreement between a committee of learners; and (v) a user study evaluation comparing our proposed method to an existing method.
In addition to the existing material, this article contributes the following: (i) a more in-depth review of the related work, now including a discussion of online segmentation of activities, which is essential to enabling accurate labelling; (ii) expansion of the dialogue system that affords users the ability to use the dialogue system to manually add and label new training data, thus enabling the user to build an activity recognition model “from scratch”; and (iii) a new example application of the dialogue system, this time applied to a single-learner online MLN-based activity recognition system that generates queries based on time-series segmentation. Here, we can demonstrate the dialogue system as a tool to aid the user in manually adding training data and labels, and can explore more the implications of using our approach in a real-world use case.
We present an approach that is novel in its use of natural language as an annotation method, the use of semantic descriptions to relate class labels with user labels, and the use of a dialogue system to enable more intelligent label extraction. We also provide insights more broadly into the practical application of active learning for activity recognition, and on how to enable short- and long-term active learning in the home to minimise the need for supervised learning.
The rest of this article is structured as follows: Section 2 provides background on segmentation and active learning in activity recognition, and the use of natural language for annotation; Section 3 details the problem we aim to solve with our dialogue system and introduces necessary symbols; Section 4 details our approach to solving the problem and provides some technical evaluation; Section 5 briefly discusses the generality of our approach and introduces two use cases; Sections 6 and 7 detail those two use cases applying our interface to different active learning scenarios, the first in a simulated scenario demonstrating long-term learning and the second an online scenario with close-to real-time activity recognition—methodology and evaluation are included in each; Section 8 provides end-user evaluation of our approach compared to an existing method; Section 9 discusses the practicalities of active learning for activity recognition in the home, considering limitations of our work and avenues for future research; Section 10 considers the societal impact of our work; and finally, Section 11 summarises our contributions and findings.

2 Related Work

In this section, we consider the issue of online segmentation in activity recognition before discussing state-of-the-art active learning approaches, with an emphasis on those relying on dialogue and/or natural language interfaces.

2.1 Online Segmentation

Data segmentation is an important prerequisite of online activity recognition. The segmentation function is responsible for selecting the data that is used to make predictions, which are then used in query selection. Furthermore, the labels extracted from the user are ultimately applied to the segments selected by this same function. Therefore, segmentation directly impacts the active learning process. Inaccurate segmentation, which includes data not relevant to the activity labelled by the user, can add noise to the training data for individual classes and may lead to the degradation of the model accuracy. There is a sizeable body of work on the topic of offline segmentation, wherein algorithms are applied to already labelled historical data; however, this is not relevant to real-time activity recognition, and so we focus only on online approaches.
In many approaches to activity recognition, it is common to see “fixed window” approaches to segmentation, where a fixed length of window (either a fixed number of sensor events or a fixed period into the past) is used to make predictions [29, 53, 54, 60]. The drawback of these approaches is obvious: activities vary in length, and samples within a class likewise vary in event composition and duration.
An improvement on this is “dynamic windows,” which are windows that vary in size and the number of past events they capture based on some heuristic or intelligent process. They may consider activity duration [39, 50], sensor state variations [40], or contextual factors such as location [20]. Most state-of-the-art approaches combine dynamic windows with other measures such as weights to improve the relevancy of sensor events within the window before making a prediction.
In USMART, events are added to a continuous segment so long as they surpass thresholds for temporal closeness and sensor event similarity (a semantic measure that considers location, object the sensor is attached to, etc.), to give a similarity score [55]. For example, “Fridge” is similar to “Freezer,” but dissimilar to “BathroomDoor.” If the thresholds are not met, then the segment is finalised and a new one begins.
Krishnan and Cook detail in Reference [28] a segmentation approach where window-based segmentation is used as a baseline, with each segment containing an equal number of sensor events. They then apply measures that alter the weights of the events in the segment. This includes time-based measures, where a temporally distant event in the past could have a lower weight than recent events that are closer together. They also consider sensor overlap, by calculating the correlation between sensor events offline from observation of the data. The calculated correlations are used during segmentation to check the relevance of sequential events to each other.
We see in newNECTAR a similar approach, with rules put in place to determine when a segment is finalised [12]. In this case, the authors apply a number of semantic rules that consider time constraints, the objects interacted with (i.e., if the user stops interacting with the objects in the segment), and change of event location.
In subsequent work, POLARIS, the authors combine probabilistic and knowledge-based measures to perform segmentation [13]. Expanding on those in newNECTAR, there are measures for “change of context.” An ontology is used to correlate the latest sensor event to the previous one, and “consistency likelihood” is calculated based on whether the events in the ongoing segment correlate to the possible activities.
In our online activity recognition approach, we do not use fixed windows and instead consider a change of location and the relevancy of the sensor event to the prevailing activity prediction for that segment. As will be discussed further later, the use of active learning allows the system to avoid committing to a single segmentation decision right away, instead allowing the user response to influence the segmentation.

2.2 Active Learning

Important questions to address in an active learning approach are: (i) What is the query strategy employed? and (ii) What is the mechanism of gathering labels from users? We define a query strategy as any process used to determine which data points in sequential data should be selected for labelling by the user.
There are existing examples of effective HITL annotation applied to HAR in the literature, such as by Stikic et al. [49], Liu et al. [31], and Miu et al. [36]. Stikic et al. evaluate a “least confident” query selection approach, in which the confidence of a single classifier is used to select a sample for labelling [49]. Liu et al. apply active learning to “bootstrap” HAR models for new users, with the help of a mobile annotation app. This alleviates some of the burden of user-driven annotation, by dramatically reducing the number of samples to be annotated. Specifically, only those samples found to be of high uncertainty are selected, through some query selection process such as Query by Committee (QbC) [59].
The same authors also demonstrate a QbC approach using two classifiers, where samples are selected only from those with disagreeing labels. The method used to extract the labels from users is not specified.
In addition to the least confident method, Alemdar et al. also consider: (i) “margin sampling,” where they select “instances [where] the difference between the most and the second most probable labels is minimum”; and (ii) “expected entropy” [1], defined by Bagaveyev and Cook as selecting “points to label on the basis of the highest information gain that is expected to result from the selection of a data point from an unlabelled data pool, irrespective of the label that will be assigned to the data point” [3]. The authors found that all three methods outperformed random selection of samples for annotation across several testbed homes.
Expected entropy and QbC (based on random forests) are also investigated in Reference [3]. Both methods are shown to reach peak accuracy before random selection strategies, although they may be temporarily disadvantaged early on.
While the label extraction method is not identified in Reference [1], it is noted in Reference [3] that annotation is carried out through a web interface that shows the annotator information such as time, location, and power usage.
Many approaches building on these early implementations of active learning for HAR take advantage of (unsupervised) clustering of some form [18, 23, 24, 25, 37], generally based on k-means. Clustering is often used to batch together samples that appear similar to each other, using some similarity measure, e.g., they may share similar duration, sensor events, or raw numerical data. Clustering therefore serves to enable the annotation of several samples at once by asking the user to label an entire cluster.
Few approaches focus on the extraction of labels from users or on the query strategy for the generated clusters, i.e., when and how to ask the user. Zheo et al. also used clustering, but with the specific intention of reducing labelling error when crowdsourcing labels via the Amazon Mechanical Turk platform, rather than gathering them from the end-users/targets of the classifier in question [58].
Civitarese et al. describe a collaborative transfer learning system that semantically relates feedback across individual homes to an ontology representing activities, events, and objects [11]. While not addressing them directly, the authors highlight difficulties relating to: interacting with the user to gather labels, taking into consideration “the current mood of the subject and whether he/she can be currently interrupted”; and the user interface used during such interactions, giving a nod to voice-based (i.e., natural language) approaches as “particularly suitable for elderly subjects.”
It is worth noting that in existing state-of-the-art active learning approaches, such as in References [12, 13], the segment is finalised before querying the user. With our approach, we highlight how querying while an activity is taking place, i.e., when the segment to which the label will be applied is still developing, allows the response of the user to influence the segmentation itself.
Our approach utilising a dialogue system and natural language is well suited to an active learning application, and in particular to taking advantage of an informed query strategy to initiate dialogues with the user only when necessary. One of the key aims of our approach is ultimately to reduce the burden on the user, by providing an interaction mechanism that requires little initiative from the user and is straightforward to use even while executing an activity.

2.3 Natural Language Annotation

End-user annotation of activities to gather training data is demonstrated by Kasteren et al. in Reference [52], in which users say aloud one of a fixed set of commands to denote the start and end of an activity, e.g., “‘begin use toilet’ and ‘begin take shower’.” In this case, the users annotate smart home sensor data and the system requires an exact match between the processed speech and the accepted labels. A similar approach is adopted by Singh et al. in Reference [47], though the authors do not specify the mechanism used to match the user’s speech with the accepted labels.
Most recently, Mairittha et al. [32] implemented a dialogue system for user-driven annotation of a fixed set of activities identifiable with accelerometer data (e.g., walking, cycling, and sitting). Our approach likewise implements a dialogue system to annotate activity data, but within a smart home context, and with a semantic similarity element that allows the mapping of a common set of activities (ADLs) to any number of target classification approaches.
Grounding remains an active topic in Natural Language Understanding (NLU) research and so it is important to give context to specific challenges that may be faced in the home and in relation to ADLs. Vacher et al. detail how the “Sweet-Home” project implements NLU in a smart home context to process simple (pre-defined) commands, e.g., “turn on the light” [51]. The system aims to detect fixed phrases, but there is novelty in the use of context (e.g., location) to ground commands to the correct target, i.e., extending the previous example: to select the correct light.
Jurafsky and Martin introduce the Generalised Grounding Graph ( \(G^3\) ) framework in Reference [26], which grounds natural language expressions to a probabilistic grounding graph, given appropriate semantic models of the environment. The framework aims to “find the most probable groundings \(\gamma _1\ldots \gamma _N\) given a parsed natural language command \(\Lambda\) and the robot’s model of the environment, \(M\) ...” Knepper et al. utilise the framework in reverse (inverse semantics) for the use case of Natural Language Generation (NLG), in this instance, to generate natural requests for help from a utility robot [27]. In doing so, they demonstrate how the grounding graph can be used to derive appropriate context for the generation of queries. Misra et al. likewise utilise a (directed) graph to ground natural language [35]. In contrast, our approach is less rigid than graph-based approaches: while we do provide a fixed set of ADLs, we use a data-driven approach to ground new ADL labels through semantic similarity of word vectors, and both the set of ADLs and the bank of descriptors can be easily expanded.
Alomari et al. demonstrate language-based learning about faces, objects, and colours in Reference [2]. The approach uses natural language descriptions of video frames gathered from Amazon Mechanical Turk: grammar trees are generated using the Natural Language Toolkit (NLTK), from which verbs, nouns and adjectives can be extracted, and then grounded based on correlation between unique words and concepts in each frame of video footage.
The earliest approaches (such as by Ho Lee et al. [19]) to semantic similarity relied simply on edge-counting, in which the distance between two concepts in a directed graph was assumed to be a valid measure of their similarity. The purpose of semantic similarity as it applies to one or more concepts in a knowledge base, or taxonomy, is specified by Resnik et al. as a measure of “the extent to which they share information” [43]. Resnik et al. detail an information theory-based approach, in which class relationships (subsumption) dictate the information content of individual nodes, to ultimately find a common parent to the given concepts. More recently, with the prevalence of tools that can transform words into vector representations (such as “Word2Vec” [34]), it has become possible to compare similarities in the vector space. Natural language tools, such as SpaCy [21], now often incorporate such tools to provide neural network-based semantic similarity measures trained on large datasets. Our approach takes advantage of word embedding to assess similarities between new user-provided labels, and a bank of existing natural language samples that have already been assigned a label.

3 Problem Definition and Symbols

Activity Recognition. In a general activity recognition system that uses active learning to improve performance, we have: sensor event data \(E\) , a learner \(\theta\) , and a label set \(\mathcal {L} = \lbrace \mathcal {L}_1,\ldots , \mathcal {L}_m\rbrace\) , where \(m\) is the number of potential labels. At every time step \(t\) , the learner takes sensor events \(E_t \in E\) as input, to generate a confidence-ranked list of probabilities across \(\mathcal {L}\) , i.e., \(P_t = P(\mathcal {L}_1\ldots \mathcal {L}_m|E_t)\) , where \(\sum ^m_{i=1} P(\mathcal {L}_i|E_t) \le 1\) .3
Data. If \(E\) represents all sensor data seen by the system, then in an online active learning system only the training data is known, which we will call \(E_K\) , while all other data points encountered while online are unknown, and will be called \(E_U\) . Data can only move from \(E_U\) to \(E_K\) when it has been labelled by a suitable teacher, i.e., a human or an appropriately intelligent artificial agent.
Query Selection. At every time step \(t\) , it must be decided: (i) whether to request that the dialogue system issue a query, which generally involves analysing values from \(\mathbb {P}: \lbrace P_0,\ldots , P_t\rbrace\) via some function \(s : \mathbb {P} \rightarrow \lbrace 0, 1\rbrace\) ; and (ii) if querying, which portion of the data the label should be applied to. All data selected for labelling is known as \(E_{C}\) . The specifics of the query selection process depend on factors specific to the activity recognition approach, as we will demonstrate in the methodology for our two different use cases, where further notation is introduced as appropriate.
Dialogue System. The problem then is to design a dialogue system that can: (i) generate appropriate queries based on \(E_{C}\) , from a set of possible queries \(Q\) ; and (ii) ground the user responses to grounding variables \(\Gamma = \lbrace \gamma _1\ldots \gamma _j\rbrace\) , where each grounding \(\gamma\) represents an ADL.
For each sample selected for labelling \(e_{C} \in E_{C}\) , the system must generate some query \(q \in Q\) and should receive some response \(\lambda\) . From that response, the system must find the most probable ADL grounding:
\begin{equation} \underset{\gamma _j}{\arg \max } \: P(\gamma _1\ldots \gamma _j | \lambda). \end{equation}
(1)
If the label set of the model is not equal to the set of possible ADL groundings, then they are mapped through some function \(f : \mathcal {L} \mapsto \Gamma\) , where the label ultimately applied to \(E_C\) is known as \(\mathcal {L}_t\) .

4 Dialogue System

Our approach is to provide a natural language interface through which users can give simple commands and provide ADL annotations to an activity recognition system. First, we present the dialogue system as a standalone component that can be re-used with different activity recognition systems. The dialogue system supports system-initiated interactions for active learning, and user-initiated interactions for manual training.
Using the dialogue system, users can be queried about their activities. From responses to those queries, user-provided labels can be extracted and assessed for semantic similarity against a provided set of natural language descriptors of common ADLs, and related to the target dataset through a simple mapping function. The conceptual architecture of our system is shown in Figure 2.
Fig. 2.
Fig. 2. Architectural overview. Purple blocks are software components, yellow circles are models and data structures, and black blocks are components external to the system. It is assumed that the agent/device (green) interacts with the human user. This architecture represents a standalone dialogue system that can integrate with an external activity recognition system to provide an active learning interface. Aspects of this architecture are elaborated further in the text.
In summary, the architecture of the dialogue system contains all of the components required to interact with the user (Natural Language Understanding and Generation) as well as components required to handle system-initiated active learning and user-driven commands. While the primary purpose is to initiate dialogues, users can give some commands to the activity recognition system. For learning purposes, the user may wish to manually add a segment of training data, denoting the start and end of the activity, while still utilising our pipeline for ADL labelling.

4.1 ADL Descriptors Dataset

As part of our approach, we created and utilised a dataset containing a set of natural language descriptions of 33 common ADLs (shown in Figure 4). We collected responses via a multi-stage online survey (via Qualtrics XM) in which respondents were asked to respond to questions as if they were being asked by a virtual assistant in their home. The data collection is described briefly here, and a copy of the survey is preserved on our website for those particularly interested in the data collection.4
Fig. 3.
Fig. 3. Architecture of a Query by Committee approach to active learning for activity recognition. In this architecture, a committee of learners (models) provide simultaneous predictions upon the data. Query selection is then performed based on whether the learners agree and to what extent. Labels are applied to the discrete samples used for prediction. Purple blocks are software components, yellow circles are models and data structures, black blocks are components external to the system, and red diamonds are decision points.
Fig. 4.
Fig. 4. Mapping of model labels for CASAS (in deep blue) to the common ADLs (purple) that appear in our dataset. Note that the “Enter Home” class included in the model is not shown, since it occurs infrequently and in our experiments was mapped back to the “Other” class.
In the first stage of the survey, we gathered ADL descriptions by presenting a series of activity labels (e.g., “relaxing”) and asked respondents to describe each activity in their own language. We allowed for and encouraged multiple descriptions for each activity.
In the next stage, we introduced the notion of a virtual/embodied agent asking specific questions to the user during daily life. This is to understand better how users might structure their language when answering such questions. Here, we presented scenario questions with examples of agent speech, such as: “You are eating lunch at the dining table, while a television is on, when you are asked: ‘I’m having trouble working out exactly what you are doing just now, but I think you are either watching TV or eating lunch? Can you confirm which is correct?’
In addition to our own responses, we collected valid responses from eleven individuals. We removed duplicates and sanitised responses to remove any information that may personally identify a participant, leaving a total of 390 descriptors over 33 common ADLs (including “other”). We have made this publicly available.5
We have provided common sense coverage of activities, but have also ensured that we capture those featured in the most popular activity recognition datasets (e.g., CASAS [15] and Kasteren [52]), and those specified by the Katz Index [46]—with the exception of “transferring,” which is an implied activity in sequential activity recognition, and “continence,” which is not usually gleaned from activity recognition.
The dataset covers general high-level activities such as “relaxing” and “working,” as well as more granular activities such as “brushing teeth” and “taking medication.” There are no verbatim duplicates, but similar phrases with minor variations of wording are included. An extract from the dataset is shown in Table 1.
Table 1.
ADL      | Descriptors
relaxing | relaxing, sitting doing nothing, chilling, ...
working  | doing work, working at my desk, doing paperwork, ...
studying | studying, preparing for an exam, doing homework, ...
...      | ...
Table 1. Example of Natural Language Descriptors of ADLs Extracted from Our Dataset
Each ADL has an array of sample descriptors. Key descriptors are highlighted in bold. For a complete list of ADLs in our dataset, see Figure 4.

4.2 Generality and Reusability

We focus on a generalisable approach for matching user-provided labels with one of our common ADLs from our dataset. We utilise a SpaCy [21] English language model and semantic similarity tools, primed with a pool of common ADLs (our grounding variables \(\Gamma\) ) and associated natural language samples. In the results section (Section 4.4), we describe how we generated and gathered samples for evaluation.
To apply our approach to an existing activity recognition solution, each label in our set of ADLs must be mapped to equivalent classes of the classifier model. ADLs not featured in the classifier model can simply be mapped to the “other” ADL class, and filtered from the annotated data if necessary. This mapping is done via a text file, where each line is a pair (common_ADL:class_label).
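As an illustration of how such a mapping file could be parsed and applied, here is a minimal sketch; the file name, helper names, and fallback behaviour are our assumptions, not part of the released interface:

```python
# Illustrative sketch: parse a "common_ADL:class_label" mapping file and
# apply it to an ADL label produced by the dialogue system.
def load_adl_mapping(path="adl_mapping.txt"):
    mapping = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            common_adl, class_label = line.split(":", 1)
            mapping[common_adl.strip()] = class_label.strip()
    return mapping

def to_class_label(common_adl, mapping):
    # ADLs not featured in the classifier model fall back to the "other" class
    return mapping.get(common_adl, mapping.get("other", "Other"))

# Example usage (hypothetical labels):
#   mapping = load_adl_mapping()
#   to_class_label("washing dishes", mapping)
```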

4.3 Dialogue

Fundamental to our approach is the use of dialogue in the labelling process, rather than simple text-to-speech annotation of activities. Here, we explain how we handle queries and process user responses.

4.3.1 Queries.

When a request for a query is received, the system will determine the type of query required based on: (i) whether potential labels are supplied; and (ii) the composition of those labels. For this to work, the activity recognition software should provide in its request a “best guess” suggestion of potential labels. This could be, for example, the labels of the two most likely classes or the labels selected by each learner in an ensemble. If potential labels are not provided, then the system will issue a general query. Example initial queries based on the supplied predictions are shown in Table 2.
Table 2.
Type | \(\theta _1\) | \(\theta _2\) | \(\theta _3\) | Example
1 | \(\mathcal {L}_1\) | \(-\) | \(-\) | “Can I confirm that you are currently \(\mathcal {L}_1\) ?”
  | \(\mathcal {L}_1\) | \(\mathcal {L}_1\) | \(-\) |
  | \(\mathcal {L}_1\) | \(\mathcal {L}_1\) | \(\mathcal {L}_1\) |
2 | \(\mathcal {L}_1\) | \(\mathcal {L}_2\) | \(-\) | “It looks like you are either \(\mathcal {L}_1\) or \(\mathcal {L}_2\) , can you tell me which is correct?”
  | \(\mathcal {L}_1\) | \(\mathcal {L}_1\) | \(\mathcal {L}_2\) |
3 | \(-\) | \(-\) | \(-\) | “I’m not sure what you’re doing right now, can you tell me what you are doing?”
  | \(\mathcal {L}_1\) | \(\mathcal {L}_2\) | \(\mathcal {L}_3\) |
Table 2. Example Queries
\(\theta _{1, 2, 3}\) are potential labels sent by the activity recognition software, \(\mathcal {L}_{1,2,3}\) indicate the predicted classes where there may be \(\mathcal {L}_{0...n}\) (i.e., there are many more possible combinations), and \(-\) denotes no value was provided. It may seem as though in some of the above cases a query is not necessary (e.g., when all suggested labels are the same), but it is a matter for the activity recognition query selection process to determine when to query, and it may be that the numerical data explains the need for a query (e.g., agreement between models but generally low confidence).
In an ideal scenario, the system intends to ask no more than two questions (e.g., Type 3 and then Type 2 if necessary) in a single active learning interaction. However, if user input is not understood (e.g., nonsensical utterances or utterances containing unrelated text), then the system will re-try the same question but prefaced with speech such as “I’m sorry, I didn’t understand what you said,” up to an adjustable maximum number of retries. The user can dismiss the system by registering the dismiss intent, which is triggered by phrases such as “Stop asking me questions,” although this only terminates the current line of questioning and does not prevent the system from issuing new queries. In practice, of course, it would be simple to disable queries for a temporary period on receiving this command.
Follow-up queries occur when the user-provided description is semantically similar to two or more ADLs, in an attempt to clarify which ADL is correct. For example, referring to Table 2, an initial query of Type 3 may lead to a follow-up query of Type 2. We utilise an ambiguity threshold \({T_{amb}}\) , a floating point number \(0\ldots 1\) representing the minimum required difference between the similarity score of the first and second highest scoring ADLs for one label to be considered an outright winner.
Follow-up queries are worded to lead the user into replying with an answer that is guaranteed to register with the semantic similarity checker as an unambiguous ADL, therefore priming the user to use a specific phrasing. To achieve this, the semantic descriptors for each ADL have a designated key descriptor, which the system uses when formulating follow-up questions. If the user changes their phrasing in the follow-up response to match the key descriptor, then the ambiguity is resolved and the correct label is applied, assuming either of the provided options is correct. The key descriptors are highlighted in bold in Table 1.
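The query-type logic of Table 2 can be sketched as follows; the function name and exact wording handling are illustrative assumptions, and the real system additionally handles retries and the dismiss intent:

```python
# Illustrative sketch of choosing an initial query type from the labels
# suggested by the activity recognition software (cf. Table 2).
def build_initial_query(suggested_labels):
    # Drop missing values and deduplicate while preserving order
    distinct = list(dict.fromkeys(label for label in suggested_labels if label))
    if len(distinct) == 1:
        # Type 1: all supplied labels agree on a single class
        return f"Can I confirm that you are currently {distinct[0]}?"
    if len(distinct) == 2:
        # Type 2: two competing labels
        return (f"It looks like you are either {distinct[0]} or "
                f"{distinct[1]}, can you tell me which is correct?")
    # Type 3: no suggestions supplied, or three or more competing labels
    return ("I'm not sure what you're doing right now, "
            "can you tell me what you are doing?")

# Example: build_initial_query(["washing dishes", "cooking", None]) -> Type 2 query
```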

4.3.2 Natural Language Processing.

Entity extraction is performed using Artificial Intelligence Markup Language (AIML) [17], with base templates provided for accepted responses to each type of query and cases to handle common utterance patterns. The AIML captures basic intentions such as “confirm,” “deny,” and so on, and will separate negatively labelled entities from positive ones, e.g., in the response “I’m not doing the dishes, I’m making my lunch,” the second entity is treated as the correct label.
A user-provided label is compared against sample word vectors describing each ADL, which in this case is our own dataset introduced earlier, in Section 4.1. For the purposes of our methodology, we will refer to these descriptors simply as ADL descriptors.
A SpaCy model, en_core_web_lg, is used to compare the vectorised user description to the sample descriptors [21]. Word vectors, such as those generated by Word2Vec, extrapolate words into multi-dimensional representations that inherently encode semantic meaning [10]. Word2Vec itself is based on the idea of using learned weights in a neural network as a lookup table. In this case, the model is trained to predict the next word in a given sequence. In doing so, the model learns contextual information about which words are frequently close to each other. The model has a weight matrix representing each word as 300 floats, where the weights of the model are incidentally the word vectors. Words that are more similar to each other should be closer together within the multi-dimensional representation. We use the largest of the available English-language models, which contains a vocabulary of 685,000 unique word vectors.6 Words and phrases are compared directly with each other, provided they can be mapped to the vocabulary of the model.
The most likely matching ADL is determined as specified in Algorithm 1. Sorted scores are returned so that cases of competing potential matches can be utilised in query generation (e.g., for a follow up question of Type 2 in Table 2).
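A sketch in the spirit of Algorithm 1 is shown below, assuming SpaCy's en_core_web_lg vectors and a pre-built dictionary of ADL descriptors (cf. Table 1); the scoring rule and helper names are our assumptions and may differ from the released implementation:

```python
import spacy

nlp = spacy.load("en_core_web_lg")

def rank_adls(user_label, adl_descriptors, t_amb=0.1):
    """Return ADLs ranked by semantic similarity to the user-provided label.

    adl_descriptors: dict mapping each common ADL to a list of natural
    language descriptors. This is a sketch of Algorithm 1, not the exact
    released implementation.
    """
    user_doc = nlp(user_label)
    scores = []
    for adl, descriptors in adl_descriptors.items():
        # Score each ADL by its best-matching descriptor
        best = max(nlp(d).similarity(user_doc) for d in descriptors)
        scores.append((adl, best))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    # The match is ambiguous if the top two ADLs are within t_amb of each
    # other, in which case a follow-up question (Type 2) should be issued.
    ambiguous = len(scores) > 1 and (scores[0][1] - scores[1][1]) < t_amb
    return scores, ambiguous
```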

4.3.3 User Commands.

If the user provides a command to the system outwith the dialogue flow for query-based annotation, then the utterance is processed by a trained RASA model [4]. We allow users to manually denote the start and end of an activity to be saved as training data. For example, the user could say “I want to show you how I make breakfast,” and the dialogue system would inform the activity recognition software to begin recording training data. Likewise, if the user said something like “I am done teaching you,” then the recording would end. After ending a recording, a query will be automatically issued to ask the user to label the new data if they have not already provided one.
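A hedged sketch of how such commands might be dispatched once RASA has classified the intent; the intent names and recorder interface are illustrative assumptions rather than our released API:

```python
# Illustrative dispatch of user training commands to the activity
# recognition software. Intent names and the recorder API are assumptions.
class TrainingRecorder:
    def __init__(self):
        self.recording = False
        self.pending_label = None

    def handle_intent(self, intent, adl_label=None):
        if intent == "start_training_demo":   # e.g. "I want to show you how I make breakfast"
            self.recording = True
            self.pending_label = adl_label    # a label may already be named in the utterance
            return "Ok, show me. Tell me when you are done."
        if intent == "stop_training_demo":    # e.g. "I am done teaching you"
            self.recording = False
            if self.pending_label is None:
                # Fall back to the standard labelling query if no label was given
                return "Thanks. Can you tell me what activity you just showed me?"
            return f"Thanks, I have saved that as {self.pending_label}."
        return None  # not a training-related command
```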

4.4 Experimental Evaluation and Discussion

We evaluate the usefulness of our natural language descriptors dataset and our semantic similarity approach by splitting the descriptors into comparison vector/test vector sets and running them through the process outlined in Algorithm 1.
We split our set of ADL descriptors in a class-wise fashion such that 30% of samples were retained for testing, i.e., 30% of the samples provided for each class were used for testing, where some classes have more or fewer samples than others. With a total of 390 ADL descriptors, we have 276 vectors to which we can compare 114 test vectors. Examples of these descriptors/vectors are shown in Table 1. We test this module independently, outside of the NLP pipeline.
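A minimal sketch of the class-wise split described above; the random seed, rounding rule, and function name are assumptions for illustration only:

```python
import random

def classwise_split(adl_descriptors, test_fraction=0.3, seed=42):
    """Hold out ~30% of the descriptors of each ADL class for testing."""
    rng = random.Random(seed)
    train, test = {}, {}
    for adl, descriptors in adl_descriptors.items():
        shuffled = descriptors[:]
        rng.shuffle(shuffled)
        n_test = max(1, round(len(shuffled) * test_fraction))
        test[adl] = shuffled[:n_test]      # test vectors
        train[adl] = shuffled[n_test:]     # comparison vectors
    return train, test
```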
In the first instance, our tests show that 73.68% of test vectors were correctly matched to their corresponding common ADL. However, this number shows that it is not always safe to accept the top match straight away, and here we are able to use further dialogue to help with this problem.
To deal with ambiguity, i.e., where there are two of the common ADLs that closely match the user-provided descriptor, we introduce the threshold \(T_{amb}\) . The impact of varying \(T_{amb}\) on the total number of questions asked to the user and its potential effect on the overall performance in applying the correct label is shown in Table 3.
Table 3.
\(T_{amb}\) | Follow-up (% of time) | Theoretical Max. Acc. (% correct labels) | Follow-up Questions
0.20 | 78.07% | 90.35% | 100
0.15 | 67.54% | 89.47% | 88
0.10 | 49.12% | 89.47% | 66
0.05 | 27.19% | 87.71% | 35
Table 3. Validation of Semantic Similarity Approach with Varying \(T_{amb}\) Values, Showing the Percentage of the Time in Which Follow-up Questions Would Have Been Posed to the User and the Theoretical Maximum Labelling Accuracy Achievable Using the System
The total number of test vectors is 114.
Considering that 73.68% of vectors were matched correctly with no ambiguity, we can see that a higher tolerance for acceptable ambiguity (lower \(T_{amb}\) ) leads to significantly more follow-up questions. The theoretical maximum accuracy value represents the sum of all initiated dialogues that: (i) were resolved in the first round, and (ii) presented the correct ADL label in a follow-up question as an option for the user to choose. The remaining cases are those where the system would fail to match a user descriptor of an ADL to one in the set of ADLs. Some of these cases could be handled through further stages of dialogue, while others are the result of a high confidence label being incorrectly applied in the first round.
The results suggest it is possible to have a low ambiguity threshold without significant impact on the results, with a threshold as low as 0.05 still providing a significant uplift in the maximum overall labelling accuracy over the baseline (73.68%) while only requiring a follow-up question less than a third of the time (30.7%). In reality, we should assume that the success rate of follow-up questions may be lower, given that some users may persist with ambiguous phrasing even after being primed.
These results indicate that limiting the labelling interaction to a single back-and-forth query-answer between the system and the user would restrict the usefulness of natural language for labelling. While we have used follow-up questions here, this opens the door to more involved dialogues that allow the system to interrogate the user further when necessary.

5 Applications to Active Learning Scenarios

In Sections 6 and 7, we demonstrate the generality and flexibility of our dialogue-based user interface for active learning. Given two different example activity recognition systems and query selection strategies, we use the same common interface for the dialogue system and highlight the practicalities of applying our approach.
In the first use case, we see a data-driven approach that performs query selection based on disagreement between a committee of learners. This emphasises how active learning and our user interface can be added to existing data-driven activity recognition systems.
In the second case, we consider a single learner that instead performs query selection based on different segmentations of the incoming data. Using a hybrid approach to activity recognition (combining parts of data- and knowledge-driven approaches), we demonstrate our interface used in online conditions. In doing so, we also introduce a novel method for query selection that takes advantage of online segmentation. This method is applicable to active learning more broadly, and is not specific to our dialogue-based interface.

6 Scenario 1: Committee of Learners

Here, we consider an activity recognition system that utilises a committee of learners to enable active learning, as depicted in Figure 3. We apply a Query by Committee active learning approach to those learners as a means of selecting samples of sensor data for labelling and to provide suggested labels to the dialogue system. We then utilise an existing CASAS dataset [15], to simulate long-term active learning. We compare our approach, using simulated human responses backed by our descriptors dataset, to an oracle.

6.1 Problem Definition and Symbols

In addition to the symbols introduced in Section 3, the following notation is relevant for this section.
Committee of Learners. Consider then a committee of \(n\) learners \(\mathcal {C} = \lbrace \theta _1,\ldots , \theta _n\rbrace\) , and for each learner there is prediction \(P^\theta _t = P^\theta (\mathcal {L}_1, \ldots ,\mathcal {L}_m|E_t)\) , where \(\sum ^m_{i=1} P^\theta (\mathcal {L}_i|E_t) = 1\) .
Learner Disagreement. Kullback-Leibler divergence \(D_{KL}\) is used to calculate the divergence between the learners. \(D_{KL}\) is a measure of divergence \(||\) between two probability distributions \(P_{A}\) and \(P_{B}\) , in discrete form:
\begin{equation} D_{KL}(P_{A}||P_{B}) = \sum _{i} P_{A}(i) \log \left(\frac{P_{A}(i)}{P_{B}(i)}\right). \end{equation}
(2)
In the case of two or more probability distributions, divergence of each \(P\) can be calculated by comparison to the mean, which we will call the consensus probability \(C_{prob}\) . With three learners:
\begin{equation} C_{prob} = \bar{x} (P^{\theta _1}_t, P^{\theta _2}_t, P^{\theta _3}_t). \end{equation}
(3)
Maximum disagreement \(MD\) can then be taken as the maximum of \(D_{KL}(P^{\theta }_t \, || \, C_{prob})\) calculated across the learners. Samples that produce a high \(MD\) score are selected for labelling and are known as \(E_{C}\) .
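A sketch of this disagreement computation, following Equations (2) and (3), is shown below; it assumes each learner returns a probability vector over the same label set, and the small epsilon guarding against zero probabilities is our addition (Algorithms 2 and 3 may differ in detail):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence D_KL(P || Q), as in Equation (2)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def max_disagreement(committee_probs):
    """Maximum KL divergence of any learner from the consensus (mean) distribution."""
    committee_probs = np.asarray(committee_probs, dtype=float)
    consensus = committee_probs.mean(axis=0)   # C_prob, Equation (3)
    return max(kl_divergence(p, consensus) for p in committee_probs)

# A query is requested for the current sample when
# max_disagreement([P_theta1, P_theta2, P_theta3]) exceeds T_MD.
```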

6.2 CASAS Dataset

We utilised the CASAS “Kyoto” dataset,7 which contains data from a variety of sensors around the home, to emulate a long-term active learning scenario and to demonstrate the usefulness of our approach in a home setting.
The dataset is composed of sensor events from around the home (pressure sensors, binary contact sensors, etc.), and provides labels for: sleeping, relaxing, working, leaving the house, bathing, cooking, eating, taking medication, transitioning from bed to toilet, or other. We use software by Liciotti et al. [30] to transform the sequential event dataset such that the events of each activity are represented by a single sample. The dataset covers 250 days and 6,737 discrete activities once transformed, an average of \(\approx\) 27 activities per day.

6.3 Models

To form a committee of learners, we utilise Scikit-learn [41] to create three models. As a demonstration of the dialogue system, the composition of model types is not particularly important, so long as they are able to provide probabilistic predictions. Here, we used the following models to form a committee: a Random Forest (RF) classifier [7] ( \(\theta _1\) ), a Bagging Classifier (BC) [6] ( \(\theta _2\) ), and a Decision Tree (DT) ( \(\theta _3\) ). The BC has a DT base estimator.
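A sketch of how such a committee could be instantiated with Scikit-learn is given below; the hyperparameters (defaults and random state) are illustrative assumptions, as the exact settings are not reported here:

```python
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def build_committee(random_state=0):
    """theta_1 = Random Forest, theta_2 = Bagging Classifier (whose default
    base estimator in scikit-learn is a decision tree), theta_3 = Decision Tree."""
    return [
        RandomForestClassifier(random_state=random_state),
        BaggingClassifier(random_state=random_state),
        DecisionTreeClassifier(random_state=random_state),
    ]

# Each model exposes predict_proba(), i.e. the class probabilities P^theta_t
# required by the Query by Committee disagreement measure.
```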

6.4 Query Selection

The goal of query selection in activity recognition, as in other domains, is to detect uncertainty. We adopt a Query by Committee approach [33, 45] as described in Algorithm 2, in which we detect moments of disagreement between several models.
Using the class probabilities for each prediction from each model, we compute maximum disagreement by estimating Kullback-Leibler divergence (KLD) [44] as detailed in Algorithm 3.
Once a maximum disagreement value is obtained, a query is triggered when the maximum disagreement threshold \(T_{MD}\) is exceeded. The choice of models, and the ensemble of learners that they form, impacts \(T_{MD}\) , as does the amount of initial training data. The more accurate the models are to begin with from the initial training data, the smaller the threshold can be, assuming a strategy such as ours in this article, where we start with some pre-labelled data.
This threshold varies based on the data and/or models used: A useful heuristic is to measure how many queries occur using the initially trained learners over a one-day period and to then adjust the threshold until the desired number of maximum queries/day is achieved (i.e., how many times is the system allowed to disturb the user per day initially).
Of course, while we have utilised a threshold to enforce our heuristic method of limiting the number of queries issued per day to the user, once the maximum disagreement value has been calculated it is possible to select any appropriate method to determine when to initiate a query based on the target application.
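The calibration heuristic described above might be sketched as follows, assuming disagreement scores from one day of data can be replayed offline; the procedure and candidate grid are our own illustrative reading of the text:

```python
def calibrate_t_md(daily_md_scores, max_queries_per_day, candidates=None):
    """Pick the lowest T_MD that keeps the number of queries triggered over a
    replayed day of maximum-disagreement scores within the allowed budget."""
    if candidates is None:
        candidates = [round(0.05 * i, 2) for i in range(1, 21)]  # 0.05 ... 1.00
    for t_md in sorted(candidates):
        queries = sum(1 for md in daily_md_scores if md > t_md)
        if queries <= max_queries_per_day:
            return t_md
    return max(candidates)

# e.g. calibrate_t_md(replayed_scores, max_queries_per_day=5)
```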

6.5 Experimental Evaluation and Discussion

In this section, we utilise the CASAS dataset to demonstrate and evaluate the use of the dialogue-based active learning mechanism. We highlight the impact of varying parameters within the system and the effect this has on the frequency of queries issued and the amount of follow-up dialogue required.

6.5.1 Preparation.

We split the dataset into training, query, and validation sets. The training and validation sets consist of \(\approx\) 5% of the dataset from the chronological start and end of the dataset, respectively. The validation set is used to validate the initial model and each AL-trained model at each re-training interval, and is never made available for annotation by the end-user. The remaining \(\approx\) 90% of samples are made available for labelling in the query set.
We mapped the class labels of the dataset to our set of common ADLs, as in Figure 4, which also shows all of the common ADLs represented in our set of descriptors.

6.5.2 Active Learning.

We re-train the learners after 25 queries. In practice, to ensure the system takes into account new annotations/evidence straight away, re-training after every query is ideal. Here, it was not practical to simulate the long-term development of the CASAS models with this strategy due to training times, and so a fixed rate is used—this does not impact our illustration of long-term active learning. After each re-training event, the models are validated on the validation set.
Baseline performance is provided by an oracle that always provides the correct answer, bypassing the dialogue system. This simply demonstrates the effectiveness of an active learning strategy upon the dataset. We set \(T_{amb}\) for these experiments to 0.1.
We compare this to simulated responses, using the same data as in Section 4.4. During run-time, the ground-truth label is used to randomly select an appropriate test descriptor from the ADLs mapped to the model labels—see Figure 4. For illustrative purposes only, we also wrap the descriptor in a complete sentence using a function that reverses the AIML entity extraction to randomly select an utterance: these are indicative of sentences that the system could handle when provided by a real user. The end result is a sentence such as: “I was just about to wash the dishes”—once again, the descriptor is underlined.
Compared against the oracle, we can see whether the labels extracted from natural language are able to improve learner accuracy over time as the oracle would. A successful result here would indicate that the dialogue system is useful as a means of annotation.

6.5.3 Results.

Learner accuracy for the oracle and simulated human responses are shown in Figure 5. Accuracy at zero queries is the baseline accuracy of the models trained using only the initial training set. The graph shows the number of dialogues initiated, which may or may not include a single follow-up question, and shows a comparison of active learning using the oracle vs the simulated responses. The inclusion of the oracle results simply serves to show that there is minimal loss using the dialogue-based annotation versus an oracle.
Fig. 5.
Fig. 5. Learner accuracy over time for each model, showing ground-truth annotation and annotation using simulated human responses and our dialogue system. The dotted lines across the graph represent baseline performance without subsequent active learning. Accuracy is measured always against the validation set, which is never used for active learning. \(\theta _1\) = Random Forest (RF), \(\theta _2\) = Bagging Classifier (BC), and \(\theta _3\) = Decision Tree.
These results also highlight the reusability of the approach, in that the user may have uttered one of any of the 33 ADLs and those ADLs were then grounded to the labels of the classifiers/dataset using the semantic similarity process and label mapping. The data, models, and mapping were different, but the dialogue system, ADL descriptors, QbC strategy, and semantic similarity process remained the same. This shows how our approach enables a process of transforming: natural language utterances \(\rightarrow\) a common ADL label \(\rightarrow\) a classifier/dataset label.
Using the best performing learner from the results in Figure 5, the Random Forest classifier, we can also look at the impact of varying the maximum disagreement threshold \(T_{MD}\) . Table 4 shows that a lower threshold leads of course to more frequent initiation of dialogue but also shows that the maximum achieved accuracy over the 220.31 days (based on average of 26.95 activities per day) in the query set is not severely impacted with a higher threshold. For example, \(T_{MD} = 0.75\) generates \(1.6\times\) more queries than \(T_{MD} = 0.90\) , leading to at least two more dialogues being initiated per day to gain \(2.25\%\) in maximum accuracy. If we select a milestone for accuracy, e.g., 70% given the baseline of \(63.25\%\) , then we can see how long it takes to get there based on the query rate. These results highlight the importance of carefully considering the trade-off between how frequently to disturb the user and how quickly the system makes performance gains.
Table 4.
\(T_{MD}\) | Dialogues Initiated | Avg. Dialogues/Day | Days to 70% | Final Acc. (%) | Max. Acc. (%)
0.75 | 1,326 | 6.02 | 128.08 | 74.0 | 74.75
0.80 | 1,150 | 5.22 | 160.92 | 72.5 | 73.25
0.85 | 926 | 4.20 | 148.81 | 72.0 | 73.25
0.90 | 825 | 3.74 | 170.05 | 72.5 | 72.50
0.95 | 525 | 2.38 | 206.30 | 70.5 | 70.75
Table 4. Impact of Varying \(T_{MD}\) Threshold on: Number of Dialogues Initiated; Average Number of Dialogues Initiated Per Day; Number of Queries Elapsed Before Reaching 70% Accuracy; Final Accuracy at the Point of Sample Depletion; and Maximum Accuracy Achieved During the Active Learning Process
There are 220.31 days of data in the query set based on the average of 26.95 activities per day. Results were generated using the Random Forest (RF) classifier and the simulated human responses.
An example dialogue from the system while performing active learning on the CASAS dataset using the dialogue system is shown below. Key descriptors and comparison vectors are underlined:
System: I need some help understanding what you are doing just now. Can you tell me what you are doing?
User: I’m just washing up after dinner.
Prediction: WashingDishes = 0.855, Cooking = 0.816, ...
System: I’m having a little trouble matching what you said to my activity labels, but I think I’ve narrowed it down to two. Are you doing dishes or cooking?
User: You are correct, I am doing dishes.
Prediction: WashingDishes = 1.0, Cooking = 0.805, ...
System: Ok, I have recorded that you are doing dishes.
Annotation: Work (ADL “WashingDishes” \(\mapsto\) CASAS “Work”)

7 Scenario 2: Online Segmentation

Second, we consider an activity recognition system that utilises a single model and takes advantage of online segmentation to enable active learning, as depicted in Figure 6. Here, we take a hybrid approach to activity recognition that combines some domain knowledge with a MLN-based model. We inspect the sensor data and model predictions during the current segment to determine where a query might be useful. In this section, we demonstrate how online activity recognition systems can utilise our dialogue and natural language approach described in Section 4 to resolve distinct types of learning opportunities that may arise in a home environment.
Fig. 6.
Fig. 6. Architecture of an online segmentation approach to active learning for activity recognition. In this architecture, predictions are made on different possible segmentations of the timeseries data. Queries are issued based on whether predictions across segments agree with each other, and based on whether events inside segments are consistent with current predictions. Labels are only applied to segments once the segment is finalised. Purple blocks are software components, yellow circles are models and data structures, black blocks are components external to the system, and red diamonds are decisions points.

7.1 Problem Definition and Symbols

In addition to the symbols introduced in Section 3, the following notation is relevant for this section.
Segments. Rather than make predictions solely on \(E_{t}\) , we instead make predictions on a segment of data. A segment \(S\) contains an interval of data from \(E\) , where \(S \subset E\) , from the most recent event to some point in the past, e.g., \(S = \lbrace E_{t-\tau },\ldots , E_{t}\rbrace\) . We consider that two possible overlapping segmentations with differing start points \(\tau\) may be maintained at any one time, \(S = \lbrace \sigma _{1}, \sigma _{2}\rbrace\) , enabling the learner to generate competing predictions in a time step. This concept is elucidated further later in this section.
Predictions. At every time step \(t\), the learner appends \(E_{t}\) to each segment in \(S\). This gives us multiple possible predictions from a single learner, which enables us to utilise the dialogue system. Modifying the predictions definition from Section 3, we now have \(P^{\sigma }_t = P^{\sigma }(\mathcal {L}_1,\ldots ,\mathcal {L}_m|\sigma)\), where \(\sum ^m_{i=1} P^\sigma (\mathcal {L}_i|\sigma) \le 1\).
Agreement. We check whether \(E_{t}\) is in agreement \(A\) with the previous prediction: \(a : E_{t} \rightarrow A\), where \(A = \lbrace 0, 1\rbrace\). We define \(E_{t}\) as being in agreement with \(P^{\sigma }_{t-1}\) when \(E_{t} \in E_{K}(P^{\sigma }_{t-1})\).
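In code, the agreement check reduces to a set-membership test against the events known from training for the currently predicted class. A minimal sketch, with a hypothetical known_events lookup standing in for \(E_{K}\):

def agreement(event: str, previous_prediction: dict, known_events: dict) -> int:
    """Return 1 if the new event is consistent with the previously predicted class, else 0.

    previous_prediction: mapping of class label -> probability (P^sigma_{t-1}).
    known_events: mapping of class label -> set of events seen in training (E_K).
    """
    predicted_label = max(previous_prediction, key=previous_prediction.get)
    return 1 if event in known_events.get(predicted_label, set()) else 0

# Example: a "Fridge" event while the segment's prevailing prediction is "PreparingDrink".
prev = {"PreparingDrink": 0.6, "Cooking": 0.3}
known = {"PreparingDrink": {"Kettle", "Tap", "Cupboard"}, "Cooking": {"Hob", "Fridge"}}
print(agreement("Fridge", prev, known))  # 0 -> not in agreement, may activate sigma_2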
Segmentation. Segmentation involves reasoning about predictions. It uses the agreement values \(A = \lbrace A_{\sigma _{1}}, A_{\sigma _{2}}\rbrace\) to determine when an alternative prediction is needed. The system maintains only a single segment \(\sigma _{1}\) and makes predictions \(P^{\sigma _{1}}\) until \(A_{\sigma _{1}} = 0\), at which point we use \(\sigma _{2}\) to track an alternative theory. We then consider both sets of predictions until we become confident that the new segment contains the correct activity, or until one of the segments is labelled, in which case we can eliminate the competing segment and its associated predictions. The process can then be repeated.
A segment can also be finalised, i.e., ended, at any time for other reasons determined by the meta-reasoner. Upon segment finalisation, the system returns to maintaining only a single segment.
Query Selection. To determine when to query, we need to know \(A_{\sigma _{1}}\), \(A_{\sigma _{2}}\), and whether \(P^{\sigma _1}_t\) and \(P^{\sigma _2}_t\) are in consensus, i.e., whether they select the same label: \(C = \big (\arg \max _i P^{\sigma _1}_t(\mathcal {L}_i) = \arg \max _i P^{\sigma _2}_t(\mathcal {L}_i)\big)\). This provides the data needed to perform query selection, and helps determine which segment of data any resulting label should be applied to once it is received.

7.2 Testbed

We use our MLN-based activity recognition system in our testbed, the RALT at Heriot-Watt University,8 which is a 60 m\(^2\), fully furnished apartment consisting of a bedroom, a bathroom, and combined kitchen/dining and living areas.
The testbed is equipped with a range of devices to enable activity recognition, including a Loxone smart home system, simple binary sensors, passive infrared motion sensors, smart appliances, appliance power measurement, and pressure pad sensors. We utilise these sensors as sources of information in our approach.

7.3 Model

Our hybrid activity recognition approach combines knowledge about sensors and ADLs with a probabilistic MLN model. We designed our approach to use the dialogue system introduced in Section 4, wherein users can create their own model by adding activities manually at first, and then improve that model in the long term with active learning and additional manual training where necessary. We use pracmln by Nyga et al. [38] to create MLNs and databases, and to run MLN training (discriminative pseudo-likelihood learning) and inference (MC-SAT [42]) algorithms in Python.

7.3.1 Markov Logic Networks (MLNs).

We provide here a cursory introduction to MLNs, to the extent required to understand how they are used in our approach to activity recognition. For a more in-depth explanation, we recommend Domingos and Lowd [16].
Knowledge in First-order Logic (FOL) is represented as a set of formulas (or rules) composed of symbols, variables, functions, and predicates. First-order logic allows us to declare objects and variables, and to establish relationships between them with functions and predicates. For example, the statement “Jack and Victor are friends” may be represented in first-order logic as the atomic sentence \(Friends(Jack, Victor)\). A set of formulas is known as a Knowledge-base (KB).
Markov Logic Networks “soften” first-order logic with probabilities: “when a world violates one formula in the KB it is less probable, but not impossible. The fewer formulas a world violates, the more probable it is” [16]. In essence, we can assign weights to each rule in a KB to create larger differences in log probability between worlds that satisfy a formula, and worlds that do not. In the now classic example, the statement “if someone smokes, they are more likely to get cancer” can be written with some weight \(w\) as “ \(w \; (\forall x) Smoking(x) \Rightarrow Cancer(x)\) .” Weights are typically learned from observed facts, i.e., from training samples.
Ultimately, MLNs allow us to infer knowledge about the world given only a partial grounding of variables: given some groundings as evidence \(x\), we want to determine the most probable values of the predicates \(y\) that we are unable to ground from observation, i.e., \(p(y|x)\). The probability of a possible world, i.e., of a truth assignment \(X = x\) over the groundings of the first-order logic formulas of the MLN, is defined as follows (definition from Cheng et al. [9]):
\begin{equation} P(X = x) = \frac{1}{Z} \exp \left(\sum _{i} w_{i} n_{i}(x)\right), \tag{4} \end{equation}
where \(n_{i}(x)\) is the number of true groundings of FOL formula \(F_{i}\) in \(x\) . \(X\) may contain multiple parts that correspond to the same template formula \(F_{i}\) with different truth assignments, and \(n_{i}(x)\) only counts the assignments that make \(F_{i}\) true.
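To make Equation (4) concrete, the unnormalised weight of a world is the exponential of the weighted count of satisfied formula groundings, normalised by the partition function \(Z\). The toy computation below uses made-up weights and counts and is independent of pracmln:

import math

# Hypothetical formula weights w_i and, for each candidate world x,
# the counts n_i(x) of true groundings of each formula in that world.
weights = [1.5, 0.8]
worlds = {
    "x1": [2, 1],  # n_1(x1) = 2, n_2(x1) = 1
    "x2": [1, 1],
    "x3": [0, 2],
}

def unnormalised(counts):
    return math.exp(sum(w * n for w, n in zip(weights, counts)))

Z = sum(unnormalised(c) for c in worlds.values())            # partition function
probs = {x: unnormalised(c) / Z for x, c in worlds.items()}  # P(X = x)
print(probs)  # worlds satisfying more (heavily weighted) formulas are more probable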

7.3.2 Approach.

Our approach involves automated creation of a new MLN model and associated training database for a user. To do so, our software needs prior knowledge of: (i) the predicates used in construction of the model; (ii) the rules for which weights are to be learned during training; (iii) the range of ADLs that may be performed, and optionally the locations in the home in which they may be performed; and (iv) the sensor events that may be received by the system. The first two (i/ii) are largely static, as they are embedded in the functionality of the activity recognition process, while the latter two (iii/iv) can be expanded as necessary. Our model uses the same set of labels as the dialogue system, foregoing the need for a mapping between the two label sets. The sensor events are those broadcast by the software that manages the sensors in our testbed.
Listing 1.
Listing 1. Example of MLN file created by our activity recognition software. Lines 2 and 3 define the domains (sets of constants), Lines 6 and 7 define predicates (relationships), and Line 10 represents a formula template that will be expanded during training.
An example of an untrained MLN based on the aforementioned information is shown in Listing 1. Note that events and ADLs are represented as domains, while predicates allow us to use the MLN for classification. We define initially only one rule, which is actually a template “that will generate one formula for each possible binding of that variable to one of the domain elements applicable to that argument” [38]. In essence, the rule (Listing 1, Line 10) implies that the current sequence is most likely to be of the same class as training sequences (samples) that involved the same events. Each generated rule based on the original definition will have its own weight, learned through training.
Since MLNs are used to estimate the most probable groundings of predicates, given that some variables are known (grounded) and others are not (ungrounded), we use this mechanism to estimate activities from the observations we receive from sensors. The things that we do know, the evidence, are given to the MLN via a database. Based on the evidence, the model is able to estimate the groundings of the unknown variables, e.g., the class of the sequence. In our approach, databases correspond to segments \(S\). The model makes predictions based on segments presented to it by the online segmentation module, which is discussed in the following section.
We can also include temporal information in the training data, using a domain to represent simple time-of-day information and a suitable predicate, e.g., during(x, time_of_day), where time_of_day = {Morning, Afternoon, Evening, Night}. Likewise, we can record the order of activities from the training data, so that the previous activity can help inform the next, e.g., after(activity, activity).
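Since a database corresponds to a segment, converting a segment into evidence amounts to emitting one ground atom per event, plus optional temporal atoms. The sketch below is illustrative only: the involves predicate name is a hypothetical stand-in, while during follows the example above.

def segment_to_evidence(segment_id: str, events: list[str], time_of_day: str) -> str:
    """Render a segment of sensor events as a pracmln-style evidence database string."""
    atoms = [f"involves({segment_id}, {event})" for event in events]
    atoms.append(f"during({segment_id}, {time_of_day})")
    return "\n".join(atoms)

print(segment_to_evidence("S1", ["Kettle", "Tap", "Cupboard"], "Morning"))
# involves(S1, Kettle)
# involves(S1, Tap)
# involves(S1, Cupboard)
# during(S1, Morning)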

7.4 Online Segmentation and Query Selection

Online segmentation manages the segments as described in Section 7.1. In summary, we begin considering an alternative prediction of the ongoing activity when an event occurs that is not consistent with training data for the current prediction, utilising concurrent segments \(\sigma _{1}\) and \(\sigma _{2}\) .
This strategy is consistent with a concept from human psychology, Event Segmentation Theory (EST), which posits generally that humans segment perceptions as it becomes harder to predict accurately what will happen next, i.e., “when transient errors in predictions arise, an event boundary is perceived” [57]. This theory has been tested previously using video segmentation, with participants having been shown videos comprising an activity and several sub-tasks, e.g., building a tent [8]. Participants tended to segment task boundaries, in part, based on changes in objects, time, and space.9
This means that, in comparison to typical active learning approaches for activity recognition, the sample that is ultimately labelled may itself be influenced by user input. Figure 7 illustrates, at a high level, a typical active learning pipeline for activity recognition in comparison to our approach.
Fig. 7.
Fig. 7. High-level comparison of active learning pipelines for activity recognition. In our approach the response of the end-user influences the sample that is labelled, taking advantage of online segmentation. As in Figure 2, purple blocks are software components, yellow blocks are models and data structures, and small shaded squares are individual sensor events that have been segmented.
The online segmentation process separates routine moments (e.g., transitioning from one activity to another) from important active learning opportunities (e.g., the user doing something during an activity that is not in the training data). It must also facilitate the labelling of its segments once a label has been extracted.
When there are two concurrent segments, we use the values \(A_{\sigma _{1}}\), \(A_{\sigma _{2}}\), and \(C\) (consensus) to perform query selection. We consider that the user has simply moved from one activity to the other when the confidence of the new activity in \(\sigma _{2}\) has surpassed both the confidence in \(\sigma _{1}\) and a heuristic minimum confidence threshold \(T_{MC}\). Otherwise, we consider the need for a query. We treat segments as “developing” until at least one class surpasses \(T_{MC}\), “confident” thereafter, “inconsistent” when the latest events contradict the previously confident class, and “unsure” thereafter until the original predicted class becomes confident again or a transition to a new activity is detected.
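The segment states described above can be captured by a small update rule. The following sketch is a simplified illustration rather than our exact implementation (transition detection is omitted for brevity), tracking a single segment's state from its top class confidence and agreement value at each step:

T_MC = 0.3  # illustrative value; Section 7.5 uses T_MC = 0.3

def update_state(state: str, top_confidence: float, in_agreement: bool) -> str:
    """Advance a segment's qualitative state given the latest prediction and agreement."""
    if state == "developing":
        return "confident" if top_confidence >= T_MC else "developing"
    if state == "confident":
        return "confident" if in_agreement else "inconsistent"
    if state in ("inconsistent", "unsure"):
        if in_agreement and top_confidence >= T_MC:
            return "confident"  # original class has recovered
        return "unsure"
    raise ValueError(f"unknown state: {state}")

# Example trace: confidence builds, then an unexpected event breaks agreement.
state = "developing"
for conf, agree in [(0.2, True), (0.45, True), (0.5, False), (0.35, True)]:
    state = update_state(state, conf, agree)
    print(state)  # developing -> confident -> inconsistent -> confident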
Figure 8 illustrates online segmentation and query selection. It shows how the concurrent segments are used in practice, with this particular example showing a case where the user is cooking but interleaves some events that are not in the training sequences. This leads the system to consider that the user has switched to a new activity. The result in this case is that we have two distinct predictions about the activity, which we can then use to query the user. If the user label is found to be “Cooking,” then we label \(\sigma _{1}\) , whereas if it is “PreparingDrink,” then we label \(\sigma _{2}\) from when it became active onward. If we are unable to match the user response to either of the two labels, then we would not label either segment due to the risk that the activity does not align with the segmentation, and instead prompt the user to teach the system the activity next time.
Fig. 8.
Fig. 8. Visual example of online segmentation. From left to right, we see how both segments are used to make predictions about the ongoing activity. Initially, predictions on \(\sigma _1\) develop to be confident that the current activity is “Cooking,” until an unexpected event occurs. After which, it still believes “Cooking” to be the most likely, but is less confident than before. From the unexpected event onward, predictions on \(\sigma _2\) become confident that the current activity is “PreparingDrink.” At this point, you see that the system now queries the user about their activity using the two alternate labels.
We do not apply a label to a segment until the segments are finalised. Here, we finalise a segment when any of the following conditions are met: (i) events are being generated from a location that is disjoint from the prediction at the previous time step, e.g., “Bed” event in the bedroom when the previous prediction was “Cooking” in the kitchen; (ii) the system determines that the user has moved from one activity to the other, as previously defined; or (iii) when a label is applied to one of the segments, the other non-labelled segment is finalised.
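These finalisation conditions translate directly into a short predicate. A minimal sketch, assuming a hypothetical LOCATION_OF lookup from activity label to home location:

# Hypothetical mapping from activity labels to the locations where they occur.
LOCATION_OF = {"Cooking": "Kitchen", "PreparingDrink": "Kitchen", "Sleeping": "Bedroom"}

def should_finalise(previous_label: str, event_location: str,
                    transition_detected: bool, other_segment_labelled: bool) -> bool:
    """Return True if the current segment should be finalised."""
    disjoint_location = LOCATION_OF.get(previous_label) not in (None, event_location)  # (i)
    return disjoint_location or transition_detected or other_segment_labelled  # (ii), (iii)

# e.g., a "Bed" event in the bedroom while the previous prediction was "Cooking" (kitchen):
print(should_finalise("Cooking", "Bedroom", False, False))  # True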

7.5 Experimental Evaluation and Discussion

Table 5 shows how, in this approach, the query selection process relates to and makes use of the query types introduced in Table 2 of Section 4. In this section, we refer to this table to show how each of the listed examples may occur in practice. We demonstrate each example, showing the events that lead to uncertainty, the online segmentation, the issuing of a query, and how responses to queries can change the behaviour of the system in the future. In these examples, we set the minimum confidence threshold \(T_{MC} = 0.3\). As with \(T_{MD}\) in the previous scenario, varying this threshold impacts the number of queries ultimately issued by the system.
Table 5.
Example   \(A_{\sigma _{1}}\)   \(A_{\sigma _{2}}\)   \(C\)   Type   Example Scenario
1         True    True    True    1    Unexpected event, prediction ultimately unchanged
2         False   False   True    3    Several unexpected events, no alternative prediction
3         True    True    False   2    Events are consistent with two distinct predictions
4A        False   True    False   2    Two distinct predictions, events consistent with only one
4B        True    False   False   2    Two distinct predictions, events consistent with only one
5         False   False   False   3    Events are not consistent with either distinct prediction
Table 5. Type of Query That May Be Issued Based on Values of \(A_{\sigma _{1}}\) , \(A_{\sigma _{2}}\) , and \(C\)
The system may choose not to query in some cases, depending on the confidence of the learner, e.g., when moving from a high-confidence prediction over \(\sigma _{1}\) to a high-confidence prediction over \(\sigma _{2}\). Note there are two combinations of (\(A_{\sigma _{1}}\), \(A_{\sigma _{2}}\), \(C\)) that cannot occur: (False, True, True) and (True, False, True).
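Table 5 can be read as a lookup from (\(A_{\sigma _{1}}\), \(A_{\sigma _{2}}\), \(C\)) to a query type. The sketch below encodes that mapping, returning None for the two impossible combinations; it is illustrative rather than a verbatim transcription of our query selector, which additionally checks learner confidence before deciding whether to query at all.

# (A_sigma1, A_sigma2, C) -> query type, following Table 5.
QUERY_TYPE = {
    (True,  True,  True):  1,  # Example 1: unexpected event, prediction ultimately unchanged
    (False, False, True):  3,  # Example 2: several unexpected events, no alternative prediction
    (True,  True,  False): 2,  # Example 3: events consistent with two distinct predictions
    (False, True,  False): 2,  # Example 4A: two predictions, events consistent with only one
    (True,  False, False): 2,  # Example 4B: two predictions, events consistent with only one
    (False, False, False): 3,  # Example 5: events consistent with neither prediction
}

def select_query_type(a_sigma1: bool, a_sigma2: bool, consensus: bool):
    """Return the query type for the current agreement/consensus state, or None if impossible."""
    return QUERY_TYPE.get((a_sigma1, a_sigma2, consensus))

print(select_query_type(True, True, False))  # 2
print(select_query_type(False, True, True))  # None (combination cannot occur)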
We start with a basic understanding of a handful of activities, wherein the user demonstrates a single training sample to the system for each. In our examples here, we focus on kitchen activities and show how each of the example query scenarios may arise. The initial training samples for kitchen activities are shown in Listing 2. For each example, we return to the same base model (unless specified otherwise) so that we can cleanly demonstrate the effect of each query scenario and query type.
Listing 2.
Listing 2. Initial training samples for kitchen activities taught using the dialogue system to start and end sample recording. Lines with “- - -” are used to denote separate training samples in pracmln, i.e., variable grounding “S” is independent across samples.

7.5.1 Example 1.

We first consider that while performing one activity the user interjects an unexpected event that is not in the training samples, but then completes the rest of the activity as normal. See Figure 9, where the “Fridge” event is unexpected.
Fig. 9.
Fig. 9. Example 1 lineplot showing the development of predictions for segments \(\sigma _{1}\) and \(\sigma _{2}\) . The green vertical line (leftmost vertical line) indicates the point at which \(\sigma _{2}\) became active. The purple vertical line (rightmost vertical line) indicates the point at which a query was issued.
We see that at the point of the unexpected event, the second segment is activated, but both segments ultimately come to agree that the activity is “PreparingDrink.” Here, the activity recognition system requests a query of Type 1 to confirm that the activity is, in fact, “PreparingDrink.” If the user responds positively to this query, then the system applies the label to \(\sigma _{1}\) and re-trains. In a subsequent run with the same event sequence, see Figure 10, there is no doubt about the activity and there is no query issued.
Fig. 10.
Fig. 10. Example 1 lineplot showing the development of predictions for segment \(\sigma _{1}\) only, while \(\sigma _{2}\) is not activated. No query is issued.

7.5.2 Example 2.

Now consider that the user is performing an activity that they have not taught the activity recognition system. We can see in Figure 11, that events occur within both segments that are not consistent with the prevailing prediction in that segment.
Fig. 11.
Fig. 11. Example 2 lineplot showing the development of predictions for segments \(\sigma _{1}\) and \(\sigma _{2}\) . The green vertical line (leftmost vertical line) indicates the point at which \(\sigma _{2}\) became active. The purple vertical line (rightmost vertical line) indicates the point at which a query was issued.
Here, the query is of Type 3, which can be used in situations where the user may be performing an activity that is not currently captured in the training samples. The extracted label is “WashingDishes,” and it is applied by the system to \(\sigma _{1}\) before re-training (the default behaviour for Type 3 queries is to label the longer of the two segments). After another run with the same event sequence, Figure 12 shows that the new activity is recognised thereafter and \(\sigma _{2}\) is not activated.
Fig. 12.
Fig. 12. Example 2 lineplot showing the development of predictions for segment \(\sigma _{1}\) only, while \(\sigma _{2}\) is not activated. No query is issued.

7.5.3 Example 3.

For this example, we added the “Fridge” event to the training sample for “PreparingDrink” to show what happens when events appear to be consistent with two distinct predictions. In Figure 13, we can see that the predictions over \(\sigma _{1}\) and \(\sigma _{2}\) point toward different activities with high confidence, although in fact the user was always “Cooking.”
Fig. 13.
Fig. 13. Example 3 lineplot showing the development of predictions for segments \(\sigma _{1}\) and \(\sigma _{2}\) . The green vertical line (leftmost vertical line) indicates the point at which \(\sigma _{2}\) became active. The purple vertical line (rightmost vertical line) indicates the point at which a query was issued.
In this case, the activity recognition system needs the user to clarify which activity they are performing by issuing a Type 2 query with options “Cooking” and “PreparingDrink.” If the user responds with “Cooking,” then the system will label \(\sigma _{1}\) and re-train. Here, this has the effect of making events that were previously exclusive to one activity (e.g., “Kettle” as an indicator of “PreparingDrink”) an indicator of multiple activities. In future, the prediction will depend more on the specific combination of events within the segment.

7.5.4 Example 4.

Here, the user switches from one activity to another. In Figure 14, we can see how \(\sigma _{2}\) starts as the user switches from “Cooking” to “PreparingDrink,” which eventually becomes the prevailing activity in both segments. Notice that here we do not expect a query, since the confidence of the new activity rises quickly based on the events in \(\sigma _{2}\) as the confidence of the previous activity falls based on the events in \(\sigma _{1}\) .
Fig. 14.
Fig. 14. Example 4 lineplot showing the development of predictions for segments \(\sigma _{1}\) and \(\sigma _{2}\) . The green vertical line (leftmost vertical line) indicates the point at which \(\sigma _{2}\) became active.
In this case, the activity recognition system clears \(\sigma _{1}\) and continues as normal, taking \(\sigma _{2}\) to be the segment representing the actual current activity without user input. Transitions like these make up the majority of cases when \(\sigma _{2}\) is utilised, with querying of the user only happening when necessary. We will discuss in Section 9 the problem and necessary trade-offs of trusting the predictions of the activity recognition system versus querying the user for labels.

7.5.5 Example 5.

The final example situation is similar to Example 2, with the main difference being that although in this case there are two alternative predictions that would make a Type 2 query possible, the low overall confidence for both segments suggests that the activity underway is one the system is not able to recognise. As such, a query of Type 3 is utilised. Figure 15 depicts the unfolding of this example activity.
Fig. 15.
Fig. 15. Lineplot showing the development of predictions for segments \(\sigma _{1}\) and \(\sigma _{2}\) . The green vertical line (leftmost vertical line) indicates the point at which \(\sigma _{2}\) became active. The purple vertical line (rightmost vertical line) indicates the point at which a query was issued.
Provided the user labels the new activity, the system will save \(\sigma _{1}\) as a new training sample and will re-train. Here, the label provided is “Cleaning,” and we can see in Figure 16 that the new activity is recognised thereafter.
Fig. 16.
Fig. 16. Lineplot showing the development of predictions for segment \(\sigma _{1}\) only, while \(\sigma _{2}\) is not activated. No query is issued.

8 User Evaluation

We conducted a user study where we had users annotate activities in our testbed using: an established manual annotation method, where participants denote the start and end of activities using fixed phrases; and our method using natural language. We compare the two methods based on user responses to System Usability Scale (SUS) and NASA Task Load Index (NASA-TLX) questionnaires. The purpose of this study is to evaluate the usability of our user interface relative to an existing approach, and as such the study is agnostic to any specific activity recognition and segmentation back-end.

8.1 Setup

Our study involving human participants was approved by the appropriate Ethics Committee at Heriot-Watt University. Recruitment was conducted through research group social media channels, school/centre mailing lists, as well as through posters/flyers distributed across campus.
We collected only the limited demographic data required by the University. Participants were mostly between the ages of 20 and 40, of Caucasian European background, and predominantly male. This is a limitation of our lab-based validation. Given that the study was conducted during the COVID-19 pandemic, it would not have been responsible to pursue recruitment of individuals from our ideal demographic (the elderly), as this would have required them to come physically to our testbed. We hope that in future work we are afforded the opportunity to recruit a sample more representative of our target demographic, enabling us to update our dataset and validate the results in field trials.
We had 12 participants annotate a set of ten activities simulating a morning routine using two speech-based methods. Participants were given a list of ADLs to perform in our testbed (the same for both methods), and were instructed appropriately before each condition as outlined below. Participants performed and evaluated both methods, with half of participants performing Method 1 first, and the other half performing Method 2 first.

8.1.1 Method 1: Manual Annotation.

The first method is based on a manual annotation using speech, where participants were instructed to mark the start and end of each activity using their voice and were provided with a fixed set of labels that could be used. Participants were given time to familiarise themselves with these labels prior to the labelling session, but did not retain the label sheet during the session. We instructed participants to label in the format “[activity] start/end.” For example, the participant may say “cooking start” at the start of preparing food and finish with “cooking end.” This is similar to the annotation approach in Reference [52], where participants annotated their activities while wearing a headset.

8.1.2 Method 2: Natural Language Annotation.

The second method has participants annotate their activities using natural language. We utilised a domestic robot to act as the dialogue partner during the activities, and had it initiate conversation at four points during the activity execution, which is more frequent than we would expect in reality based on the results in Table 4. We instructed participants to answer any questions posed to them by our agent about their activities and to answer the questions asked in good faith as if talking to another person.
After completing each method, we had participants complete a SUS questionnaire with a 5-point Likert Scale and a NASA-TLX questionnaire. The questions asked in our usability questionnaire are as follows:
(1) I think that I would like to use this system frequently.
(2) I found the annotation method unnecessarily complex.
(3) I thought the annotation method was easy to use.
(4) I think I would need the support of a technical person to be able to use this method.
(5) I imagine that most people would learn to use this annotation method very quickly.
(6) I felt very confident using the annotation method.
(7) I needed to learn a lot of things before I could get going with this annotation method.

8.2 Results

Responses to our SUS questionnaire are summarised in Figure 17. Findings show that more participants would like to use the natural language annotation/dialogue-based method (Method 2) frequently. Furthermore, it was also the method that participants found easier to use, felt more confident using, and felt they needed to learn less in order to use.
Fig. 17.
Fig. 17. Box plots for the System Usability Scale questionnaires for each annotation method. Median is the line through the middle of the boxes, while the bottom and top of the boxes represent the lower and upper quartiles. The whiskers below and above the box end at the points that correspond to the lowest and highest scores. Outliers are represented by red plus symbols.
A one-way ANOVA test of our SUS results indicates that the improved results in Questions 1, 2, 3, 5, and 7 are statistically significant at \(p \lt 0.05\). Participants generally indicated that, for both methods, they would likely not require the support of a technical person to use the method (Question 4), and tended to feel confident using either method (Question 6).
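For reproducibility, the per-question test can be run with standard tooling. The sketch below shows how such a one-way ANOVA might be computed for a single SUS question using SciPy; the response vectors are made up for illustration and are not our participants' data.

from scipy import stats

# Hypothetical 5-point Likert responses to one SUS question, one value per participant.
method1_scores = [3, 2, 3, 4, 2, 3, 3, 2, 4, 3, 2, 3]  # manual annotation
method2_scores = [4, 5, 4, 4, 5, 3, 4, 5, 4, 4, 5, 4]  # natural language annotation

f_stat, p_value = stats.f_oneway(method1_scores, method2_scores)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")  # significant if p < 0.05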
We provide the averaged scores for the TLX questionnaire for each method in Table 6, also illustrated in Figure 18. We used an unweighted (raw) approach; the mental demand, effort, and frustration scales give us most of what we are interested in. While neither method appears to be overly taxing, we can see the impact of Method 2 placing the initiative almost entirely on the dialogue system/agent, in that the mental demand and effort scores are notably lower. Participants also generally felt less frustrated, and more confident that they had provided correct annotations, using Method 2.
Table 6.
Method        Mental   Physical   Temporal   Performance   Effort   Frustration   Overall
Manual        7.82     3.36       4.64       7.00          5.27     7.91          6.00
NL/Dialogue   4.18     2.82       5.18       4.55          3.00     3.09          3.80
Table 6. Raw Average Scores from Participant Responses to NASA-TLX Questionnaires for Each Method on the 21-point Scale
The TLX measures: mental demand, physical demand, temporal demand, performance, effort, and frustration. For all but “Performance,” the scale runs from 0 = Very Low to 21 = Very High. For “Performance,” the scale runs from 0 = Perfect to 21 = Failure.
Fig. 18.
Fig. 18. Bar chart showing the NASA TLX results from Table 6. Note that the scoring scheme ranges from 0 to 21, but scores across the board for both approaches are below 9 and so the chart is truncated for readability.

9 Practicalities of Active Learning in Activity Recognition

The purpose of this section is to look at the “big picture” of active learning for activity recognition in the home, keeping in mind our underlying motivation: to be able to effectively monitor the activities of individuals living at home (often alone) so that we can effectively support them and their carers with assistive technology, including robotics. We maintain the assumption established earlier in this article, that for long-term activity recognition to be useful and reliable in practice, it is necessary to bring the user in-the-loop of the learning of their own activity recognition system in some way. We will discuss here some important questions about the practicalities of active learning for activity recognition, highlighting challenges, potential solutions from our work, and areas where there is more to learn (potential future work).

9.1 Model Design and Selection

While the purpose of active learning is to improve initial models over the long term, whether the process is ultimately effective is partly contingent on the appropriateness of the model and its features. The general assumption in discussions of active learning is that the models and the features of the data are capable of representing the target variables, provided that there is sufficient training data. Of course, a model that performs poorly at the outset can only improve if there is sufficient variation in feature values under the different target classes. If the model is not capable of improving, then active learning might only frustrate the user through persistent querying at a constant rate as the system fails to learn.
That is to say, active learning does not make the task of designing and crafting the activity model any easier. Indeed, it requires further investigation and consideration as to whether the model is likely to improve in the given use case. In activity recognition, the trend toward “hybrid” models, as in our online activity recognition example (Section 7), highlights the need for approaches to consider the types of knowledge that exist in the application domain and how that knowledge can be acquired in the environment.

9.2 Query Selection

As demonstrated in our own example scenarios, there is no one-size-fits-all solution to the issue of query selection. With data-driven activity recognition approaches (as in Section 6), where the model makes predictions on snapshots or windows (static or dynamic) of raw data, it seems sensible to take advantage of existing mathematical and information theory methods as tools for query selection. As the literature reflects, it is common and almost trivial in these cases to form a committee of learners to give us competing predictions that we can examine mathematically. In our case, we utilised Kullback–Leibler divergence to detect moments where maximum information could be gained from querying the user, and utilised a threshold ( \(T_{MD}\) ) to control the trigger point.
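For completeness, this style of query selection can be sketched as the average KL divergence of each committee member's predictive distribution from the committee mean, compared against \(T_{MD}\). The snippet below is a simplified illustration with made-up distributions; the exact disagreement formulation, its scale, and the trigger condition are assumptions here rather than a transcription of the measure defined earlier in the article.

import numpy as np

def committee_disagreement(predictions: np.ndarray) -> float:
    """Average KL divergence of each member's distribution from the committee mean.

    predictions: array of shape (n_members, n_classes); rows are probability distributions.
    """
    consensus = predictions.mean(axis=0)
    eps = 1e-12  # avoid log(0)
    kl = np.sum(predictions * np.log((predictions + eps) / (consensus + eps)), axis=1)
    return float(kl.mean())

T_MD = 0.1  # illustrative threshold; the appropriate scale depends on the measure used

committee = np.array([
    [0.70, 0.20, 0.10],  # e.g., Random Forest
    [0.30, 0.60, 0.10],  # e.g., Bagging Classifier
    [0.40, 0.10, 0.50],  # e.g., Decision Tree
])
d = committee_disagreement(committee)
print(d > T_MD, round(d, 3))  # initiate a dialogue only if disagreement is high enough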
With hybrid or knowledge-driven approaches, multiple models become less practical, since such approaches ultimately embed a higher degree of subjective design decisions and methods than in purely data-driven approaches. We see that in practical activity recognition systems, it is necessary to perform online segmentation, but that online segmentation itself is typically ignored as a vector for query selection. In our online activity recognition scenario (Section 7), we used a single model, but presented it with different possible segmentations of the data so that we could compare the prediction results. In doing so, we considered both the semantic and numerical aspects of the activity recognition in our query selection process.

9.3 Frequency and Timing of Queries

Technically, how frequently the system issues queries to the user is a problem that is readily solved. In both scenarios we presented, we manipulated existing query selection thresholds ( \(T_{MD}\) / \(T_{MC}\) ) to influence the frequency of queries issued, and with the CASAS dataset, we showed the effect of modulating this threshold in terms of queries issued per day. It is of course possible to account for other logical and semantic constraints, such as not querying during the night, when the user is in the bathroom, and so on.
Arguably more important is an issue not well addressed in existing literature, which is user acceptance of active learning with real-time queries. How often are users willing to be disturbed by the agent querying their activities? When do they think it is (in)appropriate to be queried? How do they react to ill-timed queries? And so on. Exploration of these issues from a usability and human-robot interaction perspective is essential in creating acceptable solutions that respect the feelings and preferences of the end-user.
The position of this article has been that querying the user as close to the activity as possible is preferential over querying in retrospect. In our online system (Section 7), we allow queries to be issued before the activity has even completed. There is undoubtedly ground to explore somewhere between instant querying and labelling after the fact, perhaps taking into consideration the user constraints as outlined above. A future approach may, for example, respect a user’s wish to not bother them when they are in the kitchen, instead opting to wait until they have left that area to initiate a query dialogue.

9.4 Applying a Label to Data

A challenge of using active learning with timeseries data lies in determining which data points the label should be applied to. In online activity recognition, at the most basic level, we have fixed windows of data, in which case the label might be applied to the window that led to the query being raised in the first place. However, since a fixed window is often an imperfect segmentation of the data, this noise can transfer to the labelled training sample.
In our online activity recognition scenario (Section 7), we wanted to try utilising the dialogue system as a tool to accommodate the possibility that the correct segment of data might depend on the answer given by the user. A natural drawback of our interface, which depends on an NLP pipeline, is that sometimes we will be unable to successfully extract a label from the user. In this case, a segment of data would go unlabelled, losing a valuable learning opportunity. In future, it would be worthwhile investigating hybrid interfaces that use dialogue as the primary mode of interaction but can fall back to a secondary interface (e.g., a mobile app) to allow: (i) after-the-fact labelling of samples where the dialogue system failed to extract a label; and (ii) labelling of samples where the system was unable to query the user at the time (e.g., due to user preferences).

10 Reflection On Societal Impact

It is our position that activity recognition systems will have their most positive societal impact by allowing remote monitoring of elderly individuals and those with physical and/or cognitive impairments. This monitoring may enable: early detection of cognitive impairments, such as Alzheimer’s disease; the detection of emergency situations, such as falls; pro-active robotic assistance before, during, and after an activity; peace of mind for family and friends; among other things. The potential positive impacts of our work are therefore in its contributions to activity recognition: (i) that a dialogue system can provide an effective means of annotating daily activities; (ii) that this allows us to rely less on supervised learning, and instead adopt an active learning approach; and (iii) that this active learning will enable person-centric activity recognition systems to better deal with long-term changes in user behaviour and in the environment.
It is important to also consider the potential negative consequences of the proliferation of our work within the context of society as many readers may know it today. First, our approach relies on the use of microphones within a user’s home, to enable on-demand dialogue and conversation. Undoubtedly, the simple existence of this data collection modality within the home makes it ripe for exploitation by bad actors, whether individual, corporate, or government-backed.
Ultimately, the system described here gathers responses to the question “What are you doing?” at various points during a user’s life. This is overt collection of private data, which we argue would naturally be of immense value to organisations in the social media and advertising spheres, who often rely on indirect data gathering to profile individuals.
For these reasons, we would be remiss not to highlight that successful commercialisation of approaches like ours has the potential to exacerbate or enable invasions of individual privacy via smart home technology.
In terms of how our work might impact inequalities, we should consider it as a piece of assistive technology. Such technology often comes with a price tag that is out of reach for many; it is therefore possible that it may amplify existing wealth and health-outcome inequalities in society.

11 Conclusion

In this article, we have presented a dialogue-based approach to self-annotation of daily activities (ADLs) and have applied it to two active learning scenarios. Furthermore, we have demonstrated the use of natural language descriptors as a means of relating semantic user-level labels of activities to a specific ADL label through the use of semantic similarity measures, which are one way of handling uncertainty in user utterances. Along with follow-up questions, we can extract a user-provided ADL annotation at a reasonable level of granularity.
Our approach shows the potential of dialogue-based user interfaces as a means of obtaining labels from the user near the time of the query, rather than having users annotate data after-the-fact (e.g., at the end of the day). Likewise, it does not require manual intervention on the part of the user, since the system initiates dialogue. Results from our user study indicate that user perceptions toward our approach are positive, particularly when compared to a more manual speech-based annotation method that requires initiative on the part of the user.
We have demonstrated the usefulness of our approach, using a semantic similarity measure to relate a label extracted from a user to a specific label used by a particular model/dataset, offering an approach suited to reuse. While applied here to activity recognition, this approach may have other applications in active learning problems where the user is asked to provide commentary on data points and events. A major strength is that, by utilising domain knowledge (in this case, ADL descriptors), the approach is agnostic to the learners after a simple mapping procedure.
We have addressed here a limitation of our previous work in Reference [48], which was originally demonstrated applied only to a single existing dataset. Here, we have applied our dialogue system to an online activity recognition system in our own testbed.
In our online activity recognition approach, we also introduced novelty by using our online segmentation process to generate appropriate queries. In doing so, we allowed the user response to influence the segmentation of data before a label is applied to it and it is moved to the training set. Here, human input not only helps the system by labelling samples, as in conventional active learning, it also helps the system segment the sample itself. We believe this exemplifies the ethos of “human-in-the-loop” assistive technology: the human helps the system to better help the human.
In producing our results, we have put together and made publicly available a modest dataset that contains natural language descriptors for 33 common ADLs (including “other”). This dataset could ultimately be expanded; however, the return on investment degrades with every additional participant due to duplicate responses. In our opinion, the dataset already achieves reasonable coverage of the most common phrases we would expect to hear in British English. Nevertheless, we expect that a wider pool of samples would only improve the effectiveness of our semantic similarity approach.
Here, we have been able to examine further the usefulness of our approach in practice, and to highlight also the challenges and requirements of active learning for activity recognition under real-world conditions (Section 9). Evidently, there are other aspects of active learning user interfaces that should be investigated in future work. In particular, there is scope for further study involving human participants to assess the impact of timing and frequency of queries, and the impact of after-the-fact labelling on data purity. From a technical standpoint, it is possible to consider alternative online segmentation strategies to evaluate their segmentation accuracy on larger datasets. Addressing the challenges we have identified may also generate more complex dialogues, and could allow us to study how users respond to more complex questions that require more thoughtful effort on their part to answer.

Footnotes

1
Datasets published under CASAS can be found here: http://casas.wsu.edu/datasets/
3
Note that we state the sum of probabilities here as being less than or equal to one. This is to reflect the fact that in practice some probabilistic models would not yield results totalling to one. This is the case, for example, in our approach in Section 7, where the model is estimating the most probable truth values of specific queries for each class: It is possible that the evidence does not support any class and that the sum of probabilities may be at or close to zero.
6
Models described here: https://spacy.io/models/en
7
Datasets published under CASAS can be found here: http://casas.wsu.edu/datasets/
9
The complete list of potential boundary detection triggers listed by the authors are: time, space, objects, characters, character interaction, causes, and goals [8].

References

[1]
Hande Alemdar, Tim L. M. van Kasteren, and Cem Ersoy. 2011. Using active learning to allow activity recognition on a large scale. In Ambient Intelligence, David V. Keyson, Mary Lou Maher, Norbert Streitz, Adrian Cheok, Juan Carlos Augusto, Reiner Wichert, Gwenn Englebienne, Hamid Aghajan, and Ben J. A. Kröse (Eds.). Vol. 7040. Springer, Berlin, 105–114.
[2]
Muhannad Alomari, Paul Duckworth, Nils Bore, Majd Hawasly, David C. Hogg, and Anthony G. Cohn. 2017. Grounding of human environments and activities for autonomous robots. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. International Joint Conferences on Artificial Intelligence Organization, 1395–1402.
[3]
Salikh Bagaveyev and Diane J. Cook. 2014. Designing and evaluating active learning methods for activity recognition. In Proceedings of the ACM International Joint Conference on Pervasive and Ubiquitous Computing Adjunct Publication (UbiComp’14). ACM Press, 469–478.
[4]
Tom Bocklisch, Joey Faulkner, Nick Pawlowski, and Alan Nichol. 2017. Rasa: Open source language understanding and dialogue management. Retrieved from http://arxiv.org/abs/1712.05181
[5]
Norman M. Bradburn, Lance J. Rips, and Steven K. Shevell. 1987. Answering autobiographical questions: The impact of memory and inference on surveys. Science 236, 4798 (Apr.1987), 157–161.
[6]
Leo Breiman. 1996. Bagging predictors. Mach. Learn. 24, 2 (Aug.1996), 123–140.
[7]
Leo Breiman. 2001. Random forests. Mach. Learn. 45, 1 (2001), 5–32.
[8]
Eleonora Ceccaldi, Nale Lehmann-Willenbrock, Erica Volta, Mohamed Chetouani, Gualtiero Volpe, and Giovanna Varni. 2019. How unitizing affects annotation of cohesion. In Proceedings of the 8th International Conference on Affective Computing and Intelligent Interaction (ACII’19). IEEE, 1–7.
[9]
Guangchun Cheng, Yiwen Wan, Bill P. Buckles, and Yan Huang. 2014. An introduction to markov logic networks and application in video activity analysis. In Proceedings of the 5th International Conference on Computing, Communications and Networking Technologies (ICCCNT’14). IEEE, 1–7.
[10]
Kenneth Ward Church. 2017. Word2Vec. Natur. Lang. Eng. 23, 1 (Jan.2017), 155–162.
[11]
Gabriele Civitarese, Claudio Bettini, Timo Sztyler, Daniele Riboni, and Heiner Stuckenschmidt. 2018. NECTAR: Knowledge-based collaborative active learning for activity recognition. In Proceedings of the IEEE International Conference on Pervasive Computing and Communications (PerCom’18). IEEE, 1–10.
[12]
Gabriele Civitarese, Claudio Bettini, Timo Sztyler, Daniele Riboni, and Heiner Stuckenschmidt. 2019. newNECTAR: Collaborative active learning for knowledge-based probabilistic activity recognition. Pervas. Mobile Comput. 56 (May2019), 88–105.
[13]
Gabriele Civitarese, Timo Sztyler, Daniele Riboni, Claudio Bettini, and Heiner Stuckenschmidt. 2021. POLARIS: Probabilistic and ontological activity recognition in smart-homes. IEEE Trans. Knowl. Data Eng. 33, 1 (Jan.2021), 209–223.
[14]
David Cohn, Les Atlas, and Richard Ladner. 1994. Improving generalization with active learning. Mach. Learn. 15, 2 (May1994), 201–221.
[15]
Diane Cook. 2012. Learning setting-generalized activity models for smart spaces. IEEE Intell. Syst. 27, 1 (Jan.2012), 32–38.
[16]
Pedro Domingos and Daniel Lowd. 2009. Markov logic: An interface layer for artificial intelligence. Synth. Lect. Artific. Intell. Mach. Learn. 3, 1 (Jan.2009), 1–155.
[17]
AIML Foundation. 2018. Artificial Intelligence Markup Language. Retrieved from http://www.aiml.foundation/
[18]
Labiba Gillani Fahad, Asifullah Khan, and Muttukrishnan Rajarajan. 2015. Activity recognition in smart homes with self verification of assignments. Neurocomputing 149 (Feb.2015), 1286–1298.
[19]
Joon Ho Lee, Myoung Ho Kim, and Yoon Joon Lee. 1993. Information retrieval based on conceptual distance in IS-A Hierarchies. J. Document. 49, 2 (Feb.1993), 188–207.
[20]
Xin Hong and Chris D. Nugent. 2009. Partitioning time series sensor data for activity recognition. In Proceedings of the 9th International Conference on Information Technology and Applications in Biomedicine. IEEE, 1–4.
[21]
Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python. https://spacy.io/
[22]
Enamul Hoque, Robert F. Dickerson, and John A. Stankovic. 2014. Vocal-diary: A voice command-based ground-truth collection system for activity recognition. In Proceedings of the Wireless Health on National Institutes of Health (WH’14). ACM Press, 1–6.
[23]
Enamul Hoque and John Stankovic. 2012. AALO: Activity recognition in smart homes using active learning in the presence of overlapped activities. In Proceedings of the 6th International Conference on Pervasive Computing Technologies for Healthcare. IEEE.
[24]
H. M. Sajjad Hossain, Md Abdullah Al Haiz Khan, and Nirmalya Roy. 2018. DeActive: Scaling activity recognition with active deep learning. Proc. ACM Interact., Mobile, Wear. Ubiq. Technol. 2, 2 (July2018), 1–23.
[25]
H. M. Sajjad Hossain, Md Abdullah Al Hafiz Khan, and Nirmalya Roy. 2017. Active learning enabled activity recognition. Pervas. Mobile Comput. 38 (July2017), 312–330.
[26]
Dan Jurafsky and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2nd ed.). Pearson Prentice Hall, Upper Saddle River, NJ.
[27]
Ross A. Knepper, Stefanie Tellex, Adrian Li, Nicholas Roy, and Daniela Rus. 2015. Recovering from failure by asking for help. Auton. Robots 39, 3 (Oct.2015), 347–362.
[28]
Narayanan C. Krishnan and Diane J. Cook. 2014. Activity recognition on streaming sensor data. Pervas. Mobile Comput. 10 (Feb.2014), 138–154.
[29]
Narayanan C. Krishnan and Sethuraman Panchanathan. 2008. Analysis of low resolution accelerometer data for continuous human activity recognition. In 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 3337–3340.
[30]
Daniele Liciotti, Michele Bernardini, Luca Romeo, and Emanuele Frontoni. 2020. A sequential deep learning application for recognising human activities in smart homes. Neurocomputing 396 (2020), 501–513.
[31]
Rong Liu, Ting Chen, and Lu Huang. 2010. Research on human activity recognition based on active learning. In Proceedings of the International Conference on Machine Learning and Cybernetics. IEEE, 285–290.
[32]
Nattaya Mairittha and Tittaya Mairittha. 2019. A dialogue-based annotation for activity recognition. In Proceedings of the ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the ACM International Symposium on Wearable Computers. 6.
[33]
Prem Melville and Raymond J. Mooney. 2004. Diverse ensembles for active learning. In Proceedings of the 21st International Conference on Machine Learning (ICML’04). ACM, 74.
[34]
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. Retrieved from http://arxiv.org/abs/1301.3781
[35]
Dipendra K. Misra, Jaeyong Sung, Kevin Lee, and Ashutosh Saxena. 2016. Tell me Dave: Context-sensitive grounding of natural language to manipulation instructions. Int. J. Robot. Res. 35, 1–3 (Jan.2016), 281–300.
[36]
Tudor Miu, Paolo Missier, and Thomas Plotz. 2015. Bootstrapping personalised human activity recognition models using online active learning. In Proceedings of the IEEE International Conference on Computer and Information Technology; Ubiquitous Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing. IEEE, 1138–1147.
[37]
Saad Mohamad, Moamar Sayed-Mouchaweh, and Abdelhamid Bouchachia. 2020. Online active learning for human activity recognition from sensory data streams. Neurocomputing 390 (May2020), 341–358.
[38]
Daniel Nyga, Mareike Picklum, and Michael Beetz. 2013. pracmln—Markov logic networks in Python. Retrieved from http://www.pracmln.org/
[39]
George Okeyo, Liming Chen, Hui Wang, and Roy Sterritt. 2014. Dynamic sensor data segmentation for real-time activity recognition. Pervas. Mobile Comput. 10 (2014), 155–172.
[40]
Javier Ortiz Laguna, Angel García Olaya, and Daniel Borrajo. 2011. A dynamic sliding window approach for activity recognition. In User Modeling, Adaption and Personalization, Joseph A. Konstan, Ricardo Conejo, José L. Marzo, and Nuria Oliver (Eds.). Vol. 6787. Springer, Berlin, 219–230.
[41]
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12 (2011), 2825–2830.
[42]
Hoifung Poon and Pedro Domingos. 2006. Sound and efficient inference with probabilistic and deterministic dependencies. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1 (AAAI’06, Boston, Massachusetts). AAAI Press, 458–463.
[43]
Philip Resnik. 1999. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. J. Artific. Intell. Res. 11 (Jul 1999), 95–130.
[44]
Burr Settles. 2009. Active Learning Literature Survey. http://digital.library.wisc.edu/1793/60660
[45]
H. S. Seung, M. Opper, and H. Sompolinsky. 1992. Query by committee. In Proceedings of the 5th Annual Workshop on Computational Learning Theory (COLT’92). ACM, 287–294.
[46]
Sidney Katz. 1983. Assessing self-maintenance: Activities of daily living, mobility, and instrumental activities of daily living. Journal of the American Geriatrics Society 31, 12 (1983), 721–727.
[47]
Deepika Singh, Erinc Merdivan, Ismini Psychoula, Johannes Kropf, Sten Hanke, Matthieu Geist, and Andreas Holzinger. 2017. Human activity recognition using recurrent neural networks. Retrieved from https://arxiv.org/abs/1804.07144
[48]
Ronnie Smith and Mauro Dragone. 2022. A dialogue-based interface for active learning of activities of daily living. In Proceedings of the 27th International Conference on Intelligent User Interfaces. ACM, Helsinki Finland, 820–831.
[49]
Maja Stikic, Kristof Van Laerhoven, and Bernt Schiele. 2008. Exploring semi-supervised and active learning for activity recognition. In Proceedings of the 12th IEEE International Symposium on Wearable Computers. IEEE, 81–88.
[50]
Emmanuel Munguia Tapia. 2003. Activity Recognition in the Home Setting Using Simple and Ubiquitous Sensors. Ph.D. Dissertation.
[51]
Michel Vacher, Benjamin Lecouteux, Pedro Chahuara, François Portet, Brigitte Meillon, and Nicolas Bonnefond. 2014. The sweet-home speech and multimodal corpus for home automation interaction. In Proceedings of the 9th Edition of the Language Resources and Evaluation Conference (LREC’14). 9.
[52]
Tim van Kasteren, Athanasios Noulas, Gwenn Englebienne, and Ben Kröse. 2008. Accurate activity recognition in a home setting. In Proceedings of the 10th International Conference on Ubiquitous Computing (UbiComp’08). ACM Press, 1.
[53]
T. L. M. van Kasteren, G. Englebienne, and B. J. A. Kröse. 2010. An activity monitoring system for elderly care using generative and discriminative models. Person. Ubiq. Comput. 14, 6 (Sept.2010), 489–498.
[54]
Liang Wang, Tao Gu, Xianping Tao, and Jian Lu. 2012. A hierarchical approach to real-time activity recognition in body sensor networks. Pervas. Mobile Comput. 8, 1 (Feb.2012), 115–130.
[55]
Juan Ye, Graeme Stevenson, and Simon Dobson. 2014. USMART: An unsupervised semantic mining activity recognition technique. ACM Trans. Interact. Intell. Syst. 4, 4 (Nov. 2014), 1–27.
[56]
Kazuhiro Yoshiuchi, Yoshiharu Yamamoto, and Akira Akabayashi. 2008. Application of ecological momentary assessment in stress-related diseases. BioPsychoSocial Med. 2, 1 (Dec.2008), 13.
[57]
Jeffrey M. Zacks, Nicole K. Speer, Khena M. Swallow, Todd S. Braver, and Jeremy R. Reynolds. 2007. Event perception: A mind-brain perspective. Psychol. Bull. 133, 2 (Mar.2007), 273–293.
[58]
Liyue Zhao, Gita Sukthankar, and Rahul Sukthankar. 2011. Robust active learning using crowdsourced annotations for activity recognition. In Proceedings of the Workshops at the 25th AAAI Conference on Artificial Intelligence. 7.
[59]
Yue Zhao, Ciwen Xu, and Yongcun Cao. 2006. Research on query-by-committee method of active learning and application. In Advanced Data Mining and Applications, Xue Li, Osmar R. Zaüane, and Zhanhuai Li (Eds.). Vol. 4093. Springer, Berlin, 985–991.
[60]
Chun Zhu and Weihua Sheng. 2009. Human daily activity recognition in robot-assisted living using multi-sensor fusion. In Proceedings of the IEEE International Conference on Robotics and Automation. IEEE, Kobe, 2154–2159.


