1 Introduction
User identification is an important step towards achieving and maintaining a personalised long-term interaction with robots. For instance, a user may need to be identified for providing personalised rehabilitation therapy [41]. When a robot is first deployed, it will start from a “tabula rasa” state with no prior knowledge of users. As users are encountered over a possibly extended period of time, their identity and information are stored by the robot. Hence, the system has to identify enrolled and “unknown” users, which is known as open-set identification. Open-set identification is a well-established field [48, 76, 77], but in a real-world setting, these unknown users might need to be added into the system for future recognition. One solution is to retrain the system after introducing a novel user. However, this requires storing the previous samples, which could create a prohibitively large computational burden in long-term deployments. Furthermore, it would require a significant amount of time to retrain with a growing number of users and samples [8]. Instead, the system should allow scaling and support incremental learning of new classes, which is termed open world recognition [8].
Face recognition (FR), i.e., identifying a person based on their face, has been the most prominent technique in biometric identification due to its non-intrusive character. Most state-of-the-art methods use deep learning-based approaches [68, 79, 80, 81], but only a few approaches exist for open-set recognition [9, 33]. Most models are not suitable for open world recognition due to the catastrophic forgetting problem, which refers to the drastic loss of performance on previously learned classes when a new class is introduced [62, 63, 66]. Existing approaches that could help to overcome this problem often require a part of the previous data for retraining, which might not be available.
Incremental learning alone is not sufficient for adapting to changes in the environment. For instance, an algorithm designed for open world recognition may not be able to recognise a person after a new haircut, because the model is not updated for known samples. Humans provide a good model for recognition, because they can continuously adapt to changing circumstances by updating their prior beliefs, known as online learning, and because they use multi-modal information instead of a single biometric (modality) to estimate identity, such as recognising a person from their voice in a dark room. Biometric systems that combine multiple biometric traits or attributes obtained through the same sensor (e.g., face and iris [16, 21, 83, 87]) or various sensors (e.g., face and voice [10, 17, 18, 58, 82]) for establishing identity are known as multi-modal biometric systems [24, 47]. Most robots are also suitable for multi-modal recognition, as they have multiple sensors and perception algorithms (as shown in Figure 1), which allow them to recognise users even when data are inaccurate or noisy, for example, in the case of image blur or illumination changes [85]. Moreover, the combination of multi-modal data can help overcome issues related to similarities between users\(^{1}\) by differentiating on additional available information, for example, age and gender. Such ancillary physical or behavioural characteristics, called soft biometrics, can be used to improve the recognition performance [24, 45, 47]. Combining multi-modal recognition with online learning can further improve recognition over time. For instance, a user can initially be mistaken for another in certain circumstances, but these variations can be learned over time and combined with other modalities to improve recognition where FR fails.
In our earlier work [43], we proposed a multi-modal weighted Bayesian network with online learning, which is the first approach for combining soft biometrics (gender, age, height, and time of interaction) with a primary biometric (face recognition) for open world user identification in real-time human-robot interaction (HRI). This model, here referred to as the Multi-modal Incremental Bayesian Network (MMIBN), is the first method for sequential and incremental learning in open world user recognition that allows starting from a state without any known users (i.e., it does not require preliminary training to recognise users, and it can learn new users incrementally). That work showed that the proposed model is suitable for real-time user recognition in real-world human-robot interaction experiments. However, the limited population size (14 users) and the narrow age range (24–40) of the users in that experiment prevented us from claiming that the results generalise to larger populations. Moreover, obtaining a dataset that encapsulates a diverse set of characteristics for a large number of users over long-term interactions is a laborious task in HRI. Thus, we created the Multi-modal Long-Term User Recognition Dataset,\(^{2}\) which contains images of 200 users (with an age range of 10 to 63) with name, gender, age, and height labels, along with artificially generated height estimations and various times of interaction to simulate a long-term HRI scenario. We obtained the images from the largest publicly available dataset of face images with gender and age labels, the IMDB-WIKI dataset\(^{3}\) [71, 72]. To obtain the multi-modal biometric information from these images (face, gender, and age estimations), we used the proprietary (NAOqi) algorithms of the Pepper robot,\(^{4}\) similar to our earlier work.
Our main contribution is the extension of our earlier work [43] to take in multi-modal information, typically available in HRI, to markedly increase user identification and subsequently improve user experience in long-term interactions for a large number of users in a variety of settings. We also provide a detailed description of the Multi-modal Incremental Bayesian Network, highlighting the mathematical formulations and assumptions behind the models that were not addressed in Reference [43]. In addition, we present our findings from applying the optimised models in long-term HRI experiments in the real world [41, 42, 43]. Correspondingly, we make the following contributions (source code, multi-modal dataset, trained models, and results on the dataset are available online\(^{2}\)):
• creating the Multi-modal Long-term User Recognition Dataset with 200 users of varying characteristics;
• introducing the long-term recognition performance loss;
• combining optimal normalisation methods for each parameter in the Bayesian network in a hybrid approach;
• formulating the proposed online learning in terms of Expectation Maximization (EM) and Maximum Likelihood (ML);
• applying Bayesian optimisation to the weights of the soft biometric identifiers and the quality of the estimation;
• evaluating the proposed model against a state-of-the-art open world recognition method (Extreme Value Machine [73]);
• evaluating the stability of the model for learning users sequentially (similar to batch learning) and at random intervals (similar to a real-world scenario);
• evaluating the generalisability of the model for new users (performance on the training set in comparison to open-set and closed-set recognition);
• evaluating the model for varying frequency of user appearances (modelled with uniform and Gaussian timing of interaction, and varying dataset sizes);
• evaluating the progress of the model over time (with the increasing number of recognitions);
• analysing recognition bias in face recognition, the proposed approach, and the Extreme Value Machine;
• evaluating the models on the data from the four-week real-world HRI study in Reference [43] in comparison to the corresponding optimised models;
• evaluating the model in a real-world (five-day) HRI study with a personalised barista robot at an international student campus in Paris (France);
• evaluating the models in a long-term (five-month) HRI study within a cardiac rehabilitation programme at a hospital in Bogotá (Colombia).
The rest of the article is organised as follows: Section 2 gives a brief overview of the current practice in open world recognition, online learning, multi-modal biometric algorithms, and user recognition in HRI. Section 3 describes the methodology and the structure of the proposed Bayesian network. Section 4 describes the recognition module for NAOqi that is used to obtain the multi-modal biometric information for the proposed model. Section 5 explains the procedure for creating the Multi-modal Long-term User Recognition Dataset. Section 6 presents the empirical evaluation of the proposed methods on closed-set and open-set datasets. Section 7 highlights the implications of the results and discusses the initial assumptions. Section 8 evaluates the optimised models in long-term HRI studies in the real world. Section 9 concludes with a summary of the work.
3 Multi-modal Incremental Bayesian Network
A Bayesian network is a probabilistic graphical model that represents conditional dependencies of a set of variables through a directed acyclic graph. Bayesian networks are suitable for combining scores of identifiers with uncertainties when the knowledge of the world is incomplete [78].
We developed a weighted multi-modal incremental Bayesian network (MMIBN), integrating multi-modal biometric information for reliable recognition in open world identification through a naive Bayes model (see Figure 2). The naive Bayes classifier model assumes conditional independence between predictors, which is a reasonable assumption for a multi-modal biometric identifier, as the individual identifiers do not affect each other's results.
The architecture for the estimation of the user identity (\(I\)) in MMIBN and the recognition process are presented in Figures 14 and 15 in Appendix A. The primary biometric in our system is face recognition (\(F\)), which is fused with soft biometrics, namely, gender (\(G\)), age (\(A\)), and height (\(H\)) estimations, in addition to the time of interaction (\(T\)), which can be distinguishing if users are encountered at patterned interaction times, such as weekly appointments in rehabilitation. We hypothesise that the integration of these soft biometrics will reduce the effects of noisy data, as described in Section 1, and increase the identification rate. Nonetheless, the MMIBN allows extension with other primary biometric traits, such as voice and fingerprint, and other soft biometrics, such as eye colour and gait, to improve recognition.
The pyAgrum\(^{5}\) [36] library is used for implementing the Bayesian network structure. Parts of MMIBN were previously described in our prior work [43]; however, this section provides the underlying mathematical formulations and full details of the system for reproducibility, and introduces the long-term recognition performance loss (Section 3.6) and hybrid normalisation (Section 3.7).
3.1 Structure
The number of states for each node depends on the modality: \(F\) and \(I\) nodes have \(n_e+1\) states, where \(n_e\) is the number of enrolled (known) users. \(A\) and \(H\) nodes are restricted to the available range of the identifier, such as \([0,75]\) for \(A\) and \([50,240]\) (cm) for \(H\). \(G\) has “female” and “male” states. \(T\) is defined by the day of the week and the time, through time slots. For example, if each minute corresponds to a time slot (i.e., the time period \(t_p\) is 1 min), then there will be 10,080 \(T\) states (there are 10,080 minutes in a week).
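To make the state-space bookkeeping concrete, the following minimal Python sketch (ours, not the authors' implementation; variable names are illustrative) derives the number of states per node from the ranges above:

```python
# Minimal sketch: deriving the number of states per node from the
# modality ranges described in Section 3.1.

def num_time_slots(t_p_minutes: int) -> int:
    """Number of T states for a weekly cycle with time period t_p."""
    minutes_per_week = 7 * 24 * 60  # 10,080
    return minutes_per_week // t_p_minutes

n_e = 2                          # enrolled users
n_face_states = n_e + 1          # F and I: one state per user + "Unknown"
n_age_states = 75 - 0 + 1        # A restricted to [0, 75]
n_height_states = 240 - 50 + 1   # H restricted to [50, 240] cm
n_gender_states = 2              # "female", "male"

print(num_time_slots(1))   # 10080 (1-minute slots)
print(num_time_slots(30))  # 336 (30-minute slots, as used in this article)
```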
When a user is encountered, the corresponding multi-modal biometric evidence is collected from the identifiers. An example of the biometric evidence from the identifiers and the transformed (weighted and normalised) evidence is shown in Figure 16 B in Appendix A. FR provides similarity scores, which give the percentage of similarity of the user to the known faces in the database. Age, height, and time are assumed to be discrete random variables with a discretised and normalised normal distribution of probabilities, \(N(\mu , \sigma ^2)\), defined by Equation (1), where \(V\) is the estimated value, \(Z\) is the standard score, and \(C\) is the confidence of the biometric indicator for the estimated value.
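As an illustration, the sketch below builds such a discretised and normalised evidence distribution for the age node. Equation (1) is not reproduced above, so how the spread is derived from the standard score \(Z\) and confidence \(C\) is an assumption here; we simply take a fixed standard deviation.

```python
import numpy as np

def discretised_normal_evidence(estimate, sigma, states):
    """Evidence over discrete states as a discretised, normalised N(estimate, sigma^2).

    `estimate` is the identifier's estimated value (V); how `sigma` follows
    from the identifier's confidence (C) is given by Equation (1) in the
    article and is assumed fixed in this sketch.
    """
    pdf = np.exp(-0.5 * ((states - estimate) / sigma) ** 2)
    return pdf / pdf.sum()  # normalise so the evidence sums to 1

age_states = np.arange(0, 76)  # A is restricted to [0, 75]
evidence_A = discretised_normal_evidence(estimate=27.0, sigma=9.3, states=age_states)
```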
The time period and its standard deviation (\(\sigma _t\) in the normal distribution of \(T\)) can be set depending on the precision required in the application. A smaller time period and standard deviation ensure higher precision; however, this increases the complexity of the Bayesian network, thereby increasing the time to identify the user. In addition, a higher precision carries the risk of decreasing the recognition rate if the users are not encountered near the time slot at which they were previously seen. For example, if users in the application scenario change every 5 minutes, then \(t_p=5\) min and \(\sigma _t=15\) min would be reasonable. However, in an HRI scenario, \(t_p=30\) min with \(\sigma _t=60\) min can allow better identification, because users are less likely to be encountered around the same time every day. Hence, we use the latter in this article.
3.2 Weights of the Network
Soft biometric traits are characteristics that are not suited to identifying an individual uniquely. We can assume that the population will have similar characteristics, but the distribution is unknown. However, some soft biometric features may contain more information about an individual than others; e.g., age is often more informative than gender. This can be modelled by using different weights for the parameters in a Bayesian network [45].
Weights (\(w_i\)) are used as exponents on the likelihoods of the child nodes (\(X_i\)), similar to the work in Reference [88]. In contrast to our previous work [43], we optimise the weights of the soft biometric features (gender, age, height, and time of interaction) through Bayesian optimisation, as described in Appendix C.6, while the weight of the face node (\(w_F\)) is set to 1, as it is the only primary biometric in our system. The posterior probability \(P(I^j| X_1, \ldots , X_n)\) is approximated as in Equation (2), where \(I^j\) stands for the \(j\)th user (\(I=j\)) and \(I\) is the identity node. As in Reference [45], we assume that the identifiers perform equally well on all users. Therefore, the accuracy of an identifier is independent of the user, and equal priors are assumed for each of the identifiers. The posterior probability then simplifies to Equation (3).
Because the distribution of users over time is not known, one approach for determining \(P(I^j)\) is to use adaptive priors based on the frequencies of user appearance. However, this can bias the system towards the most frequently observed user, as the prior affects the posterior probability directly, and may thus decrease the identification rate.
Therefore, we assume that encountering user \(j\) is equally likely as encountering user \(m\); hence, we assume equal priors for \(P(I)\), as shown in Equation (4), where \(n_{e}\) is the number of enrolled users, which is updated whenever a new user is enrolled, as presented in Figure 16 in Appendix A.
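Equation (4) is not reproduced above; under the equal-prior assumption, and counting the “Unknown” state \(U\) alongside the \(n_e\) enrolled users (our assumption), it would take the form
\[ P(I^j) = \frac{1}{n_e + 1}, \qquad \forall j \in \lbrace 1, \ldots , n_e, U\rbrace . \]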
3.3 Quality of the Estimation
Algorithms for open-set problems generally use a threshold (e.g., over the highest probability/score) to determine whether the user is already enrolled or “unknown.” However, the resulting posterior probabilities in a Bayesian network can be low due to the multiplication of the conditionally independent modalities, and they vary depending on the number of states. Hence, we use the two-step ad hoc mechanism introduced in Reference [43] to transform the Bayesian network to allow open-set recognition: (1) An “Unknown” (\(U\)) state is used in both \(F\) and \(I\) nodes. The similarity score in FR of \(U\) is set to the FR threshold (\(\theta _{FR}\)), such that, when normalised, scores below/above the threshold will have lower/higher probabilities than \(U\). This allows maintaining the threshold of the FR system in use. (2) We use the confidence measure called the quality of the estimation (\(Q\)). Given the evidence \(y_t\) at time \(t\), it compares the highest posterior probability (\(P_w\)) to the second highest (\(P_s\)), as shown in Equation (5). The difference between the probabilities decreases as the number of enrolled users (\(n_e\)) increases, since \(\sum _{j} P(I^j|y_t)=1.0\). A similar method was used in Reference [31] for estimating the quality of localisation based on different images.
Using the quality of the estimation helps decrease misidentifications. For example, the highest posterior score can be very high, but if the second highest posterior is very close to it, then there are two strong candidates for the current user. If the system were to identify the user in this case, then the resulting misidentification could cause adverse effects on the current user, especially when there are gender or age differences between the two candidates, as well as security issues. Thus, it is preferable to identify the user as unknown if the quality is zero or below a predetermined threshold (\(\theta _Q\)) or if \(U\) has the highest posterior probability. Otherwise, the identity is estimated with a maximum a posteriori (MAP) estimation, given in Equation (6).
3.4 Incremental Learning
For personalisation in long-term HRI applications, new users may often need to be enrolled in the system to allow recognition in subsequent encounters, such as when admitting a new patient to personalised robot therapy. However, in such applications, the intermediary (e.g., clinical staff) and end-users (e.g., patients) are often non-experts; hence, systems that require the least amount of technical knowledge, effort, and time are desirable, especially those that allow users to enrol themselves. Thus, we developed an incremental learning system for the weighted multi-modal Bayesian network, which expands the network upon new user enrolment. When the MMIBN detects that the user is new, the robot requests to meet the user and (verbally) asks for their name, gender, birth year, and height, which the user can enter through a tablet interface, after which a photo of the user is taken by the robot (step 9 in Figure 15). This information, along with the time of interaction, is gathered to obtain the ground truth values for recognition and to set the initial likelihoods of the MMIBN.
Initially, the system starts from a “tabula rasa” state, where there are no known users. The Bayesian network is formed when the first user is enrolled, with one state for the new user and one for the “Unknown” (\(U\)) state. Figure 16 A (in Appendix A) illustrates an example of the initial MMIBN after the enrolment of the first user, e.g., a 25-year-old female who is 168 cm tall and encountered at 11:00 am on a Monday. The initial likelihood for \(F\) is set to be much higher for the true values, as shown in Equation (7), where \(w_F\) is the weight of the face variable and \(n_e\) is the number of enrolled users. The value was found based on preliminary experiments. The remaining likelihoods are set using the prior knowledge entered by the user, in a similar structure to the evidence for the age, height, and time variables, with a discretised and normalised normal distribution, \(N(\mu , \sigma ^2)\), where \(\mu\) is the true value (e.g., the age of the person) and \(\sigma\) is the standard deviation of the identifier. Gender is set at a \([0.99^{w_G}, 0.01^{w_G}]\) ratio, which was found experimentally. For the unknown state, \(P(X_i^k|I^U)\) is set to be uniformly distributed, as an unknown user can be of any age and height and be recognised at any time of the day, except for the face node, which follows Equation (7).
When a new user is enrolled, the Bayesian network is expanded by adding a new state to the \(I\) and \(F\) nodes. \(P(F^k|I^j)\) for each previous state in \(I\) (including \(U\)) is updated by appending the value corresponding to the \(k \ne j\) condition in Equation (7), and then the probabilities are re-normalised. The likelihoods of the \(G\), \(A\), \(H\), and \(T\) nodes for the previously enrolled users remain the same. An example of the MMIBN likelihoods during incremental learning of a new user, e.g., a 37-year-old male, 173 cm tall, encountered on a Wednesday at 8:00 pm, is illustrated in Figure 16 E in Appendix A.
The scalability feature removes the need to retrain the network when a new user is introduced; hence, the time complexity is decreased, which can be crucial if the new user is introduced at a later step (e.g., after 1,000 users). More precisely, if each image corresponding to the average number of observations per user \(\overline{n_o}\) were to be recognised again after a new user is added to the face database, then retraining would take a significant amount of time compared to scaling the network, since \(n_e \, \overline{n_o} \, \mathcal {O}(FR) \gg n_e \, \mathcal {O}(1)\) updates, where \(\mathcal {O}(FR)\) is the time complexity of the FR algorithm and \(n_e\) is the number of enrolled users.
To reduce the risk of confusing new users with known users, it is preferable to have sufficient data within the MMIBN before making reliable estimations. Hence, in the first few recognitions (here, we chose \(N\lt N_{min}=5\) recognitions, i.e., the first 4 recognitions),\(^{6}\) the identity is declared as unknown, regardless of the estimated identity, as illustrated in Figure 16 C (Appendix A).
3.5 Online Learning of Likelihoods
Bayesian network parameters are generally determined by expert opinion or by learning from data [51]. The former can cause incorrect estimations if the set probabilities are not accurate enough. The latter, for which Maximum Likelihood (ML) estimation is commonly used, is not possible when the Bayesian network is constructed with incomplete data. One solution is to use offline batch learning; however, this requires storing data, which can cause memory problems in long-term interactions. Another approach is to update the parameters as the data arrive, which is termed online learning. Variants of the Expectation Maximization (EM) algorithm with a learning rate (EM(\(\eta\))) [6, 20, 57, 59] have been proposed for online learning in Bayesian networks.
We use a Bayesian network where the likelihoods are updated through EM(\(\eta\)) with an adaptive \(\eta\) (learning rate) based on ML estimation, similar to Voting EM [20]. Adopting the notation in Reference [6], the formulation is given in Equation (8).
\(\theta _{ijk}^{t}\) represents the likelihood of the modality \(X_i\) at time \(t\), \(P(X_i=x_i^k|I^j)\). \(P_{\theta ^t}(x_{i}^{k}|y_{t}, I^{j})\) represents the posterior probability of the modality \(X_i\) at time \(t\) given the current evidence \(y_t\) and the actual identity of the user \(I^j\). The difference between Voting EM and our approach is that we work with continuous probabilities due to uncertainties in the identifiers. We will refer to the proposed multi-modal incremental Bayesian network with online learning as MMIBN:OL.
Combining the ML estimate to achieve an adaptive learning rate (given in Equation (9)) allows the learning rate to depend on the number of observations of user \(j\) (\(n_{oj}\)), which is more reliable than using a fixed rate for all users. Moreover, each observation of the user creates a progressively smaller update on the likelihoods, such that the effect of a new observation decreases as the number of recognitions of the user increases.
Supervised learning is necessary to achieve accurate online learning. The identity of the user should be known for updating the corresponding likelihoods, which can be achieved in HRI by asking for a confirmation of the estimated identity.
If the user \(j\) is previously enrolled in the system, then the likelihoods are only updated for user \(j\), as shown in Figure 16 D (in Appendix A) based on the evidence in Figure 16 B. However, if the user \(j\) is a new user, then online learning is applied to the face likelihood for the unknown state (\(P(F^k|I^U)\)), followed by incremental learning by expanding the MMIBN (as described in Section 3.4), and finally by applying online learning for the new user, as illustrated in steps 8–18 in Figure 15 and in Figure 16 F. The likelihoods of gender, age, height, and time remain the same for \(U\) to ensure a uniform distribution.
3.6 Long-term Recognition Performance Loss
The standard metrics for open-set identification are the Detection and Identification Rate (DIR) and the False Alarm Rate (FAR) [69]. DIR is the fraction of correctly classified probes (samples) within the probes of the enrolled users (\(\mathscr{P}_\mathscr{E}\)), given in Equation (10). FAR is the fraction of incorrectly classified probes within the probes of unknown users (\(\mathscr{P}_\mathscr{U}\)), given in Equation (11).
In other words, DIR represents the “true positive” (TP) rate for enrolled users, in which the current probe (referring to the multi-modal biometric sample) belongs to a previously enrolled user and is identified correctly. FAR serves as a “false positive” (FP) rate for unknown users; that is, the probe belongs to an unknown user, but they are identified as an enrolled user. However, TP and FP are notions from verification problems, in which the probe is compared against a claimed identity, and thus are generally not applicable to open-set identification. Instead, the tradeoff between DIR and FAR, which depends on the threshold of the identifier, is generally represented by a Receiver Operating Characteristic (ROC) curve. The standard practice in biometric identification is to determine the desired FAR, which then sets the threshold and the DIR.
Depending on the biometric application, the cost of incorrectly identifying an unknown user as known may be very different from the cost of incorrectly identifying an enrolled user [47]. For short-term interactions, in which a user will be encountered once or twice, FAR is as important as or more important than DIR. However, for long-term interactions, users will be encountered a greater number of times. Thus, correctly identifying a known user (as in a closed set) becomes more important than correctly rejecting an unknown user (as in an open set). Hence, we introduce the long-term recognition performance loss (\(L\)), which balances DIR and FAR based on the average number of observations per user (\(\overline{n_o}\)), as presented in Equation (12), where \(\alpha\) is the ratio of importance of \(DIR\) compared to \(FAR\).
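Equation (12) is not reproduced above; a form consistent with the stated properties (our reconstruction: \(L=0\) when \(DIR=1\) and \(FAR=0\), with \(\alpha\) weighting DIR against FAR) would be
\[ L = \alpha \, (1 - DIR) + (1 - \alpha) \, FAR, \qquad \alpha \in [0, 1]. \]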
The weights of MMIBN are optimised through this loss function for gender, age, height, and time in the \([0, 1]\) range, along with the quality threshold (\(Q\)), which can change within the \([0, 0.5]\) range. Ideally, \(L=0\), where all unknown users are identified as such (FAR\(=0.0\)) and the known users are correctly identified (DIR\(=1.0\)).
3.7 Normalisation Methods
The scores from each modality must be normalised into a common range (e.g., \([0,1]\)) to ensure a meaningful combination. It is important to choose a method that is insensitive to outliers and provides a good estimate of the distribution [44], such as minmax, tanh [37], softmax [11], and normsum (dividing each value by the sum of values). We introduce hybrid normalisation, which combines the methods that achieve the lowest loss for each modality. In other words, hybrid normalisation uses the best-performing normalisation method for each modality. Extensive tests were made on the dataset obtained from our previous work in Reference [43] to determine the optimal method for each modality (\(F\), \(G\), \(A\), \(H\), and \(T\)). The long-term recognition performance loss was compared for each combination of an individual modality with face recognition (\(F\), \(F\)-\(G\), \(F\)-\(A\), \(F\)-\(H\), \(F\)-\(T\)) by optimising the weights for each of the combinations. The resulting hybrid normalisation uses normsum for face, gender, and height; tanh for age; and softmax for time of interaction.
4 Recognition Module
While MMIBN can be applied on other platforms, its main purpose is to enable incremental user recognition in long-term human-robot interaction in the real world. The proposed approach does not require heavy computing; therefore, it is suitable for use on commercially available robots. We employ this system on Pepper and NAO\(^{7}\) robots, which are amongst the most commonly used robots in HRI research [53], for our experiments (as described in Section 8). These robots are operated by the NAOqi\(^{8}\) software, which includes different modules that allowed us to extract face similarity scores, gender, height, and age estimations from a single image through the Recognition Module in Figure 13 (Appendix A). The internal states of the proprietary algorithms (developed by OKAO) are inaccessible; hence, we assume that the gender and age estimations are not used to obtain the face similarity scores and that they are conditionally independent of the FR results, even though they are obtained from the same 2D image. The height estimation in NAOqi is measured through the 3D sensor (in the eyes) of the Pepper robot, and through the face position in the 2D image and geometric transformations (of the camera relative to the robot) for the NAO robot. Because the system relies on only one primary biometric, in the absence of facial information the user is not recognised, since soft biometric information alone would not be sufficient to estimate the identity.
MMIBN can be used with any identifier software. The reason the NAOqi identifiers were chosen is their capability for incremental recognition and their real-time performance; in other words, these algorithms work on a single CPU on a robot without requiring preliminary training. In contrast, state-of-the-art deep learning methods for face recognition (such as Dlib [50]) are not optimised for low computational power systems; hence, they may require a vast amount of time for encoding images, recognition, and retraining,\(^{9}\) which makes them unsuitable for real-time open world user recognition on a robot. Similarly, OpenFace\(^{10}\) [2], which is an implementation of FaceNet [79] and a popular closed-set face identification method, was found to be unsuitable for real-world HRI, because the classifier needs to be retrained after each new user enrolment with all the available data (batch learning of images for the new user instead of incremental learning), and the training time (albeit small) increases with the increasing number of users [2]. In addition, preliminary evaluations of OpenFace\(^{11}\) produced unpromising results in new user identification. For instance, the first author was consistently recognised as Anne Hathaway with high confidence (85 to 99.2%), despite the fact that the classifier was trained on only 10 users with 600 images per user (i.e., the classifier should be very accurate in identifying known users) and the author bears no strong resemblance to her. Nevertheless, it is possible to use OpenFace or other identifiers, instead of the NAOqi user recognition algorithms, for obtaining the multi-modal biometric information for MMIBN.
5 Multi-modal Long-Term User Recognition Dataset
Our prior work provided evidence that the proposed model is suitable for long-term HRI in the real world. However, the optimised parameters of the model could not be generalised to a larger population due to the limited number of users and their narrow age range in that study. Moreover, collecting a diverse training set within a long-term real-world HRI scenario is very challenging. To the best of our knowledge, the only publicly available dataset that contains the soft biometrics used in our system (except for the time of interaction) together with face images is BioSoft [74]. However, due to the low number of subjects (75) and the lack of numeric height values, we decided to create our own Multi-modal Long-term User Recognition Dataset.
Datasets that contain images in the form of “mugshots,” such as the NIST Mugshot Identification Database,\(^{12}\) do not represent real-world HRI interactions, in which the images obtained from the robot's camera may vary greatly depending on the users' actions and the environmental conditions. Therefore, it is important to use an image dataset with real-world variations, along with ground truth values of identity, gender, and age of users, to assess the performance of our model and the corresponding identifiers in similar conditions. The largest publicly available dataset of face images with gender and age labels is the IMDB-WIKI dataset [71, 72], which contains more than 500K images of 20K celebrities with a wide age range. As can be observed in Figure 3, the images in this dataset may contain bad lighting conditions, occlusions, oblique viewing angles, a variety of facial expressions, partial faces of other people, face paint and disguise, and black-and-white images, because the images come from movies, TV series, and events.
In addition to images, the estimated height of the user and the time of interaction with the robot would be necessary for user recognition in various HRI scenarios, where the users will be encountered sequentially over time. Thus, we created the Multi-modal Long-term User Recognition Dataset by (1) sampling a subset of the IMDB-WIKI image dataset and (2) artificially generating height estimations and various times of interaction to simulate repeated encounters of the users with the robot. The resulting dataset contains 200 users (101 females, 98 males, and 1 transgender person; the age range is 10 to 63) with 10 to 41 images per user, adding up to 5,735 images, along with height estimations, various (patterned and random) times of interaction, and a database of users' names, genders, ages, and heights. Moreover, NAOqi identifier estimations (face similarity scores, gender, and age estimations) are obtained for each image and provided alongside the artificial height estimations and the times of interaction to simulate the information that would be acquired from a robot (e.g., NAO or Pepper) in an HRI scenario. The Multi-modal Long-term User Recognition Dataset is available online.\(^{13}\)
5.1 Image Sampling
In the scope of this work, only one user is assumed to be present in each image; hence, the cropped faces of the IMDB dataset are used. To simulate an open world HRI scenario, where the users will be met on consecutive days or weeks, we chose images of users that are from the same year. Furthermore, we assume that the average number of times a user will be observed is \(\overline{n_o} \ge 10\), which is a reasonable assumption for long-term HRI. Hence, we chose celebrities who each have more than 10 images corresponding to the same age. Moreover, to assess the incremental learning capabilities of our model with a user database that is more realistic for HRI (i.e., sufficiently large, with 100 to 200 users instead of thousands of users), we (randomly) sampled 200 users out of the 20K celebrities.
To create a diverse set of ages in the dataset, images corresponding to an age within the five most common ages (25, 26, 28, 30, 31) in the set were randomly rejected (with 50% probability) during the selection. For instance, Anne Hathaway has sufficient images corresponding to ages 25 and 27 in the IMDB-WIKI dataset. However, 25 is among the five most common ages; thus, with a 50% chance, this set of images was excluded from the selection, and the images of Anne Hathaway corresponding to age 27 were chosen instead. This also caused some celebrities who only have images corresponding to a single age in the dataset to be excluded from the selection. The resulting age range is 10–63, with a mean age of 33.04 (SD\(=9.28\)).
Subsequently, the dataset is cleaned in three steps, by removing: (1) images with a resolution lower than 150\(\times\)150, (2) images without a face detected by NAOqi, and (3) images that erroneously correspond to another person. Furthermore, in order of user appearance (as detailed further in Section 6.2), the NAOqi identifiers are applied to the selected images to obtain face similarity scores, gender, and age estimations. If the user has not been previously encountered, then the same image is used to identify the user before and after enrolment to the face database in NAOqi.
5.2 Height and Time of Interaction
Height was found to be the most important soft biometric in determining identity in Reference [43]. To validate whether this finding persists for a large number of users with diverse characteristics, and to optimise its weight for application to real-world HRI experiments, we artificially created height data for each user. To keep the data realistic and model the differences between the estimated heights, Gaussian noise with \(\sigma =6.3\) cm (as found in Reference [43] for the NAOqi height estimation) is added to the actual heights of the users obtained from the web.
Given our assumption that the users will be encountered at least 10 times in long-term HRI, we created two datasets: (1) D-Ten, where each user is observed precisely 10 times, e.g., 10 return visits to a robot therapist, and (2) D-All, in which each user is encountered a different number of times (10 to 41 times). Two types of distribution are considered for the time of interaction: (1) patterned interaction times in a week, modelled through a Gaussian mixture model, where the user will be encountered at certain times on specific days, which applies to HRI in rehabilitation and education; and (2) random interaction times, represented by a uniform distribution, such as in domestic applications with companion robots, where the user can be seen at any time of the day in the week. As a result, we created four (sub)datasets as part of the Multi-modal Long-term User Recognition Dataset: D-TenUniform, D-TenGaussian, D-AllUniform, and D-AllGaussian.
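A minimal sketch of this data generation follows (ours, not the dataset-generation code). The \(\sigma =6.3\) cm noise comes from Reference [43]; the Gaussian mixture components (visit slots and spread) below are illustrative assumptions, not the article's values:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_height_estimates(true_height_cm, n_obs, sigma=6.3):
    """Noisy height estimates: Gaussian noise with sigma = 6.3 cm,
    as found for the NAOqi height estimation in Reference [43]."""
    return true_height_cm + rng.normal(0.0, sigma, size=n_obs)

MINUTES_PER_WEEK = 7 * 24 * 60

def simulate_interaction_times(n_obs, patterned=True):
    """Times of interaction within a week (in minutes since Monday 00:00).

    Patterned case: a Gaussian mixture around a few recurring visit slots
    (illustrative values). Random case: uniform over the week.
    """
    if patterned:
        visit_slots = rng.choice([660, 1100 + 1440 * 2], size=n_obs)  # e.g., Mon 11:00, Wed evening
        times = rng.normal(visit_slots, 60.0)
    else:
        times = rng.uniform(0, MINUTES_PER_WEEK, size=n_obs)
    return np.mod(times, MINUTES_PER_WEEK)

heights = simulate_height_estimates(168.0, n_obs=10)
times = simulate_interaction_times(10, patterned=True)
```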
6 Evaluation
In this section, we evaluate our proposed models based on the hypotheses presented in Section 6.1. The procedure for creating the cross-validation sets is described in Section 6.2. Initially, the parameters of the multi-modal incremental Bayesian network (Section 6.3) are optimised for open world recognition in long-term interactions in Section 6.4. Using those parameters, the model is compared to face recognition and soft biometrics on the Multi-modal Long-term User Recognition Dataset for the training set, closed-set, and open-set tests in Section 6.5.
6.1 Hypotheses
H1 Our proposed multi-modal incremental Bayesian network will improve user recognition compared to face recognition alone, as measured by a decrease in the long-term recognition performance loss (\(L\)) and an increase in the identification rate of known users (DIR).
H2 Online learning will improve user recognition over a non-adaptive model, as measured by a decrease in \(L\) and an increase in DIR.
H3 Hybrid normalisation will outperform the individual normalisation methods.
H4 Recognition will improve when assumptions can be made about the temporal interaction pattern of the user; when the time of interaction is uniformly distributed, the loss \(L\) will be higher.
These hypotheses will be validated with various analyses, as summarised in Table 1.
6.2 Procedure
Repeated k-fold cross-validation is used to evaluate the model stability and performance. The procedure is described in Algorithm 1 in Appendix B. Two methods for creating validation folds are used, namely, OrderedKFold and ShuffledKFold. In OrderedKFold, users are introduced one-by-one to the system without any repetitions of previous users during enrolment; the order of repeated interactions is random after enrolment. In ShuffledKFold, there can be repetitions of the previous user(s) before another user is introduced, because the order of the overall samples is random. OrderedKFold is similar to batch learning in an incremental learning sense, whereas the iteration (repeat) created by ShuffledKFold is closer to a real-world scenario. Our aim is to evaluate whether there are any performance differences between the two cases and to show that the model is stable across several repeats. A stratified random bin order is used to have a different initial bin and final bin in each fold, ensuring a different enrolment order of users and a different test set, respectively. We chose K\(=5\) folds and R\(=11\) repeats.
Each dataset (D-Ten and D-All) is divided into two sets with 100 users each. The first set is then divided through the cross-validation procedure with an 80%–20% ratio into the training set (first four bins, corresponding to 800 samples in D-Ten and 2,308 in D-All) and the closed-set (training) (final bin, corresponding to 200 samples in D-Ten and 578 in D-All). The open-set is created from the remaining 100 users (800 samples in D-Ten, 2,280 in D-All). The closed-set (open) is similar to the closed-set (training) in that it corresponds to the final bin in each fold (200 samples in D-Ten, 569 in D-All). The open-set evaluation is made by introducing the open-set samples after the training set; that is, 100 users are enrolled in the system and recognised multiple times before the introduction of 100 new users. However, the results for the open-set do not include the results for training.
The only difference between Gaussian and uniform datasets is the time of the interaction for each sample; that is, the order of the samples is the same.
For online learning, the likelihoods are learned during the training phase (training and open-set cases), and the learned likelihoods are used without online learning for the closed-set cases.
6.3 Description of Variables
Given our datasets and the parameters of our model, we have four independent variables and three dependent variables for analysing the results on the evaluation sets: training, open-set, closed-set (training), and closed-set (open). The dependent variables are DIR in Equation (10), FAR in Equation (11), and the long-term recognition performance loss (shortly, loss) in Equation (12). The independent variables are as follows:
(1) Dataset size: 10 samples per user (D-Ten), random number of samples (D-All)
(2) Timing of interaction: patterned interaction times (Gaussian), random interaction times (uniform)
(3) Model: non-adaptive MMIBN, MMIBN with online learning (MMIBN:OL)
(4) Normalisation method: softmax, minmax, tanh, normsum, and hybrid
6.4 Optimisation of Parameters
The parameters of the MMIBN need to be optimised to achieve the best recognition results. Correspondingly, we conducted several evaluations on the Multi-modal Long-term User Recognition Dataset, as described in detail in Appendix C. Here, we summarise our findings for reasons of perspicuity.
Initially, the loss parameter \(\alpha\) is set to 0.9, based on our assumption of the average number of observations (\(\overline{n_o}=10\)) for long-term interaction (Appendix C.1). Subsequently, the optimum face recognition threshold with the lowest loss for (NAOqi) FR is found to be 0.4 (Appendix C.2).
MMIBN relies on the assumption that the multi-modal biometric information (face, gender, age, height, and time of interaction) is conditionally independent given the identity of the user, since the individual identifiers do not affect each other's results. Accordingly, we assumed that the NAOqi identifiers (face, gender, and age) are conditionally independent of each other, despite relying on the same visual input (a 2D image). Structural learning of the Bayesian network on the Multi-modal Long-term User Recognition Dataset (in Appendix C.3) confirmed this assumption, showing that the naive Bayes classifier model is sufficient and suitable for multi-modal user identification, even when the modalities use the same input. Moreover, the average learned likelihoods in online learning are very close to the initially assumed network parameters in Section 3.4.
Bayesian optimisation\(^{14}\) is applied with these parameters to minimise the loss for each combination of the independent variables (40 conditions) by optimising the weights for soft biometrics and the threshold for the quality of the estimation (see Appendix C.6). Figure 4 shows how the loss decreases during the optimisation, which results in an increase in DIR at the cost of an increase in FAR. The resulting loss of MMIBN is much lower than that of FR, and correspondingly, DIR and FAR are much higher. Note that \(\alpha\) can be adjusted to give more importance to FAR, or a FAR can be set prior to optimisation, which may lead to a different set of optimised parameters.
While the average standard deviation of the NAOqi age estimation is found to be higher (11.0) than in Reference [43] (9.3), age is found to be the most important parameter and height the least (see Appendix C.6), in contrast with the findings in Reference [43]. Due to the higher number of users (200) and the diverse age range (10–63) in the Multi-modal Long-term User Recognition Dataset, these results are more generalisable than our prior work. Moreover, when the ground truths are not taken into account, the standard deviation of age within the estimations is found to be 8.2, which is less than the average. This is due to the appearance of users (e.g., a 30-year-old person may look 25), which suggests that online learning of likelihoods (MMIBN:OL) may provide better recognition performance over time, as the identifiers will get better at identifying users based on their own estimations instead of ground truth values. In addition, NAOqi gender recognition is found to be equally accurate for males and females, with a recognition rate of 0.9 (i.e., users' genders are correctly recognised 90% of the time). Furthermore, using the confidence of the estimations instead of exclusively the estimated biometric data (e.g., estimated gender or age, as described in Section 3.1) allows overcoming deviations in the estimations.
With these optimised parameters, 11 repeats of 5-fold cross-validation were applied for each of the conditions (Appendix C.4), which showed that the MMIBN models are stable across repeats (i.e., no significant difference in loss between repeats) and that the models perform equally well for learning new users sequentially (OrderedKFold, similar to batch learning) and at random intervals (ShuffledKFold, similar to a real-world scenario). The size of the dataset, the timing of interaction, and the normalisation method are found to have significant effects on the performance of the model, whereas the non-adaptive model and the model with online learning performed equally well.
Hybrid normalisation is found to outperform the other normalisation methods in all conditions (Appendix C.5), supporting our hypothesis H3. The models achieved lower loss in D-All than in D-Ten, which shows that the proposed model gets better with the increasing number of recognitions. However, hybrid normalisation with online learning (MMIBN:OL) is found to perform worse than the non-adaptive model (MMIBN), in contrast with our hypothesis H2. Moreover, most methods are found to perform significantly worse when there is no interaction pattern (uniform timing of interaction), as compared to patterned (Gaussian) interactions, supporting our hypothesis H4.
6.5 Comparison to Baselines
On the grounds that the optimised parameters of our proposed MMIBN have been found, we can compare its results to face recognition (FR) and soft biometrics (SB). FR results are obtained from the NAOqi estimations by setting the FR threshold (\(\theta _{FR}\)) to 0.4. SB results are obtained by giving zero weight to FR; that is, only gender and age estimates from NAOqi, artificial height estimates, and the time of interaction are used for identifying a user. The weights of these modalities in SB are the same as in MMIBN, as shown in Figure 19 (Appendix C.6). Similarly, the weights of SB:OL are the same as those of MMIBN:OL.
We transformed a state-of-the-art open world recognition method, the Extreme Value Machine\(^{15}\) [73] (EVM), to accept sequential and incremental data for online learning by adjusting its hyperparameters to use it as a baseline, as described in Appendix D. In the original work, batch learning of 50 classes was used with an average of 63,806 data points at each update, instead of the single data point per update that we use in this work. We compared our methods with the performance of two EVM models: (a) EVM:FR, using NAOqi face recognition similarity scores as data, and (b) EVM:MM, using multi-modal information in the same format as it is used for our methods.
Section 6.5.1 compares the long-term recognition performance loss (shortly, loss) between the models. Appendix C.4 provides evidence that there is a significant correlation between loss and DIR, and between loss and FAR, but no significant correlation is found between DIR and FAR. Hence, the analysis of loss is sufficient to determine how a model performs in comparison to others. Nevertheless, we report the FAR and DIR results of the models in Section 6.5.2 to further observe how the open-set recognition metrics are affected.
6.5.1 Long-term Recognition Performance Loss.
As previously mentioned, the proposed models perform better in terms of loss in D-All than in D-Ten; however, the results for the D-Ten datasets show patterns similar to those of D-All. When the comparison is based on the same number of recognitions for both D-All and D-Ten (equal to the number of samples in D-Ten for all evaluation sets), ANOVA shows no significant difference for sample size (\(p=.67\)); that is, the models perform equally well for D-All and D-Ten for the same number of samples. In other words, it does not matter whether each user is observed the same number of times or not. This also supports that a higher number of samples increases the performance of the models. Hence, the following analysis focuses only on D-All, but any differences in performance between the two datasets will be noted.
We conducted Tukey's Honestly Significant Differences (HSD) tests on the training, open-set, closed-set (training), and closed-set (open) evaluation sets for the D-All datasets with Gaussian and uniform timing of interaction. The corresponding plot is given in Appendix E.1.
The results show that the proposed approaches (MMIBN and MMIBN:OL) decrease the long-term recognition performance loss significantly (\(p\lt .001\)) and substantially compared to FR, supporting the first part of our hypothesis H1. This finding is valid across all datasets (D-Ten and D-All for Gaussian and uniform times).
MMIBN performs equally well between Gaussian and uniform timing for the D-All evaluation sets (i.e., no significant difference, though slightly worse in uniform), whereas in the D-Ten evaluation sets it performs significantly worse in uniform timing. The performance of MMIBN:OL changes depending on the dataset size and the evaluation set (it performs equally well only in the closed-sets in D-Ten, and in the training and closed-set (open) in D-All). Nevertheless, the models have slightly or significantly higher loss in uniform timing as compared to Gaussian, supporting hypothesis H4.
Online learning does not perform better than MMIBN; it increases the loss in all conditions. In fact, except for the training set in D-All and D-Ten and the closed-sets in D-Ten for uniform timing, where MMIBN and MMIBN:OL perform at the same significance level, online learning is significantly worse, which is in contrast with our hypothesis H2.
Furthermore, the results show that soft biometric features (SB and SB:OL) are not able to identify a user on their own. In general, they perform significantly worse than FR. However, when the interaction is time-patterned (Gaussian), SB performs better and closer to FR as compared to uniform timing. Especially for the closed-set (training) in D-All, it is remarkable that the SB features identify the user at the same significance level as FR. SB and SB:OL perform mostly equally well in the D-All datasets, but SB:OL performs significantly worse in several evaluation sets in D-Ten.
EVM:FR performs significantly better (\(p\lt .005\)) than FR across all conditions. EVM:MM is significantly worse than EVM:FR (\(p\lt .01\)) and it does not perform better than FR in most conditions. This shows that although EVM is a good method for clustering face recognition data, it does not perform well with multi-modal data.
MMIBN significantly outperforms (\(p\lt .001\)) both EVM models across all conditions in both D-All and D-Ten. This shows that our proposed approach is significantly better than the state-of-the-art method for incremental open world recognition with multi-modal biometric information. However, the EVM models use online learning instead of fixed learning rates, which could potentially lead to worse performance, as observed for our model. Nevertheless, comparing the EVM models to MMIBN:OL shows that MMIBN:OL significantly outperforms them (\(p\lt .05\) to \(p\lt .001\)) in most cases, except for uniform timing for the open-set and closed-set (open) in D-All and the open-set in D-Ten, in which it performs equally well with EVM:FR.
MMIBN performs equally well between the training and open-set cases, as well as between the closed-sets, which shows that the model scales well with an increase in users (from 100 to 200 users), suggesting that the proposed approach and the optimised weights can generalise. Similar to the results in Reference [73], EVM performs equally well between those sets, showing that the change from batch updates to incremental updates has not affected its ability to scale. The models perform significantly better in closed-sets as compared to the training or open-set due to the lack of unknown users in closed-sets (FAR\(=0.0\)); hence, the loss only depends on DIR.
The models are trained on several examples of the users before the closed-set. The model performance improves with the increasing number of recognitions and stabilises towards the end (around 2,000 recognitions), as can be observed in Figure 5. This supports our initial finding of the performance difference between D-All and D-Ten, given that they perform equally well for the same number of recognitions. Initially, the loss increases with increasing FAR when the users are introduced to the system (represented by dots in the plot). As the number of recognitions increases, the introduction of a new user does not notably increase the loss, as can be observed for the final three new users in the training set. Even though the MMIBN models get better over time, they start performing consistently better than both FR and EVM models throughout both training and closed-set after only a small number of recognitions (15–48 in training, 1–6 in closed-set).
The sudden change at the beginning of the training set is due to the sequential calculation of loss for the time plots: a previously enrolled person was not identified correctly for the first time, which changes DIR from 1.0 to 0.5 (one out of two enrolled users was incorrectly identified). Note that the introduction of new users is in random order due to the ShuffledKFold function, described in Section 6.2. The results for the open-set, as given in Appendix F, show a similar pattern of loss between the open-set and the closed-set (of the open-set cross-validation).
6.5.2 Open-set Identification Metrics: DIR and FAR.
The previously presented results confirm our claims that our proposed multi-modal Bayesian networks perform significantly better than FR, SB, and EVM in long-term interactions. Nonetheless, analysing the open-set identification metrics allows us to understand how the models perform for enrolled and unknown users through DIR and FAR, respectively. The detailed presentation of Tukey's HSD results is given in Appendix E.2.
The results show that the increase in DIR is significant (\(p\lt .001\)) and drastic: from 0.268 for FR to 0.657 with MMIBN and 0.561 with MMIBN:OL, averaging over all the conditions in D-All (timing of interaction and evaluation set). That is a 38.9% increase in identifying the users correctly by using MMIBN, no matter the condition, which is more than double what FR is capable of providing. Hence, our hypothesis H1, that the loss will be reduced and DIR will be increased using our proposed models as compared to FR alone, is fully and strongly supported.
It should be noted that the increase in DIR provided by our network is significantly higher (\(p\lt .001\)) than the DIR of soft biometrics (0.226 on average for Gaussian timing in D-All). This shows that soft biometric data are not sufficient to identify an individual on their own, yet, when combined with the primary biometric, they improve the identification rate significantly (38.9% in D-All, and 31.8% in D-Ten). This conclusion is supported by the datasets where the time of interaction is uniformly distributed (the DIR of SB is 0.013 on average); that is, due to the high variability of time, the identification rate of SB is close to zero. Nevertheless, MMIBN performs equally well in Gaussian and uniform timing within all evaluation sets in D-All, and MMIBN:OL performs equally well in D-Ten. As previously noted for H4, the loss is (slightly or significantly) higher and DIR is (slightly or significantly) lower for all datasets and MMIBN models between Gaussian and uniform timing.
MMIBN significantly outperforms both EVM methods in DIR on all datasets (\(p\lt .001\)). EVM:FR has significantly higher DIR than FR and EVM:MM (\(p\lt .001\)). EVM:FR performs equally well between uniform and Gaussian timing on all datasets, because it is trained only on FR data. The DIR of EVM:MM drops below that of FR for uniform timing on both D-All and D-Ten, which shows that EVM is not well suited to incorporating time information, since the pattern of interaction with a user might not be known beforehand. Similarly, MMIBN:OL performs worse for uniform timing in D-All, but it always performs significantly better than or on par with EVM:FR.
FR performs similarly in open and closed-sets in terms of loss, because it has a significantly lower FAR than the MMIBN models. While a low FAR is desirable, the underlying reason here is that FR recognises users very poorly on larger datasets and fails to identify them, because the highest similarity score returned by the identifier falls below the threshold (\(\theta _{FR}=0.4\)). However, as described in Appendix C.2, this threshold ensures the lowest loss for FR.
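The open-set decision applied to the FR scores amounts to a simple threshold test, sketched below in Python. The function and variable names are illustrative, not the paper's implementation; only the threshold value comes from the text above.

```python
# Minimal sketch of the open-set decision on FR similarity scores: if the
# highest score falls below the threshold, the user is labelled unknown (ID 0).

THETA_FR = 0.4  # threshold that minimises the FR loss (Appendix C.2)

def fr_decision(similarities, theta=THETA_FR):
    """similarities: dict mapping enrolled user ID -> FR similarity score."""
    if not similarities:
        return 0  # no enrolled users yet: everyone is unknown
    best_id = max(similarities, key=similarities.get)
    return best_id if similarities[best_id] >= theta else 0
```

With poor scores on a large dataset, most recognitions fall below \(\theta _{FR}\) and are labelled unknown, which explains the simultaneously low FAR and low DIR of FR.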
The FAR of the proposed models is high because of the combination of all modalities, which increases the probability of confusing an unknown user with an enrolled one. Possible solutions to this problem are proposed in Section 7. For our proposed models, the FAR in the training set is generally slightly lower than that of the open-set, because of the higher number of enrolled users; however, there are no significant differences across the datasets for MMIBN, supporting that the model scales to a larger dataset without a significant decrease in performance.
In the training set, there is no significant difference between MMIBN and EVM models in FAR, and MMIBN:OL performs significantly better than the EVM models for uniform timing. In contrast to MMIBN, EVM provides significantly lower FAR in open-sets than in training sets. The authors state in Reference [73] that this is due to its ability to tightly bound class hypotheses by their support.
6.5.3 User-specific Analysis.
Confusion matrices presented in Figure 6 show how users were identified throughout the training set in D-All for one fold of the cross-validation, with 0 as the ID of the unknown user and the remaining numbers corresponding to the IDs of the enrolled users. The heat map represents the percentage of recognitions of each true user assigned to each estimated user. Ideally, the diagonal should be entirely dark red, indicating that users are correctly identified. However, FR (item A) mostly identifies users as unknown, causing the column for estimated ID 0 to be mostly red, which results in both a low FAR and a low DIR. MMIBN (item B) has mostly red dots on the diagonal but confuses some users with other enrolled users, as can be seen from the light-blue dots scattered across the matrix. MMIBN:OL shows a similar pattern with slight deviations.
Even though EVM:FR (item C) uses only FR information, its confusion matrix differs from that of FR. The misidentifications are highly concentrated on the final 10 users, suggesting that either FR or EVM might be subject to the catastrophic forgetting problem. Using multi-modal data overcomes that problem, as can be seen for EVM:MM (item D), where misclassifications are evenly distributed, similar to MMIBN. However, the diagonals of the EVM models have notably fewer red entries than MMIBN.
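The bookkeeping behind such a heat map is a row-normalised confusion matrix, sketched below. This is an illustration of the computation only; the figure itself was produced from the cross-validation outputs.

```python
import numpy as np

# Sketch of the row-normalised confusion matrix underlying Figure 6: entry
# (i, j) is the percentage of recognitions of true user i estimated as user j,
# with ID 0 reserved for "unknown".

def confusion_percentages(true_ids, est_ids, n_users):
    counts = np.zeros((n_users + 1, n_users + 1))  # +1 row/column for ID 0
    for t, e in zip(true_ids, est_ids):
        counts[t, e] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return 100.0 * counts / np.maximum(row_sums, 1)  # avoid division by zero
```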
The significant differences in the identification of users over the 5 folds of cross-validation, presented in Appendix E.3, reveal another striking result. FR does not perform equally well across users: there are significant differences in identification between them. Our proposed approach MMIBN balances the performance across users, thereby reducing recognition bias in the system while significantly improving overall performance compared to FR. Online learning (MMIBN:OL and EVM:FR) balances the performance further, despite the decrease in overall performance relative to MMIBN. EVM:MM shows a similar pattern.
Figure 7 demonstrates examples from D-All Gaussian where face recognition fails to recognise the user due to a low similarity score (\(\lt \theta _{FR}=0.4\)), whereas our proposed model identifies the user correctly based on soft biometric information. The quality of the estimation (\(Q\)) varies depending on the highest FR similarity score, as well as on the disagreement between modalities. For example, for the third user (Sandra Oh), the highest (rank 1) FR similarity score is very low and corresponds to David Schwimmer, who in the dataset is 28 years old, has a height of 185 cm, and has an enrolment time of interaction of Tuesday at 18:16. Age did not help differentiate the user from the incorrect estimation, whereas height and time of interaction increased the probability that the user is Sandra Oh, resulting in a correct estimation, but with a low quality score (\(0.35\gt \theta _Q=0.013\)). The second user (Gary Coleman) was identified correctly by FR, with the highest similarity score close to, but slightly below, \(\theta _{FR}\). This estimate was reinforced by the age estimation and the time of interaction, which compensated for the incorrect recognitions of gender and height, yielding a high quality score (7.44).
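The direction of these corrections can be illustrated with a much-simplified fusion sketch. The naive-Bayes multiplication of per-modality likelihoods below, with hypothetical numbers, is only meant to show how height and time can overturn a weak FR estimate; the actual MMIBN structure, modality weights, and the definition of \(Q\) are given earlier in the paper.

```python
import numpy as np

# Much-simplified sketch of multi-modal fusion, in the spirit of the Sandra Oh
# example: plain naive-Bayes multiplication of per-modality likelihoods over
# three hypothetical users. Not the paper's MMIBN implementation.

def fuse(modality_likelihoods):
    """modality_likelihoods: list of arrays, one per modality, indexed by user."""
    posterior = np.ones_like(modality_likelihoods[0], dtype=float)
    for lik in modality_likelihoods:
        posterior *= lik
    return posterior / posterior.sum()

# Hypothetical numbers: FR weakly favours user 1, but height and time of
# interaction favour user 2; age does not differentiate the users.
fr     = np.array([0.10, 0.40, 0.30])  # low rank-1 similarity: weak evidence
age    = np.array([0.30, 0.35, 0.35])  # nearly uninformative
height = np.array([0.20, 0.20, 0.60])
time   = np.array([0.20, 0.20, 0.60])
print(fuse([fr, age, height, time]))   # posterior mass shifts to user 2
```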
6.5.4 Real-time Capabilities.
In contrast to state-of-the-art deep learning methods, the proposed models can run on a commercial robot with low computational power (on a single CPU of the Pepper robot) and require only a small amount of time for execution. In addition to the time required by FR and the other modalities (M \(=0.14\) s, SD \(=0.001\)), MMIBN models take 0.01 s for recognition, significantly outperforming both EVM:FR and EVM:MM, which take 0.32 s and 0.34 s, respectively. For enrolling new users, MMIBN requires significantly less time (0.39 s, \(p=.002\)) for scaling the Bayesian network, compared to MMIBN:OL, which takes 0.54 s, of which 0.17 s is due to online learning. There is no significant difference between MMIBN:OL and the EVM models for enrolment (EVM:FR takes 0.48 s and EVM:MM 0.52 s), with 0.20 s and 0.23 s for online learning, respectively. The longer time required by EVM:MM compared to EVM:FR shows that online learning takes longer when there is more information to be learned per user. Note that the time required by MMIBN has decreased from 0.3 s in Reference [43] to 0.01 s, as a result of optimising the MMIBN algorithm.
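A minimal sketch of why enrolment stays cheap is given below: adding a user extends the identity prior and each modality's likelihood table by one entry, without revisiting stored samples. The class, its methods, and the reserved unknown-mass prior are all hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

# Minimal sketch of constant-time enrolment in a multi-modal identity model:
# enrolling a user appends one entry per modality instead of retraining.

class IncrementalIdentityModel:
    def __init__(self):
        self.n_users = 0
        self.likelihood_tables = {}  # modality name -> list of per-user params

    def enrol(self, user_params):
        """user_params: dict mapping modality name -> initial parameters."""
        self.n_users += 1
        for modality, params in user_params.items():
            self.likelihood_tables.setdefault(modality, []).append(params)

    def prior(self, p_unknown=0.5):
        # Uniform prior over enrolled users, with mass reserved for "unknown".
        if self.n_users == 0:
            return np.array([1.0])
        enrolled = (1.0 - p_unknown) / self.n_users
        return np.concatenate(([p_unknown], np.full(self.n_users, enrolled)))
```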
Moreover, in comparison to deep learning approaches, which require “big data” for pretraining, our proposed models are able to start from a state of no enrolled users, learn users continuously and incrementally, and surpass FR performance after a small number of recognitions (e.g., 48 in Figure 5).
9 Conclusion
User identification is mostly regarded as a solved problem in the computer vision field. What remains unsolved is its application to the real world on systems with low computational power, such as commercial robots. The core problem that we face within HRI for personalising the interaction is to recognise unknown users and enrol them incrementally, which is classified as open world recognition. However, there exists a limited amount of research on this topic, and none of the available methods has been evaluated on user identification. These methods use batch learning of classes instead of sequential learning, which is unlikely to match HRI settings, because the users might not all be available at the same time; it is more likely that the same users will be encountered several times before another is introduced.
Moreover, the computer vision field is not generally concerned with long-term interactions. Hence, correct identification of the enrolled users (DIR) and incorrect identification of the unknown users (FAR) are treated as equally important, whereas the former is more valuable in long-term interactions, since the same user is expected to be recognised several times and the fraction of newly enrolled users will be much smaller. Furthermore, the appearance of the user may change over time, which requires updating the user database accordingly through online learning. In addition, combining soft biometrics, which are ancillary physical or behavioural characteristics (e.g., age) that can be extracted from the primary biometric data (e.g., face) or be available through other sources of information (e.g., time of interaction), can improve recognition accuracy.
In this work, we addressed these open challenges and presented a multi-modal incremental user recognition approach with online learning that is suitable for long-term HRI in the real world. We validated the approach in a variety of settings using an artificially generated multi-modal dataset and through three real-world HRI experiments, thereby extending the findings of our prior work [43] to a large number of users.