2.1. Human–Computer Intelligent Interaction (HCII)
HCI is “a discipline concerned with the design, evaluation, and implementation of interactive computing systems for human use and with the study of major phenomena surrounding them” [
22]. HCI is an interdisciplinary field, initially grounded in computer science, psychology, and ergonomics and later joined by other disciplines, such as social science and cognitive science [
23]. In HCI, the user’s activity includes the following aspects [
24]: (1) physical, which determines the mechanics of the interaction between human and computer; (2) cognitive, which deals with the way users understand the system and interact with it; and (3) affective, which aims to make the interaction pleasurable and to influence the user’s attitudes and emotions so that they continue using the machine.
HCI is a two-way communication: (1) computer to user, and (2) user to computer. In computer-to-user communication, the main challenge is how to present information efficiently. Modern UIs with innovative interface technology (e.g., virtual reality, 3D displays, etc.) have enabled new ways of delivering information to the user. A virtual agent or avatar that can mimic human behavior is another example of an innovative solution for delivering information to users. In user-to-computer communication, the main challenge is enabling the user to command the computer naturally. By combining human-centered design with leading-edge technologies, UIs are moving from keyboards, mice, and touchscreens to IUIs that use different modalities for computer commands, including voice recognition, computer vision, and others. Novel HCII systems equipped with AI methods and techniques can respond to verbal commands (e.g., speech-based systems, such as Alexa from Amazon [
25]), and non-verbal commands (e.g., Soli from Google [
26]).
Figure 1 presents an example of a multimodal HCII system architecture that provides multimodal input/output capabilities for intelligent interaction with the user. Compared to standard HCI systems based on a keyboard and a mouse, the multimodal input/output capabilities of HCII systems provide more flexible and expressively powerful interaction with a computer. An HCII system usually provides multimodal input, enabling the capture and processing of two or more input modalities (e.g., touch, gaze, body movement, virtual keyboard, etc.). The user input can be based on standard input devices (e.g., keyboard, mouse, touch, etc.), recognition-based technologies (e.g., speech, gesture, emotion, etc.), or sensor-based technologies (e.g., acceleration, pressure, brain signal, etc.) [
1]. HCII systems also support multimodal or multimedia output, involving two or more types of information received as feedback by the user during HCI. Multimedia output provides different media types within one modality, such as vision (e.g., still images, virtual reality, video images, etc.), whereas multimodal output provides, for example, visual, auditory, and tactile feedback to the user [
1].
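To make this processing chain concrete, the following minimal Python sketch (with hypothetical class and modality names, not taken from the cited architecture) illustrates one simple way such a system can normalize input events from different modalities into a common structure and apply a late-fusion rule before responding.

```python
# Minimal sketch of a multimodal input pipeline, assuming a simple
# event-based design; class and modality names are illustrative only.
from dataclasses import dataclass
from typing import List

@dataclass
class ModalityEvent:
    modality: str      # e.g., "speech", "gesture", "gaze", "keyboard"
    payload: str       # recognized command or raw token
    confidence: float  # recognizer confidence in [0, 1]
    timestamp: float   # seconds since interaction start

def fuse_events(events: List[ModalityEvent], window: float = 1.0) -> str:
    """Late fusion: group events that arrive within a short time window
    and keep the interpretation with the highest confidence."""
    if not events:
        return "no-input"
    latest = max(e.timestamp for e in events)
    recent = [e for e in events if latest - e.timestamp <= window]
    best = max(recent, key=lambda e: e.confidence)
    return f"{best.payload} (via {best.modality})"

if __name__ == "__main__":
    events = [
        ModalityEvent("speech", "open settings", 0.72, 10.2),
        ModalityEvent("gesture", "swipe-left", 0.91, 10.5),
    ]
    print(fuse_events(events))  # -> "swipe-left (via gesture)"
```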
In the last decade, there has been a rapid increase in the literature in the HCII field. HCII aims to provide natural ways for humans to use computers and technology in all aspects of people’s future lives and is relevant to applications such as smart homes, smart offices, virtual reality, education, and call centers [
27,
28]. For an effective HCII, computers must have the communication skills of humans [
29] to be able to interact with the users naturally [
30] by enabling interactions that are able to mimic human–human interactions [
27]. HCII solutions must implement at least some kind of intelligence in perceiving and/or responding to users [
24].
A natural human–human interaction consists of a mix of verbal signals (e.g., speech, intonation, etc.) and non-verbal signals (e.g., gestures, facial expressions, eye motions, body language, etc.). Non-verbal information can be used for predicting and understanding a user’s inner (cognitive and affective) state of mind [
31]. To provide genuinely intuitive communication, computers need their own sense of verbal and non-verbal signals in order to understand both the message and its context [
32]. A robust, efficient, and effective HCII system must therefore be able to activate different channels (e.g., an auditory channel that carries speech and vocal intonation, a visual channel that carries facial expressions and gestures, etc.) and modalities (e.g., sense of sight, hearing, etc.) that enable effective detection, recognition, interpretation, and analysis of various human physiological and behavioral characteristics during the interaction [
33,
34].
Research and development in both hardware and software have enabled the use of speech, gestures, body posture, different tracking technologies, tactile/force feedback devices, eye-gaze, and biosensors to develop new generations of HCI systems and applications [
33]. HCI systems that use only one type of input/output modality are called unimodal systems. Multimodal HCI systems, on the other hand, use multiple input or output modalities to communicate with the user, exhibiting some form of intelligent behavior in a particular domain [
24,
33]. Based on the way of information transfer from user to the computer, UIs can be divided into [
35]: (1) contact-based interfaces (e.g., keyboard and mouse based interfaces, touch screen interfaces, etc.), (2) speech-based interfaces (e.g., spontaneous speech, continuous speech, acoustic nonspeech sounds, etc.), (3) gesture-based interfaces (e.g., finger pointing, spatial hand gestures, sign language, head gestures, user behavior, etc.), (4) facial expression-based interfaces (e.g., facial expressions, including those reflecting emotions, articulation of lips, gaze direction, eye winking, etc.), (5) textual and hand-writing interfaces (e.g., handwritten continuous text, typed text, etc.), (6) tactile and myo interfaces (e.g., sensor gloves and body-worn sensors, EMG sensors, etc.), and (7) neural computer interfaces (e.g., EEG signal, evoked potential, etc.).
Based on the nature of the modalities used, HCI systems can be divided into (1) visual-based HCI, which uses various visual information about the human’s responses while interacting with the machine, (2) audio-based HCI, which uses information acquired from different audio signals, and (3) sensor-based HCI, which combines a variety of areas with at least one physical sensor used between the user and the machine to provide the interaction [
24]. Visual-based HCI research deals with developing solutions for efficiently understanding various human responses from visual signals, including facial expression recognition, gesture recognition, gaze detection, and other areas. Audio-based HCI research includes speech recognition, speaker recognition, audio-based emotion recognition, and others. In sensor-based HCI, solutions are built using various sensors, which can be very primitive (e.g., a pen, a mouse, etc.) or sophisticated (e.g., motion-tracking sensors, EMG sensors, EEG devices, etc.).
Application fields of HCII are heterogeneous, and the creation of intelligent user interfaces aims to [
36]: (1) change the way information is displayed based on users’ habits in a particular operating environment, (2) improve human–computer interaction by processing natural language, and (3) enable users with limitations to interact with technological devices (e.g., improving the accessibility of interfaces for blind users, using different sensors to acquire data about user movements and translate them into commands sent to a wheelchair, interfaces for cognitively impaired users, etc.). More natural and efficient intelligent interaction paradigms, such as gesture interaction, voice interaction, and face recognition, are widely being implemented in new HCI applications (e.g., smart home solutions, autonomous cars, etc.) [
37]. In contrast to the conventional mechanisms of passive manipulation, HCII integrates versatile tools, such as perceptual recognition, AI, affective computing, and emotion cognition, to enhance the ways humans interact with computers [
38,
39].
Novel IUIs are not necessarily built to replace traditional interfaces that use input devices such as a mouse and a keyboard, but rather to complement them when needed or appropriate. Solutions that allow users to control the computer with speech and hand gestures are especially useful in virtual environments because, for example, a keyboard and a mouse are of limited use as input devices in a 3D environment. Speech recognition solutions that recognize speech from a visual signal (e.g., lip reading) can complement audio-based speech recognition in noisy environments where recognition from the audio signal alone cannot perform well. Intelligent HCI systems can also be used to enable efficient human-to-human interaction when this is not possible due to the limitations of end-users. For example, an intelligent HCI system that combines AI with wearable devices (e.g., data gloves) can solve communication problems between a hard-of-hearing and a non-disabled person [
40]. Mobile IUI solutions today make use of the plethora of advanced sensors available in smartphones, such as camera, microphone, keyboard, touchscreen, depth sensors, accelerometer, gyroscope, geolocation sensor, barometer, compass, ambient light sensor, proximity sensor, etc., which allow the combination of inputs and enrichment of HCII interactions [
21].
As discussed above, the essential functions of HCII rely on clear signals from which a person’s emotional state can be inferred [
30]. Emotions are complex processes composed of numerous components, including feelings, bodily changes, cognitive reactions, behavior, and thoughts [
41]. Emotion is a psycho-physiological process triggered by the conscious and unconscious perception of a situation or an object and is often associated with mood, temperament, personality, disposition, and motivation [
42]. Intelligent systems providing HCII must, for example, through emotion recognition, be able to perceive the user’s emotions, show empathy, and respond appropriately [
30,
43]. By understanding emotions in natural interactions, HCII systems can make smarter decisions and provide better interactive experiences [
28]. Emotional interaction makes human–computer interaction more intelligent: it makes the interaction natural, cordial, vivid, and emotional [
43]. Because automatic emotion recognition also has many applications in HCII, it has attracted considerable recent attention in AI-empowered HCI research [
44]. Another essential and challenging task related to emotion recognition in HCII is speech emotion recognition [
45], which has become the heart of most HCI applications in the modern world [
46]. For many years, eye-tracking technology has been used for usability testing and for implementing various solutions for controlling the user interface. Eye-tracking-based UIs include, for example, various assistive technology solutions for people with severe disabilities (e.g., [
47]) who cannot use their arms and standard input devices. However, eye-based cues (e.g., eye gaze) are another area of increasing interest to the research community for automatic emotion classification and affect prediction [
48].
Facial expression is a powerful, natural, and direct way humans communicate and understand each other’s affective states and intentions [
29]. Facial expression is considered a significant gesture of social interaction and one of the most significant nonverbal behaviors, through which HCI systems may recognize humans’ internal emotional or affective state [
49]. The clues for understanding facial expressions lie not only in global facial appearance but also in informative local dynamics among different but easily confused expressions [
38]. Automatic facial expression recognition plays a vital role in HCII, as it can help non-intrusively apprehend a person’s psychopathology [
50]. Motivated by this significant characteristic of instantly conveying nonverbal communication, facial expression recognition plays an intrinsic role in developing the HCII and social computing fields [
51] and is becoming a necessary condition for HCII [
50]. With many applications in day-to-day developments and other areas, such as interactive video, virtual reality, videoconferencing, user profiling, games, intelligent automobile systems, entertainment industries, etc., facial expression has an essential role in HCII [
52].
Human gesture recognition has also become a pillar of today’s HCII, as it typically provides more comfortable and ubiquitous interaction [
2]. Human gestures include static postures (e.g., hand posture, head pose, and body posture) and dynamic gestures (e.g., hand gestures, head gestures like shaking and nodding, facial action like raising the eyebrows, and body gestures) [
53]. Hand gestures, for example, have been widely acknowledged as a promising HCI method [
54]. Information about head gestures obtained from head motion is valuable in various applications, such as autonomous driving solutions or assistive tools for disabled users [
55].
Furthermore, existing research has proposed various data sources, sensors, and advanced AI methods and algorithms for innovative HCII solutions, such as user activity recognition (e.g., [
56,
57]), depression recognition (e.g., [
58,
59,
60,
61]), affection recognition (e.g., [
62]), speech recognition (e.g., [
63]), user’s intention recognition (e.g., [
64,
65]), and others.
2.2. Sensors Technology for HCII
As stated in the previous section, user input can be based on standard input devices, recognition-based technologies, or sensor-based technologies. Recognition-based technologies can be implemented using invasive or non-invasive methods. Invasive recognition-based technologies use sensors attached to a person (e.g., an accelerometer attached to the chest, waist, or other body parts). In contrast, non-invasive recognition-based technologies use non-attached sensors, e.g., vision-based sensors, such as a camera, thermal infrared sensor, depth sensor, smart vision sensor, etc. [
66]. Sensor-based HCI technologies are built using various sensors, which can be very primitive (e.g., a pen, a mouse, etc.) or very sophisticated (e.g., motion tracking sensors, EMG sensors, EEG sensors, etc.).
HCI devices with sensor capabilities can be divided into [
67]: standard input/output devices (e.g., mouse, keyboard, touch screen, etc.), wearables (e.g., smartwatch, smartphone, band, glove, smart glasses, etc.), and non-wearables (e.g., camera, microphone, environmental sensors, etc.). Wearable devices with different kinds of built-in sensors (mechanical, physiological, bioimpedance, and biochemical) can provide data about the physical and mental state of the user [
68]. Wearable sensors, for example, are increasingly being used for measuring, in particular, biological signals, such as heart rate or skin conductance [
69].
Sensors can be divided into unimodal sensors, providing data about one single signal (e.g., accelerometer), and multimodal sensors (e.g., Body Area Sensor Network [
70], Kinect, RespiBAN [
71], Empatica E4 [
71], etc.). An accelerometer is a sensor that captures the vibration and orientation of systems that move or rotate. Accelerometers have been used for activity recognition and physical exercise tracking, such as aerobic exercise [
72], gesture recognition [
53], human activity recognition [
73], and others. The Kinect sensor consists of an RGB camera, a depth sensor, an infrared sensor, and a microphone array. The depth sensor measures the three-dimensional positions of objects in its space [
72]. Sensors used for activity recognition are typically classified as ambient or wearable sensors, where ambient sensors are attached to objects in the environment with which users interact [
74].
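As a rough illustration of how raw accelerometer readings are commonly turned into activity-recognition features, the sketch below computes simple per-axis statistics over sliding windows; the window length, feature set, and synthetic data are illustrative assumptions rather than the method of any cited study.

```python
# Sketch of windowed feature extraction from a 3-axis accelerometer,
# as commonly used for activity recognition; the data are synthetic.
import numpy as np

def window_features(samples: np.ndarray, win: int = 50, step: int = 25) -> np.ndarray:
    """samples: (N, 3) array of x/y/z acceleration.
    Returns one feature row per window: mean, std, and signal magnitude area."""
    feats = []
    for start in range(0, len(samples) - win + 1, step):
        w = samples[start:start + win]
        mean = w.mean(axis=0)                 # per-axis mean
        std = w.std(axis=0)                   # per-axis standard deviation
        sma = np.abs(w).sum(axis=1).mean()    # signal magnitude area
        feats.append(np.concatenate([mean, std, [sma]]))
    return np.array(feats)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_motion = rng.normal(0.0, 1.0, size=(200, 3))  # stand-in for sensor data
    X = window_features(fake_motion)
    print(X.shape)  # (7, 7): 7 windows, 7 features each
```

Feature matrices of this kind are then typically passed to a classifier trained on labeled activity data.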
With the development of AI, new types of sensors and interactive devices have emerged, enabling new ways of interaction, such as biometrics-based interaction, which includes face recognition, fingerprint recognition, attitude recognition, and so on [
37]. It is sometimes argued that facial expression and tone of voice are also biological signals [
69]. Multimodal HCII systems usually combine technologies for processing active input modes (recognition-based and sensor-based input technologies) with technologies for processing passive input using data from sensors (e.g., biosensors, ambient sensors, etc.) [
5].
In the HCII literature, various wearables have enabled the development of different recognition-based input modalities, such as hand-gesture recognition (e.g., wrist contour sensor device [
75], a wearable band with a 6-axis sensor [
76]), gesture-recognition in the ambient environment (e.g., haptic feedback + camera [
77]), head gesture recognition (e.g., MPU-6050 inertial sensor placed on audio headset [
55]), human body-posture recognition (e.g., accelerometer sensor attached to the chest, waist, or several body parts [
66]), stress-detection (e.g., Empatica E4 wristband [
71,
78], skin conductance sensor mounted on finger [
79], etc.), human–motion recognition (e.g., hierarchical helical yarn (HHY) sensor attached to different positions of the human body [
80], RespiBAN (chest-worn) [
71]). Hand gesture recognition can also be implemented using sensor technologies, such as a leap-motion sensor [
81,
82,
83], accelerometer [
76], RadSense (an end-to-end and unobtrusive system that uses Doppler radar-sensing) [
84], surface electromyogram (sEMG) [
54,
85], and so on.
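For sEMG-based hand gesture recognition in particular, classifiers typically operate on classic time-domain descriptors of a signal window; the sketch below (synthetic signal, standard MAV/RMS/ZC/WL definitions) shows such a feature extraction step and is only an illustration, not a reproduction of any cited system.

```python
# Sketch of classic time-domain features extracted from a surface EMG (sEMG)
# window for hand-gesture recognition; the signal below is synthetic.
import numpy as np

def semg_features(window: np.ndarray, threshold: float = 0.01) -> dict:
    """Mean absolute value (MAV), root mean square (RMS),
    zero crossings (ZC), and waveform length (WL)."""
    diffs = np.diff(window)
    zc = np.sum((window[:-1] * window[1:] < 0) & (np.abs(diffs) > threshold))
    return {
        "mav": float(np.mean(np.abs(window))),
        "rms": float(np.sqrt(np.mean(window ** 2))),
        "zc": int(zc),
        "wl": float(np.sum(np.abs(diffs))),
    }

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    fake_burst = rng.normal(0, 0.2, 400) * np.hanning(400)  # stand-in muscle burst
    print(semg_features(fake_burst))
```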
Biometric sensors provide essential data for implementing various solutions that recognize users’ physiological and psychological states during the interaction, which can be used in various HCI scenarios. Emotion detection can be implemented, for example, by processing data from an electrocardiography (ECG) sensor [
86], galvanic skin response (GSR) sensor [
86], electromyographic (EMG) sensor [
86], photoplethysmography (PPG) sensor [
86,
87,
88], multi-biological sensor (e.g., PolyG-I (LAXTHA Inc., Daejeon, Korea) [
30], BIOPAC MP150 [
89]) providing different physiological signals, including EEG, ECG, EMG, PPG, GSR, and respiration (RESP). Biometric sensors have also been successfully applied for implementing hand gesture-recognition solutions based on the sEMG sensor (e.g., [
54,
85]) and solutions for human-health monitoring [
70].
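As a simple illustration of how such biosignals can be turned into emotion-related features, the sketch below extracts basic statistical descriptors from a synthetic galvanic skin response trace; the feature names and sampling rate are assumptions, and the code sketches the general approach rather than any cited solution.

```python
# Illustrative feature extraction from a galvanic skin response (GSR) trace
# for emotion/arousal detection; the signal here is synthetic.
import numpy as np

def gsr_features(signal: np.ndarray, fs: float = 4.0) -> dict:
    """Compute simple tonic/phasic-style descriptors over a GSR window.
    fs: sampling rate in Hz."""
    diff = np.diff(signal) * fs                     # first derivative (µS/s)
    return {
        "mean_level": float(signal.mean()),         # tonic skin conductance level
        "std_level": float(signal.std()),
        "mean_rise_rate": float(diff[diff > 0].mean()) if (diff > 0).any() else 0.0,
        "peak_count": int(((diff[:-1] > 0) & (diff[1:] <= 0)).sum()),  # crude peak count
    }

if __name__ == "__main__":
    t = np.linspace(0, 60, 240)                     # 60 s at 4 Hz
    rng = np.random.default_rng(1)
    synthetic = 2.0 + 0.3 * np.sin(0.2 * t) + 0.05 * rng.standard_normal(t.size)
    print(gsr_features(synthetic))
```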
Audio- and visual-based input modalities implemented using sensors have been developed as well, such as speech recognition based on the audio signal acquired with a microphone (e.g., [
63,
90,
91]), facial expression recognition based on processing visual data from the camera [
92], human body posture recognition using data from the conventional gray level or color camera, thermal infrared sensor, depth sensor, smart vision sensor [
66], user-movement recognition using Kinect [
72], gesture recognition based on data from depth sensor [
93] and USB camera on a helmet [
94], emotion recognition with a laptop camera [
91], and so on. The Kinect can also be used to implement contact-free stress recognition, as it can provide respiration signals under different breathing patterns [
95]. Eye-trackers have been used to implement various HCII solutions, such as cognitive-load assessment during the interaction [
96], contactless measurement of heart rate variability from pupillary fluctuations [
97], assistant virtual keyboard [
98], adaptive UIs [
96], autism spectrum disorder prediction [
99], etc.
In the HCII literature, several solutions for stress recognition have been proposed that process data from various sensors. The physiological stress response reflects sympathetic nervous system activity, which can be measured by an ECG sensor, a respiration band sensor, and an electrodermal activity (EDA) sensor [
79]. In [
100], the authors proposed a solution for stress recognition based on keystroke dynamics. Several stress-recognition solutions have combined multiple sensors, e.g., visual images from a laptop camera and speech from a laptop microphone [
91], a wrist sensor (accelerometer and skin conductance sensor) [
101], multimodal wearable sensor (EEG, camera, GPS) [
78], a RespiBAN (chest-worn) and Empatica E4 (wrist-worn) sensor (ECG, EDA, EMG, Respiration and Temperature) [
71], webcam, Kinect, EDA, and GPS sensor [
102], BIOPAC MP150 (ECG from wrist, EMG from corrugator muscle, GSR from fingertips) and video from camera for offline analysis [
89], etc.
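The sketch below illustrates, under simplified assumptions, the early-fusion idea behind such multi-sensor stress recognizers: per-sensor feature blocks (here, hypothetical EDA and heart-rate descriptors with synthetic values and labels) are concatenated into one vector and passed to a standard classifier.

```python
# Early-fusion sketch for multi-sensor stress recognition using scikit-learn;
# the feature values and labels below are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 200

# Hypothetical per-window features from two wearable channels.
eda_feats = rng.normal(size=(n, 3))     # e.g., skin-conductance level, slope, peak rate
hr_feats = rng.normal(size=(n, 2))      # e.g., mean heart rate, HRV (RMSSD)
X = np.hstack([eda_feats, hr_feats])    # early fusion: concatenate feature blocks
y = rng.integers(0, 2, size=n)          # 0 = baseline, 1 = stress (synthetic labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")  # ~0.5 on random labels
```

Late (decision-level) fusion, where each sensor stream gets its own classifier and the decisions are then combined, is a common alternative design.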
To enhance the quality of the communication and maximize the user’s well-being during his or her interaction with the computer, the machine must understand the user’s state and respond intelligently and automatically. For this, various data about the user’s behavior and state related to the interaction must be collected and analyzed. For example, understanding users’ emotions can be achieved through various measures, such as subjective self-reports, face tracking, voice analysis, gaze tracking, and the analysis of autonomic and central neurophysiological measurements [
103].
Humans’ emotional reactions during HCI can trigger physiological changes that can be recognized using various modalities, such as facial expressions, facial blood flow, speech, behavior (gesture/posture), and physiological signals. In existing HCII research, behavioral modeling and recognition use various physiological signals, including the electrocardiogram (ECG), electromyogram (EMG), electroencephalogram (EEG), galvanic skin response (GSR), blood volume pulse (BVP), heart rate (HR) or heart rate variability (HRV), temperature (T), and respiration rate (RR) [
41]. Physiological signals reflect the activity of the human body’s central nervous system and autonomic nervous system; they are involuntary reactions and are therefore more objective [
104].
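As one concrete example, heart rate variability is typically derived from the intervals between successive ECG R-peaks; the short sketch below (synthetic RR intervals, standard SDNN and RMSSD definitions) shows the basic computation.

```python
# Sketch of basic heart rate variability (HRV) measures from RR intervals;
# the RR series here is synthetic and given in milliseconds.
import numpy as np

def hrv_measures(rr_ms: np.ndarray) -> dict:
    """SDNN: standard deviation of RR intervals.
    RMSSD: root mean square of successive RR differences."""
    diffs = np.diff(rr_ms)
    return {
        "mean_hr_bpm": float(60000.0 / rr_ms.mean()),
        "sdnn_ms": float(rr_ms.std(ddof=1)),
        "rmssd_ms": float(np.sqrt(np.mean(diffs ** 2))),
    }

if __name__ == "__main__":
    rr = 800 + 40 * np.random.default_rng(3).standard_normal(120)  # ~75 bpm
    print(hrv_measures(rr))
```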
Emotions, for example, can trigger some minor changes in facial blood flow with an impact on skin temperature [
42] and speech [
105]. EEG-based emotion recognition, for example, has become crucial in enabling the HCII [
44] and has been globally accepted in many applications, such as intelligent thinking, decision-making, social communication, feeling detection, affective computing, etc. [
106]. Facial expression recognition can involve using sensors, such as cameras, eye-tracker, ECG, EMG, and EEG [
107]. The emotion recognition process often includes sensors for detecting physiological signals, which are not visible to the human eye and immediately reflect emotional changes [
44]. Current approaches to EEG-based emotion recognition mostly rely on various handcrafted features extracted over relatively long time windows of EEG recorded during the participants’ exposure to appropriate affective stimuli [
103].
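A minimal sketch of such handcrafted EEG features is given below: band power in the standard frequency bands is estimated over a fixed window with Welch’s method; the window length, sampling rate, and band edges are illustrative assumptions.

```python
# Sketch of handcrafted EEG band-power features over a time window,
# a common input to EEG-based emotion classifiers; the signal is synthetic.
import numpy as np
from scipy.signal import welch

BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30), "gamma": (30, 45)}

def band_powers(eeg_window: np.ndarray, fs: float = 128.0) -> dict:
    """Estimate power in standard EEG bands with Welch's method."""
    freqs, psd = welch(eeg_window, fs=fs, nperseg=min(256, len(eeg_window)))
    out = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        out[name] = float(np.trapz(psd[mask], freqs[mask]))  # integrate PSD over band
    return out

if __name__ == "__main__":
    fs = 128.0
    t = np.arange(0, 4, 1 / fs)                      # 4-second window
    signal = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.default_rng(7).standard_normal(t.size)
    print(band_powers(signal, fs))                   # alpha band should dominate
```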
One of the main challenges in HCII is the measurement of physiological signals (or biosignals), where the collection process typically uses invasive sensors that need to be in contact with the human body during recording. However, ongoing research has enabled the use of non-invasive sensors as well. For example, innovative sensors, such as eye-trackers, enable the development of IUIs that can extract valuable and usable patterns of the users’ habits and ways of interaction [
64]. The non-invasive sEMG signal can be used to analyze the active state of the muscles and neural activities and performs well in artificial control, clinical diagnosis, motion detection, and neurological rehabilitation [
54]. Hand gesture recognition can be implemented by using a vision camera or wearable sensors [
108]. Wearable sensors and gesture recognition techniques have been used to develop wearable motion sensors for the hearing- and speech-impaired and wearable gesture-based gadgets for interaction with mobile devices [
109]. Recently, depth-based gesture recognition has received much attention in HCII as well [
110]. With the rapid development of IoT technologies, many intelligent sensing applications have emerged, which realize contactless sensing [
111].
2.3. Artificial Intelligence (AI) Methods and Algorithms for HCII
Artificial intelligence (AI) is one of the most crucial components in the development of HCII and has already significantly impacted how users use and perceive contemporary IUIs. The introduction of affective factors to HCI resulted in an interdisciplinary research field, often called affective computing, which attempts to develop human-aware AI that can perceive, understand, and regulate emotions [
112]. Once computers understand humans’ emotions, AI will rise to a new level [
113]. In the HCII research field, there is an increasing focus on developing emotional AI since emotion recognition using AI is a fundamental prerequisite for improving HCI [
106].
Machine-learning (ML) algorithms and methods can be categorized according to learning style or similarity in form or function. When categorizing based on learning style, ML approaches can be divided into the following three categories (a brief illustration of the first two is given after the list) [
114,
115]:
Supervised learning (SL) algorithms including classification, support vector machine (SVM), discriminant analysis, naïve Bayes (NB), k-nearest neighbor (k-NN), regression, linear regression (LR), ensemble algorithms, decision trees (DT), artificial neural network (ANN), extreme learning machine (ELM), relevance vector machine (RVM), Gaussian processes (GP), combined algorithms, etc.,
Unsupervised learning (UL) algorithms that include clustering, hierarchical ML, unsupervised Gaussian mixture (UGM), hidden Markov model (HMM), k-means, fuzzy c-means, neural networks (NN), etc.,
Reinforcement learning (RL) algorithms that include model-based RL, model-free RL, and RL-based adaptive controllers.
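To ground the first two categories, the short scikit-learn sketch below fits a supervised SVM classifier and an unsupervised k-means clustering on the same synthetic data; it is purely illustrative and does not use any HCII dataset.

```python
# Illustrative contrast between supervised (SVM) and unsupervised (k-means)
# learning using scikit-learn on a small synthetic dataset.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=300, centers=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Supervised learning: labels are used during training.
svm = SVC(kernel="rbf").fit(X_tr, y_tr)
print(f"SVM test accuracy: {svm.score(X_te, y_te):.2f}")

# Unsupervised learning: only the inputs are used; cluster labels are discovered.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(f"k-means cluster sizes: {sorted((kmeans.labels_ == k).sum() for k in range(3))}")
```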
ML algorithms can also be categorized into single method-based algorithms and hybrid method-based ML algorithms [
116]. Single method-based algorithms include fuzzy logic (FL), ANN (e.g., perceptron, multilayer perceptrons (MLP), etc.), deep learning algorithms (DLA) (e.g., convolutional neural networks (CNN), recurrent neural networks (RNN), long short-term memory networks (LSTM), etc.), Bayesian network (BN), genetic algorithm (GA), kernel methods (e.g., SVM), logistic regression (LoR), and DT (e.g., J-48graft, random forest (RF)). Hybrid method-based algorithms, on the other hand, include fuzzy logic and natural language processing (FL-NLP), Bayesian network and recurrent neural network, long short-term memory and neural networks (LSTM-NN), etc.
Based on similarity in terms of the algorithms’ function, ML algorithms can be divided into regression algorithms (e.g., LR, LoR, etc.), instance-based algorithms (e.g., k-NN, SVM, etc.), regularization algorithms (e.g., elastic net), decision tree algorithms (e.g., classification and regression tree, C4.5 and C5.0, etc.), Bayesian algorithms (e.g., NB, Gaussian NB, etc.), clustering algorithms (e.g., k-means, k-medians, etc.), association rule learning algorithms (e.g., the Apriori algorithm), ANN algorithms (e.g., perceptron, MLP, RNN, etc.), dimensionality reduction algorithms (e.g., principal component analysis (PCA)), ensemble algorithms (e.g., AdaBoost), algorithms based on probabilistic models (e.g., Monte Carlo) and probabilistic graphical models (e.g., Bayesian network (BN)), genetic algorithms, fuzzy logic-based algorithms, and other ML algorithms (e.g., feature selection algorithms, optimization algorithms, etc.).
In existing HCII-related research, various methods and algorithms have been proposed for the task of facial expression classification, including SVM, k-NN, NN, rule-based classifiers, and BN [
51]. For facial emotion recognition, for example, CNN was recognized as an effective method that can perform feature extraction and classification simultaneously and automatically discover multiple levels of representations in data [
50]. Audio-video emotion recognition, for example, is now researched and developed with deep neural network modeling tools [
117]. The speech emotion recognition field has received considerable attention in digital signal processing research. Researchers have developed different methods and algorithms for analyzing the emotional condition of an individual user, focusing on emotion classification from salient acoustic features of speech. Most researchers in speech emotion recognition have applied handcrafted features and machine learning techniques to recognize emotions in speech [
46]. In existing speech emotion recognition research, classical ML classifiers have been used, such as the Markov model (MM), Gaussian mixture model (GMM), and SVM [
118]. Existing research has demonstrated that DLAs effectively extract robust and salient features from datasets [
46]. After the breakthrough of ANNs, and especially CNNs, the neural approach has become the dominant one for creating intelligent computer vision systems [
119]. CNN is currently the most widely used deep-learning model for image recognition [
63].
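As a purely illustrative sketch of the kind of CNN commonly used for facial expression recognition, the PyTorch model below maps a 48×48 grayscale face crop (the format popularized by datasets such as FER2013) to seven expression classes; the layer sizes are assumptions rather than a reproduction of any cited architecture.

```python
# Minimal CNN sketch for facial expression classification in PyTorch;
# input size (48x48 grayscale) and 7 classes follow common practice,
# but the architecture itself is only an illustrative assumption.
import torch
import torch.nn as nn

class ExpressionCNN(nn.Module):
    def __init__(self, n_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 48x48 -> 24x24
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 24x24 -> 12x12
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 12 * 12, 128), nn.ReLU(),
            nn.Linear(128, n_classes),            # logits for 7 expressions
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

if __name__ == "__main__":
    model = ExpressionCNN()
    faces = torch.randn(8, 1, 48, 48)             # a batch of fake face crops
    logits = model(faces)
    print(logits.shape)                           # torch.Size([8, 7])
```

In practice, such a model would be trained on a labeled facial expression dataset and typically combined with face detection and data augmentation.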
This study focuses on the AI methods and techniques for HCII solutions available in the existing literature. We are interested in both HCII solutions validated using data from sensor technology and those validated using data from publicly available databases.
2.4. Related Studies
In the existing HCI and HCII literature, several systematic literature reviews (SLR) and systematic mapping studies (SMS) have been conducted in the last ten years. This indicates that the large body of work has encouraged researchers to create a joint knowledge base in this field. Although there are some parallels between the existing SLR and SMS studies and our study, as the existing studies also dealt with specific topics related to HCI and HCII, certain differences make our study the first of its kind in HCII.
We see the first difference when reviewing the keywords with which existing SLR and SMS studies systematically acquired and analyzed the existing literature. The most common keywords were HCI (in 7 studies), followed by AI (in 3 studies), IoT (in 3 studies), and EMG (in 3 studies). Others were IUI (in 2 studies), robustness (in 2 studies), accuracy (in 2 studies), smart home (in 2 studies), and deep learning (in 2 studies), followed by over 40 different keywords particular to different domains, indicating that HCI is an important research field and is integrated into various domains. Furthermore, none of the existing studies aimed to analyze the HCII and IUI literature in general and provide an overview of the sensor technology and machine-learning methods and algorithms used for HCII developments.
In [
9], the authors investigated how to approach the design of human-centric IT systems and what they represent. A study published ten years later [
10] was still questioning what is actually deemed intelligent, surprisingly still examining a question that should have been answered a decade earlier. In one of the most extensive reviews of the literature related to the HCI field [
120], the authors analyzed 3243 articles, examining publication growth, geographical distribution, citation analysis, research productivity, and keywords, and identified the following five clusters: (1) UI for user-centric design, (2) HCI, (3) interaction design, (4) intelligent interaction recognition research, and (5) e-health and health information. The main conclusion in [
120] was that the research in this field has little consistency, that researchers tend to move on to newer technologies, and that there is no accumulated knowledge. A similar attempt at visualizing popular clusters for a specific decade was also observed in [
121].
Regardless of how potentially unorganized research in HCI might be, the benefits of its use are evident in several other research studies. The benefits of HCI solutions are the main focus of various SLR and SMS studies (e.g., [
11,
12,
122]). For example, the authors in [
12] investigated wearable devices, arguing that they have brought people more convenience and assistance than ever before. The findings are supported by the results of the study conducted in [
11], where the authors addressed the benefits of wearable devices for the aging population with chronic diseases, potentially reducing social and economic burdens. The importance of HCI in healthcare was further analyzed in [
13,
122], where the research was focused on people with disabilities and related health problems. The role of AI technology for activity recognition, data processing, decision making, image recognition, prediction making, and voice recognition in smart home interactive solutions was also analyzed [
17]. Some existing SLR studies are also focused on the general use of HCI solutions, providing support in healthcare [
122], smart living [
17], or understanding human emotions from speech [
18].
The second focus of published SLR and SMS studies is sensors, signals, and the intelligent use of different devices. The authors in [
12] investigated types of wearable devices for general users. At the same time, [
13] addressed the benefits of ambient-assisted living and IUI for people with special needs, concluding how important it is to design user-friendly interfaces to provide an excellent HCI mechanism that fits the needs of all users. A similar effort was made in [
123], only this time including the general user population, indicating that several “general solutions” do not always have user-friendly interfaces, that their level of intelligence is low, and that they need to be improved.
The importance of well-designed and well-interpreted HCII can be presented as the third focus of studies. For example, in [
37], the authors recognized new HCI scenarios, such as smart homes and driverless cars. In [
124], augmented reality (AR) and the third generation of AI technology are investigated. However, both studies lack a systematic review (the lower number of literature units indicates limited research space in these specific topics). The risk of misinterpreting signals and the related risks are addressed in [
15] (EMG) and in [
16] (EEG), both claiming there are several issues in this area. The review of [
15] focuses on deep learning in EMG decoding, while [
16] sets out to identify good practices within existing research.
The benefits of combining IoT and HCI are addressed in the context of IoT system applications, emphasizing the influence of human factors when using HCI [
125]. The authors in [
125] address the advantages of information visualization, cognition, and human trust in intelligent systems. In contrast, the authors in [
126] present a unified framework for deriving and analyzing adaptive and scalable network design and resource allocation schemes for IoT.
In the last decade, there has been an emphasis on developing solutions for mobile devices. SLRs related to the HCII field on mobile devices focus on mobile emotion recognition methods, primarily but not exclusively addressing smartphones. The authors in [
21] deliver a systematic overview of publications from the past ten years addressing smartphone emotion recognition, providing a detailed presentation of 75 studies. Meyer et al. [
20] also analyzed the existing research field of mobile emotion measurement and recognition. In their literature review, they focused on optical emotion recognition or face recognition, acoustic emotion recognition or speech recognition, behavior-based emotion recognition or gesture recognition, and vital-data-based emotion recognition or biofeedback recognition. Research conducted by Tzafilkou et al. [
19] addressed the use of non-intrusive mobile sensing methodologies for emotion recognition on smartphone devices, narrowing the timescale and number of papers even further: 30 articles from the past six years. Similar to the findings of our study, the authors identified a peak of papers published in 2016 and 2017. Based on results that revealed the main research trends and gaps in the field, the authors discussed research challenges and practical implications for the design of emotion-aware systems in the context of distance education.