
Special Issue on Deep Learning for Intelligent Human Computer Interaction

Published: 25 September 2023
Deep Learning (DL) is growing at a fast pace, with a plethora of related research being conducted and industrial applications being developed. Robotics, computer graphics, computer vision (including areas such as feature extraction/matching and 3D reconstruction), manufacturing, medicine, knowledge acquisition, control theory, and planning and scheduling, among others, have uncovered the potential of DL. The reason for such success mainly stems from two aspects. On the one hand, related disciplines such as optimization methods, pattern recognition, and hardware (GPUs) have undergone great theoretical breakthroughs. On the other hand, technical applications in industry have addressed many real-world problems that have in turn accelerated DL development. Leading technology companies like Microsoft, Apple, Google, Facebook, Nvidia, and Amazon have all launched cutting-edge industrial products offering advanced functionalities that were only made possible by DL algorithms. The effort in establishing new media and network technologies has led to the identification of a niche for DL in the field of human-computer interaction.
Human-computer interaction (HCI) is a multidisciplinary field of study focusing on the design of computer technology through the analysis and understanding of how humans interact with computers. HCI addresses various areas, such as user interface design, computer-supported cooperative work for data modeling, system adaptation and optimization to fit user needs, complex systems, the development of smart equipment to ease utilization, and the design of smart environments to promote user comfort and safety. Nowadays, DL models are widely employed in HCI to model human behavior, as they have been shown to outperform traditional modeling techniques. Although their potential is undoubtedly broad, several important research challenges remain unaddressed. The explainability and interpretability of the decision-making process of DL approaches is key to assessing models' behaviors, especially when unexpected or incorrect outputs occur; the complexity and dimensionality of the underlying mathematical models are among the principal causes that hinder understanding. Privacy must be guaranteed, as sensitive private data is typically continually collected, processed, and stored in the Cloud in order to provide improved user-specific experiences. Such continual learning is itself a challenge for DL approaches, as models may suffer from catastrophic forgetting, but it is a key ingredient that must be taken into consideration to enable scalability. Data annotation at large scale is costly and hardly sustainable; therefore, particular attention must be paid to self-supervised or unsupervised learning mechanisms. DL-based user interfaces, such as gesture recognition, must also be supported in order to offer users seamless and natural means of interaction. Moreover, user behavior prediction is fundamental to enable fully-fledged anticipatory computing systems.
The objective of this special issue is to generate a comprehensive understanding of deep learning in the field of intelligent HCI, for both theoretical and practical applications. After a rigorous review, 14 articles have been accepted for this special issue. A summary of the articles follows:
In the article “A Novel GAPG Approach to Automatic Property Generation for Formal Verification: The GAN Perspective”, Gao et al. propose a GAN-based automatic property generation (GAPG) approach to generate verification properties that support model checking. First, verification properties in the form of computation tree logic (CTL) are encoded and used as input to the GAN. Second, regular expressions are introduced as grammar rules to check the correctness of the generated properties; these rules detect and filter out meaningless properties, which arise because the GAN learning process is uncontrollable and may generate properties unsuitable for real applications. Third, the learning network is further trained using labeled information associated with the input properties.
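As a rough, hypothetical illustration of the grammar-checking idea (not the authors' actual rules), a regular expression can be used to accept only candidate strings that follow a simple CTL-like shape; the pattern and the example properties below are invented for illustration.

import re

# Hypothetical grammar rule: a well-formed property starts with a path
# quantifier/temporal operator pair (AG, AF, EG, EF) followed by a
# parenthesized comparison. A toy stand-in for the paper's rules.
CTL_PATTERN = re.compile(r"^(AG|AF|EG|EF)\s*\(\s*\w+\s*(==|!=|<|>)\s*\w+\s*\)$")

def filter_properties(candidates):
    """Keep only candidate properties that match the grammar rule."""
    return [p for p in candidates if CTL_PATTERN.match(p)]

# GAN-generated candidates, some malformed and therefore filtered out.
generated = ["AG (state == safe)", "EF (count > 10)", "AG state ==", "XY (foo)"]
print(filter_properties(generated))  # ['AG (state == safe)', 'EF (count > 10)']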
The authors of the article titled “Meta-MMFNet: Meta-Learning Based Multi-Model Fusion Network for Micro-Expression Recognition” propose a meta-learning-based multi-model fusion network (Meta-MMFNet) to address existing problems in automatic micro-expression recognition. The proposed method is built on a metric-based meta-learning pipeline, which is specifically designed for few-shot learning and is suitable for model-level fusion. Frame-difference and optical-flow features are fused, deep features are extracted from the fused feature, and finally a weighted-sum model fusion method is applied within the meta-learning-based framework for micro-expression classification.
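The model-level fusion step can be pictured as a weighted sum of the branches' class scores. The minimal sketch below assumes two branches and a fixed weight; in Meta-MMFNet the fusion operates within the meta-learning framework and its details differ.

import numpy as np

def fuse_predictions(logits_a, logits_b, w=0.6):
    """Weighted-sum fusion of two models' class logits (w is illustrative)."""
    return w * logits_a + (1.0 - w) * logits_b

logits_frame_diff = np.array([0.2, 1.5, -0.3])   # frame-difference branch
logits_optical_flow = np.array([0.8, 0.4, 0.1])  # optical-flow branch
fused = fuse_predictions(logits_frame_diff, logits_optical_flow)
print(fused.argmax())  # index of the predicted micro-expression class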
In the article titled “An Efficient and Accurate GPU-based Deep Learning Model for Multimedia Recommendation”, Djenouri et al. propose the use of deep learning in human-computer interaction and present a new explainable hybrid framework for recommending relevant hashtags on a set of orpheline tweets (in the authors' terminology, a tweet with hashtags). The approach starts by determining the set of batches used in the convolutional neural network based on frequent pattern mining solutions. The convolutional neural network is then applied to the batches of tweets to learn their hashtags. An optimization strategy is proposed to perform the learning process accurately by reducing the number of frequent patterns. Moreover, eXplainable AI (XAI) is introduced for hashtag recommendation by analyzing user preferences and understanding the different weights of the deep learning model used in the learning process.
In the article “Explanation-driven HCI Model to Examine the Mini-Mental State for Alzheimer's Disease”, the authors propose a robust explainable HCI model using SHapley Additive exPlanations (SHAP), Local Interpretable Model-agnostic Explanations (LIME), and machine learning algorithms: logistic regression (80.87% accuracy), support vector machine (85.8%), k-nearest neighbor (87.24%), multilayer perceptron (91.94%), and decision tree (100%). Combining these algorithms with explainability can help explore untapped avenues for research in medical sciences that can mould the future of HCI models. The outcomes of the proposed model show high prediction accuracy, bringing efficient computer interfaces to decision making, and suggest a high level of relevance to medical and clinical research.
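To make the comparison concrete, the sketch below fits the five listed classifiers with scikit-learn on synthetic data standing in for the cognitive-assessment features; the dataset, feature count, and resulting accuracies are placeholders, not the study's.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for Mini-Mental State Examination features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "support vector machine": SVC(),
    "k-nearest neighbor": KNeighborsClassifier(),
    "multilayer perceptron": MLPClassifier(max_iter=1000),
    "decision tree": DecisionTreeClassifier(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, model.score(X_te, y_te))  # held-out accuracy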
In the article “Automatic Assessment of Depression and Anxiety through Encoding Pupil wave from HCI in VR Scenes”, the pupil wave, a physiological signal collected through human-computer interaction that can directly represent the emotional state, is used for the first time to assess the level of depression and anxiety. To distinguish between different depression and anxiety levels, the authors use an HCI method to induce the participants' emotional experience through three virtual reality (VR) emotional scenes (joyful, sad, and calm) and construct two differential pupil waves, joyful and sad, with the calm pupil wave as the baseline. Correspondingly, a dual-channel fusion model for assessing depression and anxiety levels is constructed using an improved multi-scale convolution module and the authors' proposed width-channel attention module for one-dimensional signal processing. This study provides an automatic assessment technique based on human-computer interaction and virtual reality for mental health physical examination.
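A minimal PyTorch sketch of the dual-channel idea is given below: one 1D convolutional branch per differential pupil wave, fused by concatenation before classification. Layer sizes are invented, and the multi-scale convolution and width-channel attention modules are omitted, so this is a toy stand-in rather than the paper's architecture.

import torch
import torch.nn as nn

class DualChannelFusion(nn.Module):
    """Toy two-branch 1D CNN over joyful/sad differential pupil waves."""
    def __init__(self, num_levels=3):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv1d(1, 8, kernel_size=5, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool1d(16),
            )
        self.joyful, self.sad = branch(), branch()
        self.head = nn.Linear(2 * 8 * 16, num_levels)  # severity levels

    def forward(self, wave_joyful, wave_sad):
        a = self.joyful(wave_joyful).flatten(1)
        b = self.sad(wave_sad).flatten(1)
        return self.head(torch.cat([a, b], dim=1))

model = DualChannelFusion()
x = torch.randn(4, 1, 256)  # batch of one-dimensional pupil-wave signals
print(model(x, x).shape)    # torch.Size([4, 3])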
In the article “Spontaneous Facial Behavior Analysis using Deep Transformer Based Framework for Child–Computer Interaction”, the authors analyze children's spontaneous behavior using multimodal facial expressions and voice signals. They present a multimodal transformer-based last-feature-fusion approach for facial behavior analysis in children that extracts contextualized representations from RGB video sequences and Hematoxylin and eosin video sequences, and then applies pairwise concatenations of the contextualized representations through a cross-feature fusion technique to predict users' emotions. In addition, they perform t-distributed stochastic neighbor embedding (t-SNE) visualization to display the discriminative features in a lower-dimensional space, and probability density estimation to visualize the prediction capability of their proposed model.
In the article “Full-body Human Motion Reconstruction with Sparse Joint Tracking Using Flexible Sensors”, Chen et al. propose a novel framework that can accurately predict human joint angles from the signals of only four flexible sensors, thereby tracking human joints across multiple degrees of freedom. Specifically, they mitigate the collapse to average poses by implementing the model with a Bi-LSTM neural network that makes full use of short-time sequence information, and they reduce jitter by adding a median pooling layer to the network, which smooths consecutive motions. Although flexible sensors are bio-compatible and ideal for improving the wearing experience, they are prone to aging, which increases prediction errors. Observing that the aging of flexible sensors usually results in drifts of their resistance ranges, the authors further propose a novel dynamic calibration technique to rescale sensor ranges, which further improves prediction accuracy.
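The dynamic calibration idea can be illustrated as a linear re-mapping of a drifted resistance range onto the reference range observed at training time; the ranges and readings in this minimal sketch are invented.

import numpy as np

def rescale(signal, old_min, old_max, new_min, new_max):
    """Linearly map readings from the drifted range to the reference range."""
    return new_min + (signal - old_min) * (new_max - new_min) / (old_max - old_min)

aged_readings = np.array([510.0, 540.0, 575.0])  # drifted resistance values
calibrated = rescale(aged_readings, old_min=500, old_max=600,
                     new_min=400, new_max=520)
print(calibrated)  # readings mapped back onto the training-time range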
In the article “Self Supervised Learning of Depth and Ego-motion for 3D Perception in Human Computer Interaction”, Qiao et al. propose a novel self-supervised method based on a Vision Transformer (ViT) with a Convolutional Neural Network (CNN) architecture, referred to as ViT-Depth. Image reconstruction losses, computed from the estimated depth and motion between adjacent frames, are treated as the supervision signal to establish a self-supervised learning pipeline. This is an effective solution for tasks that need accurate and low-cost 3D perception, such as autonomous driving, robotic navigation, and 3D reconstruction. The method leverages the ability of both the CNN and the Transformer to extract deep features and capture global contextual information. In addition, the authors propose a cross-frame loss that constrains photometric error and scale consistency across multiple frames, which makes the training process more stable and improves performance.
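The supervision signal can be sketched as a photometric reconstruction error between the current frame and an adjacent frame warped into the current view using the predicted depth and motion. The snippet below shows only an L1 term on placeholder tensors; real pipelines typically blend L1 with SSIM, and the warping step itself is omitted.

import torch

def photometric_loss(warped, target):
    """Mean L1 photometric error between a warped frame and the target."""
    return (warped - target).abs().mean()

target = torch.rand(1, 3, 64, 64)  # current frame
warped = torch.rand(1, 3, 64, 64)  # adjacent frame warped into current view
print(photometric_loss(warped, target).item())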
In the article titled “A Deep Graph Network with Multiple Similarity for User Clustering in Human-Computer Interaction”, Kang et al. focus on improving the clustering performance of users' attributes in HCI and propose a deep graph embedding network with feature and structure similarity (DGENFS) to cluster users' attributes in HCI applications. The DGENFS model consists of a Feature Graph Autoencoder (FGA) module, a Structure Graph Attention Network (SGAT) module, and a Dual Self-Supervision (DSS) module. The FGA and SGAT modules extract representations of human features and of the topological space, respectively. To learn cluster-friendly features, the DSS module unifies and integrates the features learned from the FGA and SGAT modules, exploring high-confidence cluster assignments as soft labels to guide the optimization of the entire network.
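The high-confidence soft-label mechanism can be sketched in the spirit of deep embedded clustering (not necessarily DGENFS's exact formulation): soft assignments q of embeddings to cluster centroids are sharpened into a target distribution p that emphasizes confident points and guides training.

import torch

def soft_assign(z, centroids):
    """Student's t similarity between embeddings and cluster centroids."""
    q = 1.0 / (1.0 + torch.cdist(z, centroids) ** 2)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Sharpen q: square assignments, normalize per cluster, then per row."""
    w = q ** 2 / q.sum(dim=0)
    return w / w.sum(dim=1, keepdim=True)

z = torch.randn(100, 16)        # learned user embeddings (placeholder)
centroids = torch.randn(5, 16)  # cluster centers (placeholder)
q = soft_assign(z, centroids)
p = target_distribution(q)      # soft labels guiding network optimization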
In the article “A Study of Human-AI Symbiosis for Creative Work: Recent Developments and Future Directions in Deep Learning”, Mahmud et al. investigate a promising direction of AI development: instead of using AI to replace people, AI should team up with people so that both can work better and smarter. Research on human-AI collaborative environments is extensively surveyed and analyzed to reveal the progress in this field. Although the teaming of humans and machines involves many complex tasks, development has been very promising. One of the main goals in this field is to develop machines capable of being successful teammates with a human partner. The authors conclude that teaming humans with AI, particularly deep learning, has the advantage of combining the power of AI with human domain expertise to improve performance and create value. Human-AI symbiosis could be a promising future direction for AI's continuing integration into the world.
In the article titled “PAINT: Photo-realistic Fashion Design Synthesis”, the authors investigate a new problem: generating a variety of multi-view fashion designs conditioned on a human pose and texture examples of arbitrary sizes, which can replace the repetitive and low-level design work of fashion designers. To solve this challenging multi-modal image translation problem, they propose a novel Photo-reAlistic fashIon desigN synThesis (PAINT) framework, decomposed into three manageable stages. In the first stage, a Layout Generative Network (LGN) transforms an input human pose into a series of person semantic layouts. In the second stage, a Texture Synthesis Network (TSN) synthesizes textures on all transformed semantic layouts; specifically, the authors design a novel attentive texture transfer mechanism for precisely expanding texture patches to the irregular clothing regions of the target fashion designs. In the third stage, an Appearance Flow Network (AFN) generates fashion design images from other viewpoints given a single-view observation, by learning 2D multi-scale appearance flow fields.
In the article “Unsupervised Domain Adaptation by Causal Learning for Biometric Signal based HCI”, Dai et al. propose a novel unsupervised domain adaptation approach for biometric-signal-based HCI via causal representation learning. Specifically, three kinds of interventions on biometric signals (i.e., subjects, sessions, and trials) can be selected to generalize deep models across the selected intervention. In the proposed approach, a generative model is trained to produce intervened features, which are subsequently used for learning transferable and causal relations under three modes. Experiments on an EEG-based emotion recognition task and an sEMG-based gesture recognition task confirm the superiority of the approach.
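As a loose, hypothetical stand-in for selecting an intervention variable (subject, session, or trial), one can hold out one value of that variable at a time and require the model to generalize across it; the split logic below is illustrative only and does not reproduce the paper's generative, causal-representation approach.

import numpy as np

def leave_one_out_splits(groups):
    """Yield (held-out value, train indices, test indices) per group value."""
    for held_out in np.unique(groups):
        yield (held_out,
               np.where(groups != held_out)[0],
               np.where(groups == held_out)[0])

subjects = np.array([0, 0, 1, 1, 2, 2])  # subject label per recording
for held_out, tr, te in leave_one_out_splits(subjects):
    print(f"intervene on subject {held_out}: train={tr}, test={te}")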
In the article titled “Realtime Recognition of Dynamic Hand Gestures in Practical Applications”, Xiao et al. propose two methods to avoid dynamic hand gesture misrecognition: introducing an activation delay and using asymmetric gesture design. First, the temporal process of a dynamic hand gesture is decomposed and redefined; then a real-time dynamic hand gesture recognition system is built using a two-dimensional convolutional neural network. A user study is conducted to investigate the influence of activation delay and asymmetric gesture design on system performance. The two methods proposed in this article can provide valuable guidance for researchers designing real-time recognition systems for practical applications.
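The activation-delay idea can be sketched as a simple temporal filter: a gesture fires only once the classifier has produced the same label for several consecutive frames. The window length below is an invented placeholder; the paper determines suitable delays through its user study.

from collections import deque

class ActivationDelay:
    """Suppress gesture output until the prediction is stable."""
    def __init__(self, window=3):
        self.history = deque(maxlen=window)

    def update(self, label):
        self.history.append(label)
        if len(self.history) == self.history.maxlen and len(set(self.history)) == 1:
            return label  # stable prediction: activate the gesture
        return None       # still within the delay: suppress output

delay = ActivationDelay(window=3)
for frame_label in ["swipe", "swipe", "none", "swipe", "swipe", "swipe"]:
    print(delay.update(frame_label))  # fires only on the final frame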
In the article “Hierarchical Multi-Attention Transfer for Knowledge Distillation”, Gou et al. propose a hierarchical multi-attention transfer framework (HMAT), in which different types of attention are used to transfer knowledge at different levels of deep representation learning for knowledge distillation. Specifically, position-based and channel-based attention knowledge characterizes the knowledge from low-level and high-level feature representations, respectively, while activation-based attention knowledge characterizes the knowledge from both mid-level and high-level feature representations.
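A minimal sketch of activation-based attention transfer (in the style of Zagoruyko and Komodakis, one ingredient HMAT builds on alongside position- and channel-based attention) is shown below: each feature map is reduced to a spatial attention map, and the student is trained to match the teacher's maps. The feature tensors are random placeholders.

import torch
import torch.nn.functional as F

def attention_map(feat):
    """Spatial attention: channel-wise sum of squared activations, L2-normalized."""
    return F.normalize(feat.pow(2).sum(dim=1).flatten(1), dim=1)

def attention_transfer_loss(feat_student, feat_teacher):
    """MSE between student and teacher attention maps."""
    return (attention_map(feat_student) - attention_map(feat_teacher)).pow(2).mean()

f_s = torch.randn(8, 64, 14, 14)   # student mid-level features
f_t = torch.randn(8, 128, 14, 14)  # teacher features (channel counts may differ)
print(attention_transfer_loss(f_s, f_t).item())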
In closing, the guest editors would like to thank all the authors, who contributed significantly to this special issue, the reviewers for their constructive reviews and their efforts in respecting deadlines, and the editorial team of the journal for their professional support and collaboration.
Zhihan Lv
Uppsala University, Sweden
Fabio Poiesi
Fondazione Bruno Kessler, Italy
Qi Dong
Amazon AWS AI, USA
Jaime Lloret
Universitat Politecnica de Valencia, Valencia, Spain
Houbing Song
Embry-Riddle Aeronautical University, USA
Guest Editors

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 2
February 2024, 548 pages
EISSN: 1551-6865
DOI: 10.1145/3613570
Editor: Abdulmotaleb El Saddik

Publisher

Association for Computing Machinery, New York, NY, United States
