Inspired by the success of the Transformer architecture in natural language processing and computer vision, we investigate the use of Transformers in Reinforcement Learning (RL), specifically in modeling the environment's dynamics using Transformer Dynamics Models (TDMs). We evaluate the capabilities of TDMs for continuous control in real-time planning scenarios with Model Predictive Control (MPC). While Transformers excel in long-horizon prediction, their tokenization mechanism and autoregressive nature lead to costly planning over long horizons, especially as the environment's dimensionality increases. To alleviate this issue, we use a TDM for short-term planning, and learn an autoregressive discrete Q-function using a separate Q-Transformer (QT) model to estimate a long-term return beyond the short-horizon planning. Our proposed method, QT-TDM, integrates the robust predictive capabilities of Transformers as dynamics models with the efficacy of a model-free Q-Transformer to mitigate the computational burden associated with real-time planning. Experiments in diverse state-based continuous control tasks show that QT-TDM is superior in performance and sample efficiency compared to existing Transformer-based RL models while achieving fast and computationally efficient inference.
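The planning loop described above can be illustrated compactly. The sketch below is a minimal, generic version of short-horizon MPC with a terminal Q-value bonus, assuming placeholder callables `tdm` (learned dynamics), `reward_fn` (predicted reward), and `q_fn` (long-term value); it uses simple random shooting rather than the paper's exact planner and is not the authors' implementation.

```python
import numpy as np

def plan_action(state, tdm, reward_fn, q_fn, action_dim,
                horizon=5, num_candidates=256, rng=None):
    """Short-horizon MPC: score sampled action sequences by the sum of
    predicted rewards plus a terminal Q-value estimate, execute first action."""
    rng = rng or np.random.default_rng()
    actions = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, action_dim))
    states = np.repeat(state[None, :], num_candidates, axis=0)
    returns = np.zeros(num_candidates)
    for t in range(horizon):
        returns += reward_fn(states, actions[:, t])   # short-term predicted reward
        states = tdm(states, actions[:, t])           # predicted next states
    returns += q_fn(states, actions[:, -1])           # value beyond the planning horizon
    best = int(np.argmax(returns))
    return actions[best, 0]                           # MPC: apply only the first action
```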
Architectures for vision-based robot manipulation often utilize separate domain adaptation models to allow sim-to-real transfer and an inverse kinematics solver to allow the actual policy to operate in Cartesian space. We present a novel end-to-end visuomotor architecture that combines domain adaptation and inherent inverse kinematics in one model. Using the same latent encoding, it jointly learns to reconstruct canonical simulation images from randomized inputs and to predict the corresponding joint angles that minimize the Cartesian error towards a depicted target object via differentiable forward kinematics. We evaluate our model in a sim-to-real grasping experiment with the NICO humanoid robot by comparing different randomization and adaptation conditions both directly and with additional real-world finetuning. Our combined method significantly increases the resulting accuracy and allows a finetuned model to reach a success rate of 80.30%, outperforming a real-world model trained with six times as much real data.
We address the Continual Learning (CL) problem, where a model has to learn a sequence of tasks from non-stationary distributions while preserving prior knowledge as it encounters new experiences. With the advancement of foundation models, CL research has shifted focus from the initial learning-from-scratch paradigm to the use of generic features from large-scale pre-training. However, existing approaches to CL with pre-trained models only focus on separating the class-specific features from the final representation layer and neglect the power of intermediate representations that capture low- and mid-level features naturally more invariant to domain shifts. In this work, we propose LayUP, a new class-prototype-based approach to continual learning that leverages second-order feature statistics from multiple intermediate layers of a pre-trained network. Our method is conceptually simple, does not require any replay buffer, and works out of the box with any foundation model. LayUP improves over the state-of-the-art on four of the seven class-incremental learning settings at a considerably reduced memory and computational footprint compared with the next best baseline. Our results demonstrate that fully exhausting the representational capacities of pre-trained models in CL goes far beyond their final embeddings.
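As a rough illustration of classification with second-order feature statistics over concatenated intermediate-layer features, the sketch below maintains a feature Gram matrix and per-class prototypes and classifies with a ridge-regularized linear readout; the class and variable names are illustrative and this is not the LayUP implementation.

```python
import numpy as np

class SecondOrderPrototypeClassifier:
    """Class prototypes plus an inverse feature Gram matrix (ridge-regularized),
    computed over features concatenated from multiple network layers."""
    def __init__(self, feat_dim, num_classes, ridge=1e-2):
        self.G = np.zeros((feat_dim, feat_dim))      # second-order statistics
        self.C = np.zeros((num_classes, feat_dim))   # per-class prototype sums
        self.ridge = ridge

    def update(self, feats, labels):
        # feats: (N, feat_dim) concatenated multi-layer features, labels: (N,) ints
        self.G += feats.T @ feats
        for f, y in zip(feats, labels):
            self.C[y] += f

    def predict(self, feats):
        W = np.linalg.solve(self.G + self.ridge * np.eye(self.G.shape[0]), self.C.T)
        return np.argmax(feats @ W, axis=1)
```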
Imitation can allow us to quickly gain an understanding of a new task. Through a demonstration, we can gain direct knowledge about which actions need to be performed and which goals they serve. In this paper, we introduce a new approach to imitation learning that tackles the challenges of a robot imitating a human, such as the change in perspective and body schema. Our approach can use a single human demonstration to abstract information about the demonstrated task, and use that information to generalise and replicate it. We enable this ability through a new integration of two state-of-the-art methods: a diffusion action segmentation model to abstract temporal information from the demonstration and an open vocabulary object detector for spatial information. Furthermore, we refine the abstracted information and use symbolic reasoning to create an action plan utilising inverse kinematics, to allow the robot to imitate the demonstrated action.
Recent advancements in large language models have showcased their remarkable generalizability across various domains. However, their reasoning abilities still have significant room for improvement, especially when confronted with scenarios requiring multi-step reasoning. Although large language models possess extensive knowledge, their reasoning often fails to effectively utilize this knowledge to establish a coherent thinking paradigm. These models sometimes show hallucinations as their reasoning procedures are unconstrained by logical principles. Aiming at improving the zero-shot chain-of-thought reasoning ability of large language models, we propose LoT (Logical Thoughts), a self-improvement prompting framework that leverages principles rooted in symbolic logic, particularly Reductio ad Absurdum, to systematically verify and rectify the reasoning processes step by step. Experimental evaluations conducted on language tasks in diverse domains, including arithmetic, commonsense, symbolic, causal inference, and social problems, demonstrate the efficacy of enhanced reasoning by logic. The implementation code for LoT can be accessed at: https://github.com/xf-zhao/LoT
This paper introduces a novel zero-shot motion planning method that allows users to quickly design smooth robot motions in Cartesian space. A Bézier curve-based Cartesian plan is transformed into a joint space trajectory by our neuro-inspired inverse kinematics (IK) method CycleIK, for which we enable platform independence by scaling it to arbitrary robot designs. The motion planner is evaluated on the physical hardware of the two humanoid robots NICO and NICOL in a human-in-the-loop grasping scenario. Our method is deployed with an embodied agent that is a large language model (LLM) at its core. We generalize the embodied agent, which was introduced for NICOL, to also be embodied by NICO. The agent can execute a discrete set of physical actions and allows the user to verbally instruct various robots. We contribute a grasping primitive to its action space that allows for precise manipulation of household objects. The new CycleIK method is compared to popular numerical IK solvers and state-of-the-art neural IK methods in simulation and is shown to be competitive with or to outperform all evaluated methods when the algorithm runtime is very short. The grasping primitive is evaluated on both the NICOL and NICO robots with reported grasp success rates of 72% and 82%, respectively.
Intention-based Human-Robot Interaction (HRI) systems allow robots to perceive and interpret user actions to proactively interact with humans and adapt to their behavior. Therefore, intention prediction is pivotal in creating a natural interactive collaboration between humans and robots. In this paper, we examine the use of Large Language Models (LLMs) for inferring human intention during a collaborative object categorization task with a physical robot. We introduce a hierarchical approach for interpreting user non-verbal cues, such as hand gestures, body poses, and facial expressions, and combining them with environment states and user verbal cues captured using an existing Automatic Speech Recognition (ASR) system. Our evaluation demonstrates the potential of LLMs to interpret non-verbal cues and to combine them with their context-understanding capabilities and real-world knowledge to support intention prediction during human-robot interaction.
Humanoid robots can benefit from their similarity to the human shape by learning from humans. When humans teach other humans how to perform actions, they often demonstrate the actions and the learning human can try to imitate the demonstration. Being able to mentally transfer from a demonstration seen from a third-person perspective to how it should look from a first-person perspective is fundamental for this ability in humans. As this is a challenging task, it is often simplified for robots by creating a demonstration in the first-person perspective. Creating these demonstrations requires more effort but allows for an easier imitation. We introduce a novel diffusion model aimed at enabling the robot to directly learn from the third-person demonstrations. Our model is capable of learning and generating the first-person perspective from the third-person perspective by translating the size and rotations of objects and the environment between two perspectives. This allows us to utilise the benefits of easy-to-produce third-person demonstrations and easy-to-imitate first-person demonstrations. The model can either represent the first-person perspective in an RGB image or calculate the joint values. Our approach significantly outperforms other image-to-image models in this task.
Although there has been rapid progress in endowing robots with the ability to solve complex manipulation tasks, generating control policies for bimanual robots to solve tasks involving two hands is still challenging because of the difficulties in effective temporal and spatial coordination. With emergent abilities in terms of step-by-step reasoning and in-context learning, Large Language Models (LLMs) have been applied to control a variety of robotic tasks. However, the nature of language communication via a single sequence of discrete symbols makes LLM-based coordination in continuous space a particular challenge for bimanual tasks. To tackle this challenge with an LLM for the first time, we present LAnguage-model-based Bimanual ORchestration (LABOR), an agent utilizing an LLM to analyze task configurations and devise coordination control policies for addressing long-horizon bimanual tasks. In the simulated environment, the LABOR agent is evaluated through several everyday tasks on the NICOL humanoid robot. Reported success rates indicate that overall coordination efficiency is close to optimal performance, while the analysis of failure causes, classified into spatial and temporal coordination and skill selection, shows that these vary over tasks. The project website can be found at https://labor-agent.github.io/
Reinforcement learning (RL) is a powerful technique for training intelligent agents, but understanding why these agents make specific decisions can be quite challenging. This lack of transparency in RL models has been a long-standing problem, making it difficult for users to grasp the reasons behind an agent's behaviour. Various approaches have been explored to address this problem, with one promising avenue being reward decomposition (RD). RD is appealing as it sidesteps some of the concerns associated with other methods that attempt to rationalize an agent's behaviour in a post-hoc manner. RD works by exposing various facets of the rewards that contribute to the agent's objectives during training. However, RD alone has limitations as it primarily offers insights based on sub-rewards and does not delve into the intricate cause-and-effect relationships that occur within an RL agent's neural model. In this paper, we present an extension of RD that goes beyond sub-rewards to provide more informative explanations. Our approach is centred on a causal learning framework that leverages information-theoretic measures for explanation objectives that encourage three crucial properties of causal factors: causal sufficiency, sparseness, and orthogonality. These properties help us distill the cause-and-effect relationships between the agent's states and actions or rewards, allowing for a deeper understanding of its decision-making processes. Our framework is designed to generate local explanations and can be applied to a wide range of RL tasks with multiple reward channels. Through a series of experiments, we demonstrate that our approach offers more meaningful and insightful explanations for the agent's action selections.
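For context, the reward-decomposition building block that the paper extends can be sketched as a Q-network with one head per reward channel (PyTorch; layer sizes and names are illustrative, and the causal-learning extension itself is not reproduced here).

```python
import torch
import torch.nn as nn

class DecomposedQNetwork(nn.Module):
    """One Q-head per reward channel; the agent acts on the summed Q-values,
    while the per-channel values expose which sub-rewards drive each action."""
    def __init__(self, state_dim, num_actions, reward_channels):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.heads = nn.ModuleList(
            nn.Linear(128, num_actions) for _ in range(reward_channels))

    def forward(self, state):
        h = self.trunk(state)
        q_per_channel = torch.stack([head(h) for head in self.heads], dim=1)
        return q_per_channel, q_per_channel.sum(dim=1)   # (B, C, A), (B, A)
```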
Humans can effortlessly modify various prosodic attributes, such as the placement of stress and the intensity of sentiment, to convey a specific emotion while maintaining consistent linguistic content. Motivated by this capability, we propose EmoAug, a novel style transfer model designed to enhance emotional expression and tackle the data scarcity issue in speech emotion recognition (SER) tasks. EmoAug consists of a semantic encoder and a paralinguistic encoder that represent verbal and non-verbal information, respectively. Additionally, a decoder reconstructs speech signals by conditioning on the aforementioned two information flows in an unsupervised fashion. Once training is completed, EmoAug enriches expressions of emotional speech with different prosodic attributes, such as stress, rhythm and intensity, by feeding different styles into the paralinguistic encoder. EmoAug also enables us to generate similar numbers of samples for each class to tackle the data imbalance issue. Experimental results on the IEMOCAP dataset demonstrate that EmoAug can successfully transfer different speaking styles while retaining the speaker identity and semantic content. Furthermore, we train an SER model with data augmented by EmoAug and show that the augmented model not only surpasses the state-of-the-art supervised and self-supervised methods but also overcomes overfitting problems caused by data imbalance. Some audio samples can be found on our demo website.
Mirroring non-verbal social cues such as affect or movement can enhance human-human and human-robot interactions in the real world. The robotic platforms and control methods also impact people's perception of human-robot interaction. However, few studies have compared robot imitation across different platforms and control methods. Our research addresses this gap by conducting two experiments comparing people's perception of affective mirroring between the iCub and Pepper robots and movement mirroring between vision-based iCub control and Inertial Measurement Unit (IMU)-based iCub control. We discovered that the iCub robot was perceived as more humanlike than the Pepper robot when mirroring affect. A vision-based controlled iCub outperformed the IMU-based controlled one in the movement mirroring task. Our findings suggest that different robotic platforms impact people's perception of robots' mirroring during HRI. The control method also contributes to the robot's mirroring performance. Our work sheds light on the design and application of different humanoid robots in the real world.
Message-oriented and robotics middleware play an important role in facilitating robot control, abstracting complex functionality, and unifying communication patterns between sensors and devices. However, using multiple middleware frameworks presents a challenge in integrating different robots within a single system. To address this challenge, we present Wrapyfi, a Python wrapper supporting multiple message-oriented and robotics middleware, including ZeroMQ, YARP, ROS, and ROS 2. Wrapyfi also provides plugins for exchanging deep learning framework data, without additional encoding or preprocessing steps. Using Wrapyfi eases the development of scripts that run on multiple machines, thereby enabling cross-platform communication and workload distribution. We finally present the three communication schemes that form the cornerstone of Wrapyfi's communication model, along with examples that demonstrate their applicability.
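To make the underlying message-oriented pattern concrete, here is a plain ZeroMQ publish/subscribe pair; note this is raw pyzmq, not Wrapyfi's own API, and the topic string is made up for the example.

```python
import zmq

# Publisher side: broadcast string messages under a topic prefix.
ctx = zmq.Context()
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5555")
pub.send_string("sensor/camera ready")

# Subscriber side (typically a separate process): filter by topic prefix.
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://localhost:5555")
sub.setsockopt_string(zmq.SUBSCRIBE, "sensor/")
# message = sub.recv_string()   # blocks until a matching message arrives
```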
Weakly supervised referring expression comprehension (REC) aims to ground target objects in images according to given referring expressions, while the mappings between image regions and referring expressions are unavailable during the model training phase. Existing models typically reconstruct the multimodal relationships to ground targets by utilizing off-the-shelf information, and neglect to further exploit helpful knowledge to enhance the model performance. To address this issue, we propose an adaptive knowledge distillation architecture to enrich the predominant pattern of weakly supervised REC and transfer the target-aware and interaction-aware knowledge from a pre-trained teacher grounder to enhance the grounding performance of the student model. Specifically, in order to encourage the teacher to impart more reliable knowledge, we present a Knowledge Confidence-Based Adaptive Temperature (KCAT) learning approach to learn optimal temperatures to transfer the target-aware and interaction-aware knowledge with higher prediction confidence. Moreover, to urge the student to absorb more helpful knowledge, we introduce a Student Competency-Based Adaptive Weight (SCAW) learning strategy to dynamically integrate the distilled target-aware and interaction-aware knowledge to enhance the student's grounding certainty. We conduct extensive experiments on three benchmark datasets, RefCOCO, RefCOCO+, and RefCOCOg, to validate the proposed approach. Experimental results demonstrate that our approach achieves superior performance over state-of-the-art methods with the aid of adaptive knowledge distillation and integration. The code and trained models are available at: https://github.com/dami23/WREC_AdaptiveKD
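The mechanism that KCAT adapts is standard temperature-scaled knowledge distillation, sketched below in PyTorch; how the paper derives per-sample temperatures from teacher confidence is not reproduced, so a plain temperature argument stands in for it.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature):
    """Temperature-scaled KD loss; an adaptive scheme would supply a
    temperature derived from teacher confidence instead of a constant."""
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # KL divergence scaled by t^2 to keep gradient magnitudes comparable.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * (t ** 2)
```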
We investigate the use of Large Language Models (LLMs) to equip neural robotic agents with human-like social and cognitive competencies, for the purpose of open-ended human-robot conversation and collaboration. We introduce a modular and extensible methodology for grounding an LLM with the sensory perceptions and capabilities of a physical robot, and integrate multiple deep learning models throughout the architecture in a form of system integration. The integrated models encompass various functions such as speech recognition, speech generation, open-vocabulary object detection, human pose estimation, and gesture detection, with the LLM serving as the central text-based coordinating unit. The qualitative and quantitative results demonstrate the huge potential of LLMs in providing emergent cognition and interactive language-oriented control of robots in a natural and social manner. Video: https://youtu.be/A2WLEuiM3-s
Endowing robots with the human ability to learn a growing set of skills over the course of a lifetime as opposed to mastering single tasks is an open problem in robot learning. While multitask learning approaches have been proposed to address this problem, they pay little attention to task inference. In order to continually learn new tasks, the robot first needs to infer the task at hand without requiring predefined task representations. In this article, we propose a self-supervised task inference approach. Our approach learns action and intention embeddings from self-organization of the observed movement and effect parts of unlabeled demonstrations and a higher level behavior embedding from self-organization of the joint action-intention embeddings. We construct a behavior-matching self-supervised learning objective to train a novel task inference network (TINet) to map an unlabeled demonstration to its nearest behavior embedding, which we use as the task representation. A multitask policy is built on top of the TINet and trained with reinforcement learning to optimize performance over tasks. We evaluate our approach in the fixed-set and continual multitask learning settings with a humanoid robot and compare it to different multitask learning baselines. The results show that our approach outperforms the other baselines, with the difference being more pronounced in the challenging continual learning setting, and can infer tasks from incomplete demonstrations. Our approach is also shown to generalize to unseen tasks based on a single demonstration in one-shot task generalization experiments.
Natural language processing and vision tasks have recently seen large improvements through the rise of Transformer architectures. The high-performing large language models (LLMs) benefit from large textual datasets that are abundantly available online. However, action and bidirectional action-language tasks are less developed, as these require more specific and labeled data. Therefore, we aim at enabling these robotic action capabilities for a pretrained LLM, while maintaining high efficiency with regard to the required training time and data size. To achieve this, we split up a Transformer-based LLM and insert a multimodal architecture into it. Specifically, we split a pretrained T5 LLM between its encoder and decoder parts, to insert a crossmodal Transformer component of a Paired Transformed Autoencoders (PTAE) bidirectional action-language model. The experiments are conducted on a new dataset, consisting of unimodal language translation and crossmodal bidirectional action-language translation. The natural language capabilities of the original T5 are re-established efficiently by training the crossmodal Transformer, which requires only one 5.7-millionth of the T5 model's original training data. Furthermore, the new model, called CrossT5, achieves high accuracy for the vision- and language-guided robotic action tasks. By design, the CrossT5 agent acts robustly when tested with language commands not included in the dataset. The results demonstrate that this novel approach is successful in combining the advanced linguistic capabilities of LLMs with the low-level robotic control skills of vision-action models. The code is available at this URL: https://github.com/samsoneko/CrossT5.
Interdisciplinary research, drawing from robotics, artificial intelligence, neuroscience, psychology, and cognitive science, is a cornerstone to advance the state-of-the-art in multimodal human-robot interaction and neuro-cognitive modeling. Research on neuro-cognitive models benefits from the embodiment of these models into physical, humanoid agents that possess complex, human-like sensorimotor capabilities for multimodal interaction with the real world. For this purpose, we develop and introduce NICO (Neuro-Inspired COmpanion), a humanoid developmental robot that fills a gap between necessary sensing and interaction capabilities and flexible design. This combination makes it a novel neuro-cognitive research platform for embodied sensorimotor computational and cognitive models in the context of multimodal interaction as shown in our results.
Robot facial expressions and gaze are important factors for enhancing human-robot interaction (HRI), but their effects on human collaboration and perception are not well understood, for instance, in collaborative game scenarios. In this study, we designed a collaborative triadic HRI game scenario where two participants worked together to insert objects into a shape sorter. One participant assumed the role of a guide. The guide instructed the other participant, who played the role of an actor, to place occluded objects into the sorter. A humanoid robot issued instructions, observed the interaction, and displayed social cues to elicit changes in the two participants' behavior. We measured human collaboration as a function of task completion time and the participants' perceptions of the robot by rating its behavior as intelligent or random. Participants also evaluated the robot by filling out the Godspeed questionnaire. We found that human collaboration was higher when the robot displayed a happy facial expression at the beginning of the game compared to a neutral facial expression. We also found that participants perceived the robot as more intelligent when it displayed a positive facial expression at the end of the game. The robot's behavior was also perceived as intelligent when directing its gaze toward the guide at the beginning of the interaction, not the actor. These findings provide insights into how robot facial expressions and gaze influence human behavior and perception in collaboration.
A desirable trait of an artificial agent acting in the visual world is to continually learn a sequence of language-informed tasks while striking a balance between sufficiently specializing in each task and building a generalized knowledge for transfer. Selective specialization, i.e., a careful selection of model components to specialize in each task, is a strategy to provide control over this trade-off. However, the design of selection strategies requires insights on the role of each model component in learning rather specialized or generalizable representations, which poses a gap in current research. Thus, our aim with this work is to provide an extensive analysis of selection strategies for visually grounded continual language learning. Due to the lack of suitable benchmarks for this purpose, we introduce two novel diagnostic datasets that provide enough control and flexibility for a thorough model analysis. We assess various heuristics for module specialization strategies as well as quantifiable measures for two different types of model architectures. Finally, we design conceptually simple approaches based on our analysis that outperform common continual learning baselines. Our results demonstrate the need for further efforts towards better aligning continual learning algorithms with the learning behaviors of individual model parts.
In this paper, we review recent approaches for explaining concepts in neural networks. Concepts can act as a natural link between learning and reasoning: once the concepts that a neural learning system uses are identified, one can integrate those concepts with a reasoning system for inference or use a reasoning system to act upon them to improve or enhance the learning system. On the other hand, knowledge can not only be extracted from neural networks but concept knowledge can also be inserted into neural network architectures. Since integrating learning and reasoning is at the core of neuro-symbolic AI, the insights gained from this survey can serve as an important step towards realizing neuro-symbolic AI based on explainable concepts.
Model-based reinforcement learning (MBRL) with real-time planning has shown great potential in locomotion and manipulation control tasks. However, the existing planning methods, such as the Cross-Entropy Method (CEM), do not scale well to complex high-dimensional environments. One of the key reasons for underperformance is the lack of exploration, as these planning methods only aim to maximize the cumulative extrinsic reward over the planning horizon. Furthermore, planning inside the compact latent space in the absence of observations makes it challenging to use curiosity-based intrinsic motivation. We propose Curiosity CEM (CCEM), an improved version of the CEM algorithm for encouraging exploration via curiosity. Our proposed method maximizes the sum of state-action Q values over the planning horizon, in which these Q values estimate the future extrinsic and intrinsic reward, hence encouraging the agent to reach novel observations. In addition, our model uses contrastive representation learning to efficiently learn latent representations. Experiments on image-based continuous control tasks from the DeepMind Control suite show that CCEM is by a large margin more sample-efficient than previous MBRL algorithms and compares favorably with the best model-free RL methods.
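A minimal sketch of the CEM planning loop with a Q-value objective is given below (NumPy; `dynamics` and `q_fn` are placeholder callables, and the curiosity bonus is assumed to already be folded into the learned Q-values, as the abstract describes).

```python
import numpy as np

def cem_plan(latent, dynamics, q_fn, action_dim, horizon=8,
             population=500, elites=50, iterations=5, rng=None):
    """Cross-Entropy Method over action sequences; candidates are scored by the
    sum of Q-values along the imagined latent trajectory."""
    rng = rng or np.random.default_rng()
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        actions = rng.normal(mean, std, size=(population, horizon, action_dim))
        actions = np.clip(actions, -1.0, 1.0)
        z = np.repeat(latent[None, :], population, axis=0)
        scores = np.zeros(population)
        for t in range(horizon):
            scores += q_fn(z, actions[:, t])          # estimated future return
            z = dynamics(z, actions[:, t])            # latent rollout
        elite_idx = np.argsort(scores)[-elites:]
        mean = actions[elite_idx].mean(axis=0)        # refit sampling distribution
        std = actions[elite_idx].std(axis=0) + 1e-6
    return mean[0]                                    # first action of the refined plan
```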
Programming robot behavior in a complex world faces challenges on multiple levels, from dexterous low-level skills to high-level planning and reasoning. Recent pre-trained Large Language Models (LLMs) have shown remarkable reasoning ability in few-shot robotic planning. However, it remains challenging to ground LLMs in multimodal sensory input and continuous action output, while enabling a robot to interact with its environment and acquire novel information as its policies unfold. We develop a robot interaction scenario with a partially observable state, which requires a robot to decide on a range of epistemic actions in order to sample sensory information among multiple modalities, before being able to execute the task correctly. Matcha (Multimodal environment chatting) agent, an interactive perception framework, is therefore proposed with an LLM as its backbone, whose ability is exploited to instruct epistemic actions and to reason over the resulting multimodal sensations (vision, sound, haptics, proprioception), as well as to plan an entire task execution based on the interactively acquired information. Our study demonstrates that LLMs can provide high-level planning and reasoning skills and control interactive robot behavior in a multimodal environment, while multimodal modules with the context of the environmental state help ground the LLMs and extend their processing ability. The project website can be found at https://matcha-agent.github.io.
Neural fields are neural networks which map coordinates to a desired signal. When a neural field is to jointly model multiple signals, rather than memorize only one, it needs to be conditioned on a latent code which describes the signal at hand. Despite being an important aspect, there has been little research on conditioning strategies for neural fields. In this work, we explore the use of neural fields as decoders for 2D semantic segmentation. For this task, we compare three conditioning methods, namely simple concatenation of the latent code, Feature-wise Linear Modulation (FiLM), and Cross-Attention, in conjunction with latent codes which either describe the full image or only a local region of the image. Our results show a considerable difference in performance between the examined conditioning strategies. Furthermore, we show that conditioning via Cross-Attention achieves the best results and is competitive with a CNN-based decoder for semantic segmentation.
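Of the three conditioning strategies compared, FiLM is the easiest to show compactly. The PyTorch sketch below applies a latent-code-predicted scale and shift to decoder features; layer sizes are illustrative and this is not the paper's exact decoder.

```python
import torch
import torch.nn as nn

class FiLMConditionedBlock(nn.Module):
    """Feature-wise Linear Modulation: the latent code predicts a per-channel
    scale (gamma) and shift (beta) applied to the decoder features."""
    def __init__(self, feat_dim, latent_dim):
        super().__init__()
        self.to_gamma_beta = nn.Linear(latent_dim, 2 * feat_dim)
        self.layer = nn.Linear(feat_dim, feat_dim)

    def forward(self, features, latent):
        gamma, beta = self.to_gamma_beta(latent).chunk(2, dim=-1)
        return torch.relu(gamma * self.layer(features) + beta)
```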
The paper introduces CycleIK, a neuro-robotic approach that wraps two novel neuro-inspired methods for the inverse kinematics (IK) task: a Generative Adversarial Network (GAN) and a Multi-Layer Perceptron architecture. These methods can be used in a standalone fashion, but we also show how embedding these into a hybrid neuro-genetic IK pipeline allows for further optimization via sequential least-squares programming (SLSQP) or a genetic algorithm (GA). The models are trained and tested on dense datasets that were collected from random robot configurations of the new Neuro-Inspired COLlaborator (NICOL), a semi-humanoid robot with two redundant 8-DoF manipulators. We utilize the weighted multi-objective function from the state-of-the-art BioIK method to support the training process and our hybrid neuro-genetic architecture. We show that the neural models can compete with state-of-the-art IK approaches, which allows for deployment directly to robotic hardware. Additionally, it is shown that the incorporation of the genetic algorithm improves the precision while simultaneously reducing the overall runtime.
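The hybrid neuro-genetic idea of seeding a classical optimizer with a neural IK guess can be sketched as follows (SciPy SLSQP; `neural_ik`, `fk`, and the joint limits are placeholders, and the real pipeline optimizes BioIK's weighted multi-objective function rather than this single position-error term).

```python
import numpy as np
from scipy.optimize import minimize

def hybrid_ik(target_pose, neural_ik, fk, joint_limits):
    """Seed SLSQP with a neural IK prediction and minimize the Cartesian error.
    target_pose: desired end-effector position; fk: forward kinematics."""
    q0 = neural_ik(target_pose)                       # neural initial guess

    def cartesian_error(q):
        return np.linalg.norm(fk(q) - target_pose)

    result = minimize(cartesian_error, q0, method="SLSQP",
                      bounds=joint_limits, options={"maxiter": 50})
    return result.x
```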
Multimodal integration is a key component of allowing robots to perceive the world. Multimodality comes with multiple challenges that have to be considered, such as how to integrate and fuse the data. In this paper, we compare different possibilities of fusing visual, tactile and proprioceptive data. The data is directly recorded on the NICOL robot in an experimental setup in which the robot has to classify containers and their content. Due to the different nature of the containers, the use of the modalities can differ widely between the classes. We demonstrate the superiority of multimodal solutions in this use case and evaluate three fusion strategies that integrate the data at different time steps. We find that the accuracy of the best fusion strategy is 15% higher than that of the best strategy using only a single sense.
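Two of the standard ways such modalities can be fused are sketched below, early (feature-level) concatenation versus late (decision-level) averaging of logits; dimensions and module names are illustrative and do not correspond to the exact strategies evaluated in the paper.

```python
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    """Concatenate per-modality features before a shared classifier head."""
    def __init__(self, vis_dim, tac_dim, prop_dim, num_classes):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(vis_dim + tac_dim + prop_dim, 256), nn.ReLU(),
            nn.Linear(256, num_classes))

    def forward(self, vis, tac, prop):
        return self.head(torch.cat([vis, tac, prop], dim=-1))

class LateFusionClassifier(nn.Module):
    """Classify each modality separately and average the class logits."""
    def __init__(self, vis_dim, tac_dim, prop_dim, num_classes):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(d, num_classes) for d in (vis_dim, tac_dim, prop_dim))

    def forward(self, vis, tac, prop):
        logits = [h(x) for h, x in zip(self.heads, (vis, tac, prop))]
        return torch.stack(logits, dim=0).mean(dim=0)
```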
While Automatic Speech Recognition (ASR) models have shown significant advances with the introduction of unsupervised or self-supervised training techniques, these improvements are still limited to a subset of languages and speakers. Transfer learning enables the adaptation of large-scale multilingual models to not only low-resource languages but also to more specific speaker groups. However, fine-tuning on data from new domains is usually accompanied by a decrease in performance on the original domain. Therefore, in our experiments, we examine how well the performance of large-scale ASR models can be approximated for smaller domains, with our own dataset of German Senior Voice Commands (SVC-de), and how much of the general speech recognition performance can be preserved by selectively freezing parts of the model during training. To further increase the robustness of the ASR model to vocabulary and speakers outside of the fine-tuned domain, we apply Experience Replay [20] for continual learning. By adding only a fraction of data from the original domain, we are able to reach Word-Error-Rates (WERs) below 5% on the new domain, while stabilizing performance for general speech recognition at acceptable WERs.
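The two ingredients, selective freezing and experience replay, can be sketched generically in PyTorch as below; the parameter-name prefix and the replay fraction are illustrative assumptions, not the paper's configuration.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, Subset

def freeze_by_prefix(model, frozen_prefixes=("feature_extractor",)):
    """Freeze all parameters whose names start with the given prefixes,
    leaving the remaining layers trainable for domain adaptation."""
    for name, param in model.named_parameters():
        if name.startswith(frozen_prefixes):
            param.requires_grad = False

def build_replay_loader(new_domain_ds, original_ds, replay_fraction=0.1,
                        batch_size=16):
    """Experience replay: mix a small random subset of original-domain data
    into the fine-tuning set to reduce forgetting."""
    n_replay = int(len(original_ds) * replay_fraction)
    replay_idx = torch.randperm(len(original_ds))[:n_replay].tolist()
    mixed = ConcatDataset([new_domain_ds, Subset(original_ds, replay_idx)])
    return DataLoader(mixed, batch_size=batch_size, shuffle=True)
```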
In recent speech processing research, large End-to-End (E2E) systems for Automatic Speech Recognition (ASR) have reported state-of-the-art performance on various benchmarks. These systems intrinsically learn how to handle and remove noise conditions from speech. Previous research has shown that it is possible to extract the denoising capabilities of these models into a preprocessor network, which can be used as a frontend for downstream ASR models. However, the proposed methods were limited to specific fully convolutional architectures. In this work, we propose a novel method to extract the denoising capabilities, which can be applied to any encoder-decoder architecture. We propose the Cleancoder preprocessor architecture that extracts hidden activations from the Conformer ASR model and feeds them to a decoder to predict denoised spectrograms. We train our preprocessor on the Noisy Speech Database (NSD) to reconstruct denoised spectrograms from noisy inputs. Then, we evaluate our model as a frontend to a pretrained Conformer ASR model as well as a frontend to train smaller Conformer ASR models from scratch. We show that the Cleancoder is able to filter noise from speech and that it improves the total Word Error Rate (WER) of the downstream model in noisy conditions for both applications.
Recent developments of sensors that allow tracking of human movements and gestures enable rapid progress of applications in domains like medical rehabilitation or robotic control. Especially the inertial measurement unit (IMU) is an excellent device for real-time scenarios as it rapidly delivers data input. Therefore, a computational model must be able to learn gesture sequences in a fast yet robust way. We recently introduced an echo state network (ESN) framework for continuous gesture recognition (Tietz et al., 2019) including novel approaches for gesture spotting, i.e., the automatic detection of the start and end phase of a gesture. Although our results showed good classification performance, we identified significant factors that negatively impact the performance, such as subgestures and gesture variability. To address these issues, we include experiments with Long Short-Term Memory (LSTM) networks, which are a state-of-the-art model for sequence processing, to compare the obtained results with our framework and to evaluate their robustness regarding pitfalls in the recognition process. In this study, we analyze the two conceptually different approaches to processing continuous, variable-length gesture sequences, which yields interesting results when comparing the distinct gesture executions. In addition, our results demonstrate that our ESN framework achieves performance comparable to the LSTM network but has significantly lower training times. We conclude from the present work that ESNs are viable models for continuous gesture recognition, delivering reasonable performance for applications requiring real-time processing, as in robotic or rehabilitation tasks. From our discussion of this comparative study, we suggest prospective improvements on both the experimental and network architecture level.
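For readers unfamiliar with echo state networks, a minimal NumPy sketch is given below: a fixed random reservoir with a ridge-regression readout. Reservoir size, spectral radius, and regularization are generic choices, not the framework's configuration.

```python
import numpy as np

class EchoStateNetwork:
    """Fixed random reservoir with a trainable linear (ridge) readout."""
    def __init__(self, n_in, n_res=500, n_out=10, spectral_radius=0.9,
                 ridge=1e-4, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
        W = rng.uniform(-0.5, 0.5, (n_res, n_res))
        W *= spectral_radius / max(abs(np.linalg.eigvals(W)))   # echo state property
        self.W, self.ridge, self.W_out = W, ridge, np.zeros((n_out, n_res))

    def _states(self, sequence):
        x = np.zeros(self.W.shape[0])
        states = []
        for u in sequence:                                       # reservoir update
            x = np.tanh(self.W_in @ u + self.W @ x)
            states.append(x.copy())
        return np.array(states)

    def fit(self, sequence, targets):
        X = self._states(sequence)                               # (T, n_res)
        A = X.T @ X + self.ridge * np.eye(X.shape[1])
        self.W_out = np.linalg.solve(A, X.T @ targets).T         # ridge regression

    def predict(self, sequence):
        return self._states(sequence) @ self.W_out.T
```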
In recent years some researchers have explored the use of reinforcement learning (RL) algorithms as key components in the solution of various natural language processing (NLP) tasks. For instance, some of these algorithms leveraging deep neural learning have found their way into conversational systems. This paper reviews the state of the art of RL methods for their possible use for different problems of NLP, focusing primarily on conversational systems, mainly due to their growing relevance. We provide detailed descriptions of the problems as well as discussions of why RL is well-suited to solve them. Also, we analyze the advantages and limitations of these methods. Finally, we elaborate on promising research directions in NLP that might benefit from RL.
One-step reinforcement learning explanation methods account for individual actions but fail to consider the agent’s future behavior, which can make their interpretation ambiguous. We propose to address this limitation by providing hierarchical goals as context for one-step explanations. By considering the current hierarchical goal as a context, one-step explanations can be interpreted with higher certainty, as the agent’s future behavior is more predictable. We combine reward decomposition with hierarchical reinforcement learning into a novel explainable reinforcement learning framework, which yields more interpretable, goal-contextualized one-step explanations. With a qualitative analysis of one-step reward decomposition explanations, we first show that their interpretability is indeed limited in scenarios with multiple, different optimal policies—a characteristic shared by other one-step explanation methods. Then, we show that our framework retains high interpretability in such ca...
Whenever we address a specific object or refer to a certain spatial location, we use referential or deictic gestures, usually accompanied by some verbal description. Particularly, pointing gestures are necessary to resolve ambiguities in a scene and they are of crucial importance when verbal communication may fail due to environmental conditions or when two persons simply do not speak the same language. With the currently increasing advances of humanoid robots and their future integration in domestic domains, the development of gesture interfaces complementing human–robot interaction scenarios is of substantial interest. The implementation of an intuitive gesture scenario is still challenging because both the pointing intention and the corresponding object have to be correctly recognized in real time. The demand increases when considering pointing gestures in a cluttered environment, as is the case in households. Also, humans perform pointing in many different ways and ...
Most learning algorithms require the practitioner to manually set the values of many hyperparameters before the learning process can begin. However, with modern algorithms, the evaluation of a given hyperparameter setting can take a considerable amount of time and the search space is often very high-dimensional. We suggest using a lower-dimensional representation of the original data to quickly identify promising areas in the hyperparameter space. This information can then be used to initialize the optimization algorithm for the original, higher-dimensional data. We compare this approach with the standard procedure of optimizing the hyperparameters only on the original input. We perform experiments with various state-of-the-art hyperparameter optimization algorithms such as random search, the Tree of Parzen Estimators (TPE), sequential model-based algorithm configuration (SMAC), and a genetic algorithm (GA). Our experiments indicate that it is possible to speed up the optimization ...
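The warm-starting idea can be sketched with random search and a PCA-reduced copy of the data (scikit-learn); the callables `train_eval` and `sample_space`, the budget split, and the number of PCA components are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def warm_started_search(train_eval, X, y, sample_space, budget=60,
                        cheap_fraction=0.5, n_components=10, top_k=5, rng=None):
    """Phase 1: random search on a low-dimensional PCA projection of the data.
    Phase 2: re-evaluate only the most promising settings on the original data."""
    rng = rng or np.random.default_rng()
    X_low = PCA(n_components=n_components).fit_transform(X)

    cheap_budget = int(budget * cheap_fraction)
    candidates = [sample_space(rng) for _ in range(cheap_budget)]
    cheap_scores = [train_eval(params, X_low, y) for params in candidates]

    promising = [candidates[i] for i in np.argsort(cheap_scores)[-top_k:]]
    full_scores = [train_eval(params, X, y) for params in promising]
    return promising[int(np.argmax(full_scores))]
```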
In this paper, we present a new intrinsically motivated actor-critic algorithm for learning continuous motor skills directly from raw visual input. Our neural architecture is composed of a critic and an actor network. Both networks receive the hidden representation of a deep convolutional autoencoder which is trained to reconstruct the visual input, while the centre-most hidden representation is also optimized to estimate the state value. Separately, an ensemble of predictive world models generates, based on its learning progress, an intrinsic reward signal which is combined with the extrinsic reward to guide the exploration of the actor-critic learner. Our approach is more data-efficient and inherently more stable than the existing actor-critic methods for continuous control from pixel data. We evaluate our algorithm for the task of learning robotic reaching and grasping skills on a realistic physics simulator and on a humanoid robot. The results show that the control policies lear...
Spoken language is one of the most efficient ways to instruct robots about performing domestic tasks. However, the state of the environment has to be considered to plan and execute actions successfully. We propose a system that learns to recognise the user's intention and map it to a goal. A reinforcement learning (RL) system then generates a sequence of actions toward this goal considering the state of the environment. A novel contribution in this paper is the use of symbolic representations for both input and output of a neural Deep Q-network (DQN), which enables it to be used in a hybrid system. To show the effectiveness of our approach, the Tell-Me-Dave corpus is used to train an intention detection model and in a second step an RL agent generates the sequences of actions towards the detected objective, represented by a set of state predicates. We show that the system can successfully recognise command sequences from this corpus as well as train the deep-RL network with symbolic i...
Classical fear conditioning has experienced a growing interest over the last decade. Fear learning mechanisms are a simple and robust learning paradigm that involves sensory and motor areas. We believe that a deeper study of these mechanisms will contribute not only to a better understanding of fear conditioning but also to the development of future robot generations. 2. System Overview: Here, we present a biologically motivated model of auditory cue conditioning. The model includes the known thalamic and auditory cortex routes plus reward learning based on phasic dynamics of dopamine. We aim to develop a biologically plausible architecture able to run on a real humanoid robot. Our architecture is based on a reward prediction error model presented by Lowe et al., 2009 [3]. We included three bio-plausible pathways to the amygdala, including the auditory thalamus, auditory cortex and prefrontal cortex (PFC). Applications of this learning mechanism may be used in artificia...
The visual recognition of complex, articulated human movements is fundamental for a wide range of artificial systems oriented toward human-robot communication, action classification, and action-driven perception. These challenging tasks may generally involve the processing of a huge amount of visual information and learning-based mechanisms for generalizing a set of training actions and classifying new samples. To operate in natural environments, a crucial property is the efficient and robust recognition of actions, also under noisy conditions caused by, for instance, systematic sensor errors and temporarily occluded persons. Studies of the mammalian visual system and its outperforming ability to process biological motion information suggest separate neural pathways for the distinct processing of pose and motion features at multiple levels and the subsequent integration of these visual cues for action perception. We present a neurobiologically-motivated approach to achieve noise-tolerant action recognition in real time. Our model consists of self-organizing Growing When Required (GWR) networks that obtain progressively generalized representations of sensory inputs and learn inherent spatio-temporal dependencies. During the training, the GWR networks dynamically change their topological structure to better match the input space. We first extract pose and motion features from video sequences and then cluster actions in terms of prototypical pose-motion trajectories. Multi-cue trajectories from matching action frames are subsequently combined to provide action dynamics in the joint feature space. Reported experiments show that our approach outperforms previous results on a dataset of full-body actions captured with a depth sensor, and ranks among the best results for a public benchmark of domestic daily actions.
Saliency prediction refers to the computational task of modeling overt attention. Social cues greatly influence our attention, consequently altering our eye movements and behavior. To emphasize the efficacy of such features, we present a neural model for integrating social cues and weighting their influences. Our model consists of two stages. During the first stage, we detect two social cues by following gaze, estimating gaze direction, and recognizing affect. These features are then transformed into spatiotemporal maps through image processing operations. The transformed representations are propagated to the second stage (GASP), where we explore various techniques of late fusion for integrating social cues and introduce two sub-networks for directing attention to relevant stimuli. Our experiments indicate that fusion approaches achieve better results for static integration methods, whereas non-fusion approaches for which the influence of each modality is unknown result in better outcomes when coupled with recurrent models for dynamic saliency prediction. We show that gaze direction and affective representations contribute an improvement in prediction-to-ground-truth correspondence of at least 5% compared to dynamic saliency models without social cues. Furthermore, affective representations improve GASP, supporting the necessity of considering affect-biased attention in predicting saliency.
In this work, we propose a novel training scheme to modularize end-to-end systems. Our training scheme alters the flow of information in an end-to-end system so that the kernels of this system can be reused in another system that fulfills a different task. We apply this scheme to extract the noise reduction capabilities from a noise-robust automatic speech recognition (ASR) system and build a speech enhancer from it. This enhancer receives spectral representations of unfiltered audio and outputs cleaned spectral representations. Our enhancer can be integrated into an ASR system as a front-end, is trainable, and reduces background noise. Our front-end uses a decoder to clean speech based on the hidden activations of the ASR system Jasper. During training, we exclusively adapt the weights of our decoder and the batch normalization layers in Jasper. The resulting spectral representations show less background noise, and areas of the spectral features that do not contribute to speech recognition are not reconstructed. We demonstrate that our front-end can be combined with a pre-trained ASR system as a back-end and supports speech recognition in noisy conditions. Further, we show that training another ASR system with our front-end improves that system's performance in noisy as well as noiseless conditions, especially on challenging speech datasets.
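The parameter-selection scheme described above (updating only the enhancement decoder and the batch-normalization layers of the otherwise frozen ASR model) can be sketched as follows. The small convolutional stack merely stands in for the Jasper encoder, and the shapes, layer sizes, and MSE objective are illustrative assumptions.

```python
import torch
import torch.nn as nn

asr = nn.Sequential(                         # placeholder for a pre-trained Jasper encoder
    nn.Conv1d(64, 256, kernel_size=11, padding=5),
    nn.BatchNorm1d(256),
    nn.ReLU(),
)
decoder = nn.Conv1d(256, 64, kernel_size=11, padding=5)   # maps activations to clean spectra

# Freeze everything in the ASR model except its batch-normalization parameters.
for module in asr.modules():
    trainable = isinstance(module, nn.BatchNorm1d)
    for p in module.parameters(recurse=False):
        p.requires_grad = trainable

params = [p for p in asr.parameters() if p.requires_grad] + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

noisy = torch.rand(8, 64, 200)               # noisy spectral features (batch, bins, frames)
clean = torch.rand(8, 64, 200)               # paired clean target
loss = nn.functional.mse_loss(decoder(asr(noisy)), clean)
loss.backward()
optimizer.step()
```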
Neural networks can be powerful function approximators, able to model high-dimensional feature distributions from a subset of examples drawn from the target distribution. Naturally, they perform well at generalizing within the limits of their target function, but they often fail to generalize outside of the explicitly learned feature space. It is therefore an open research question whether, and how, neural network-based architectures can be deployed for systematic reasoning. Many studies have shown evidence of poor generalization, but they often work with abstract data or are limited to single-channel input. Humans, however, learn and interact through a combination of multiple sensory modalities and rarely rely on just one. To investigate compositional generalization in a multimodal setting, we generate an extensible dataset of multimodal input sequences from simulation. We investigate the influence of the underlying training data distribution on compositional generalization in a minimal LSTM-based network trained in a supervised, time-continuous setting. We find that compositional generalization fails in simple setups but improves with the number of objects and actions, and particularly with greater color overlap between objects. Furthermore, multimodality strongly improves compositional generalization in settings where a pure vision model struggles to generalize.
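A minimal version of the multimodal setup can be sketched as an LSTM that receives concatenated per-timestep visual and proprioceptive features and produces a per-timestep prediction. The feature dimensions, the choice of second modality, and the output target are illustrative placeholders, not the dataset's actual specification.

```python
import torch
import torch.nn as nn

class MultimodalLSTM(nn.Module):
    """Minimal LSTM over concatenated multimodal input sequences (illustrative)."""
    def __init__(self, vision_dim=32, proprio_dim=8, hidden=64, out_dim=16):
        super().__init__()
        self.lstm = nn.LSTM(vision_dim + proprio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, vision, proprio):
        # vision: (B, T, vision_dim), proprio: (B, T, proprio_dim)
        fused, _ = self.lstm(torch.cat([vision, proprio], dim=-1))
        return self.head(fused)               # per-timestep prediction (B, T, out_dim)

model = MultimodalLSTM()
out = model(torch.rand(4, 50, 32), torch.rand(4, 50, 8))
print(out.shape)                              # torch.Size([4, 50, 16])
```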
Model-free reinforcement learning algorithms can learn an optimal policy from experience without requiring prior knowledge. However, model-free agents require vast amounts of samples, particularly in sparse reward environments where most states yield zero reward. We developed a model-based approach to tackle the high sample complexity of sparse reward settings with continuous actions. A particle swarm optimization (PSO) planner queries a trained world model and serves as the action selection mechanism, hence taking the role of the actor in an actor-critic architecture. The parameters of the PSO regulate the agent's exploration rate. We show that the planner helps the agent discover rewards even in regions with zero value gradient. Our simple planning-integrated policy architecture learns more efficiently, from fewer samples, than continuous model-free algorithms.
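The role of the PSO planner as the actor can be sketched as follows: particles are candidate action sequences, each scored by rolling it out through the learned world model and adding a terminal value estimate. Here `world_model(state, action) -> (next_state, reward)` and `value_fn(state)` are placeholder callables, and the PSO hyperparameters are illustrative rather than the paper's settings.

```python
import numpy as np

def pso_plan(state, world_model, value_fn, horizon=5, act_dim=2,
             n_particles=64, iters=20, w=0.7, c1=1.5, c2=1.5):
    """Select an action by optimizing imagined returns with PSO (illustrative)."""
    pos = np.random.uniform(-1, 1, (n_particles, horizon, act_dim))
    vel = np.zeros_like(pos)

    def score(actions):
        s, ret = state, 0.0
        for a in actions:                     # roll the candidate sequence out in imagination
            s, r = world_model(s, a)
            ret += r
        return ret + value_fn(s)              # bootstrap the return beyond the horizon

    fitness = np.array([score(p) for p in pos])
    best_pos, best_fit = pos.copy(), fitness.copy()
    g = best_pos[np.argmax(best_fit)]         # global best particle

    for _ in range(iters):
        r1, r2 = np.random.rand(2, n_particles, 1, 1)
        vel = w * vel + c1 * r1 * (best_pos - pos) + c2 * r2 * (g - pos)
        pos = np.clip(pos + vel, -1, 1)
        fitness = np.array([score(p) for p in pos])
        improved = fitness > best_fit
        best_pos[improved], best_fit[improved] = pos[improved], fitness[improved]
        g = best_pos[np.argmax(best_fit)]

    return g[0]                               # execute only the first planned action
```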
Recently, a generative variational autoencoder (VAE) has been proposed for speech enhancement to model speech statistics. However, this approach uses only clean speech in the training phase, making the estimation particularly sensitive to the presence of noise, especially at low signal-to-noise ratios (SNRs). To increase the robustness of the VAE, we propose to include noise information in the training phase by using a noise-aware encoder trained on noisy-clean speech pairs. We evaluate our approach on real recordings of different noisy environments and acoustic conditions using two different noise datasets. We show that our proposed noise-aware VAE outperforms the standard VAE in terms of overall distortion without increasing the number of model parameters. At the same time, we demonstrate that our model generalizes to unseen noise conditions better than a supervised feedforward deep neural network (DNN). Furthermore, we demonstrate that the model's performance is robust to a reduction in the size of the noisy-clean speech training data.
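A conceptual sketch of the noise-aware training idea: the encoder consumes noisy speech features while the decoder is trained to reconstruct the paired clean features, regularized by the usual VAE KL term. Layer sizes are arbitrary and the plain MSE reconstruction loss is an illustrative stand-in; the actual model may use a different likelihood on the spectral features.

```python
import torch
import torch.nn as nn

class NoiseAwareVAE(nn.Module):
    """VAE whose encoder sees noisy speech; decoder targets clean speech (sketch)."""
    def __init__(self, feat_dim=513, latent_dim=32, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, feat_dim))

    def forward(self, noisy):
        h = self.enc(noisy)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        return self.dec(z), mu, logvar

def loss_fn(recon, clean, mu, logvar):
    rec = nn.functional.mse_loss(recon, clean)                    # reconstruct the clean pair
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```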
Recently there has been an interest in the potential of learning generative models from a single image, as opposed to from a large dataset. This task is significant because it means generative models can be used in domains where collecting a large dataset is not feasible. However, training a model capable of generating realistic images from only a single sample is a difficult problem. In this work, we conduct a number of experiments to understand the challenges of training these methods and propose some best practices that we found allow us to generate improved results over previous work. One key insight is that, unlike prior single-image generation methods, we train several stages of the sequential multi-stage pipeline concurrently, allowing us to learn models with fewer stages of increasing image resolution. Compared to a recent state-of-the-art baseline, our model is up to six times faster to train, has fewer parameters, and can better capture the global structure of images.
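The concurrent-stage training idea can be sketched as keeping the k most recently added generator stages trainable while earlier stages stay frozen, with one optimizer covering all active stages. The stage modules and hyperparameters below are placeholders for illustration, not the paper's actual generator.

```python
import torch
import torch.nn as nn

stages = nn.ModuleList(nn.Conv2d(3, 3, kernel_size=3, padding=1) for _ in range(5))
k = 3                                          # number of stages trained concurrently

def configure_trainable(active_from: int):
    """Freeze stages below `active_from`; optimize the remaining stages jointly."""
    for i, stage in enumerate(stages):
        for p in stage.parameters():
            p.requires_grad = i >= active_from
    trainable = [p for p in stages.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=5e-4)

# When the fifth stage is added, stages 2-4 are optimized together; 0-1 stay fixed.
optimizer = configure_trainable(active_from=len(stages) - k)
```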
This video shows a friendly human-robot interaction using humanoid Nao robots. The speaker teaches the robot the names of objects using speech. This work shows the successful integration of three different projects, mainly using Artificial Neural Networks: (1) object recognition with an RGB-D (color and depth) sensor, (2) speech-to-text using an approach that post-processes Google's speech recognition hypotheses, and (3) syntactic interpretation of sentences.
In the Knowledge Technology group (https://www.informatik.uni-hamburg.de/wtm/) we develop neural network models with a particular focus on robot interaction and development. One of our goals is to better understand grounded communication in humans and machines. Another goal is to use the knowledge we gain to improve multisensory integration and social interaction in humanoid robots. In the context of understanding grounded communication, we will present a deep neural network model for emotion expression recognition for human-robot interaction. Furthermore, we will present a model using self-organizing neural networks for human-action recognition in the context of human-robot assistance. In conclusion, we show how emotion perception and multisensory observation of human actions can contribute to future human-robot cooperation.
Our iCub is lonely and wants to talk to someone. But it will only talk to someone who shows a positive expression towards it. To recognize visual emotion expressions, we use a deep neural network that observes a person's facial expression and body movement. Our iCub is able to classify expressions into three categories: positive, negative, and neutral. After starting a conversation, the iCub will react differently depending on which expression the person shows. The iCub will tease the person, or act sad or angry, and will express itself using its facial features and colours.
The purpose of this Research Topic is to reflect on and discuss links between neuroscience, psychology, computer science and robotics with regard to cross-modal learning, which has, in recent years, emerged as a new area of interdisciplinary research. The term cross-modal learning refers to the synergistic synthesis of information from multiple sensory modalities such that the learning that occurs within any individual sensory modality can be enhanced with information from one or more other modalities. Cross-modal learning is a crucial component of adaptive behavior in a continuously changing world, and examples are ubiquitous: learning to grasp and manipulate objects, learning to walk, learning to read and write, and learning to understand language and its referents. In all these examples, visual, auditory, somatosensory or other modalities have to be integrated, and learning must be cross-modal. In fact, the broad range of acquired human skills is cross-modal, and many of the most advanced human capabilities, such as those involved in social cognition, require learning from the richest combinations of cross-modal information. In contrast, even the very best systems in Artificial Intelligence (AI) and robotics have taken only tiny steps in this direction. Building a system that composes a global perspective from multiple distinct sources, types of data, and sensory modalities is a grand challenge of AI, yet it is specific enough that it can be studied rigorously and in such detail that the prospect of deep insights into these mechanisms is plausible in the near term. Cross-modal learning is a broad, interdisciplinary topic that has not yet coalesced into a single, unified field. Instead, there are many separate fields, each tackling the concerns of cross-modal learning from its own perspective, with currently little overlap. We anticipate an accelerating trend towards integration of these areas and intend to contribute to that integration. By focusing on cross-modal learning, this Research Topic brings together recent progress in artificial intelligence, robotics, psychology and neuroscience.
The proceedings set LNCS 12891, LNCS 12892, LNCS 12893, LNCS 12894 and LNCS 12895 constitutes the proceedings of the 30th International Conference on Artificial Neural Networks, ICANN 2021, held in Bratislava, Slovakia, in September 2021.* The 265 full papers presented were carefully reviewed and selected from 496 submissions and are organized in five volumes. In this volume, the papers focus on topics such as adversarial machine learning, anomaly detection, attention and transformers, audio and multimodal applications, bioinformatics and biosignal analysis, capsule networks and cognitive models. *The conference was held online in 2021 due to the COVID-19 pandemic.