Companion Publication of the 2020 International Conference on Multimodal Interaction, 2020
This paper investigates different fusion strategies and provides insights into their effectiveness alongside standalone classifiers in the framework of paralinguistic analysis of infant vocalizations. Combinations of classifiers based on Support Vector Machines (SVM) and Extreme Learning Machines (ELM), as well as a weighted kernel version, are explored, with systems trained on different acoustic feature representations and combined through weighted score-level fusion of their predictions. The proposed framework is tested on the INTERSPEECH ComParE-2019 Baby Sounds corpus, which is a collection of Home Bank infant vocalization corpora annotated for five classes. Adhering to the challenge protocol and using a single test set submission, we outperform the challenge baseline Unweighted Average Recall (UAR) score and achieve a result comparable to the state of the art.
The paper considers the results of the ISS-41/42 crew's activity aboard the “Soyuz TMA-14M” spacecraft and the International Space Station. It also contains a comparative analysis and estimation of the crew’s contribution to the overall flight program of the ISS. Particular attention is paid to the implementation of scientific applied research and experiments aboard the station. Comments and suggestions on upgrading the ISS Russian Segment are given.
The use of robotic systems (RSs) in future manned space missions requires that the cosmonaut-researcher form a holistic view of the forms of interaction within the “human-robot” system (HRS) under adverse environmental conditions. For this purpose, educational and reference materials (ERMs) are needed in the field of ergonomics and its representation in the design of human-machine interfaces (HMI). The paper considers the application of the ontological approach to this subject area, the ergonomics of the HMI, as a way of integrating various scientific fields: informatics, ergonomics, psychophysiology, etc.
This paper presents an analysis and comparison of datasets of human face images with annotated facial keypoints, which are important in human-machine interaction. The datasets are divided into two groups according to the external conditions of the subject: datasets recorded in laboratory conditions and in-the-wild datasets. Moreover, a quick review of the state-of-the-art methods for keypoint detection is provided. Existing methods are categorized into three groups according to their approach to the problem: top-down, bottom-up, and their combination.
In this paper, we present a novel bimodal speech recognition technique that fuses audio information (the sound signal) and visual information (lip movements) for Russian speech recognition. We propose an architecture of an automatic system for bimodal recognition of audio-visual speech, which uses one stationary Oktava microphone and one high-speed JAI Pulnix camera (200 frames per second at 640×480 pixels) to capture audio and video signals. We also describe the software developed for audio-visual speech database recording, the phonemic and visemic structures of the Russian language, and probabilistic models of bimodal speech units based on Coupled Hidden Markov Models. An implementation of a method for transforming a Coupled Hidden Markov Model into an equivalent 2-stream Hidden Markov Model is presented as well.
Proceedings - International Conference on Pattern Recognition, 2010
... Alexey Karpov, Andrey Ronzhin, Irina Kipyatkova, Alexander Ronzhin St. ... Most important of these modules are: (1) video processing with two non-stereo video-cameras and a technology of computer vision in order to detect the human's position, face and some facial organs; (2 ...
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2010
Web-based collaboration using wireless devices with multimedia playback capabilities is a viable alternative to traditional face-to-face meetings. E-meetings are popular in businesses because of their cost savings. To engage quickly and effectively in the meeting activity, the remote user should be able to perceive all events in the meeting room and have the same possibilities as the participants
International Conference on Signal Processing Proceedings, ICSP, 2012
In this paper, we present research on designing and processing an audio-visual speech database for an automatic Russian speech recognition system using an Oktava MK-012 microphone and a JAI Pulnix RMC-6740GE high-speed camera (200 frames per second). The developed audio-visual speech recording system is described; it provides synchronization and fusion of audio and video data recorded by the independent sensors. The system automatically detects voice activity in the audio signal and stores only speech fragments, discarding non-informative signals. It also takes into account and processes the natural asynchrony of the two speech modalities. Methods are presented for feature extraction from acoustic speech (based on Mel-frequency cepstral coefficients) and visual speech (pixel-based features of the mouth region), and for temporal segmentation of multimodal data (by forced alignment).
Client and Speech Detection System for Intelligent Infokiosk Andrey Ronzhin1, Alexey Karpov1, Irina Kipyatkova1, and Miloš Železný2 1 St. ... One portable web-camera Logitech QuickCam Pro is placed above and another below the touchscreen (non stereo-pair of the cameras). ...
Papers by Alexey Karpov