-
Speech Audio Synthesis from Tagged MRI and Non-Negative Matrix Factorization via Plastic Transformer
Authors:
Xiaofeng Liu,
Fangxu Xing,
Maureen Stone,
Jiachen Zhuo,
Sidney Fels,
Jerry L. Prince,
Georges El Fakhri,
Jonghye Woo
Abstract:
The tongue's intricate 3D structure, comprising localized functional units, plays a crucial role in the production of speech. When measured using tagged MRI, these functional units exhibit cohesive displacements and derived quantities that facilitate the complex process of speech production. Non-negative matrix factorization-based approaches have been shown to estimate the functional units through motion features, yielding a set of building blocks and a corresponding weighting map. Investigating the link between weighting maps and speech acoustics can offer significant insights into the intricate process of speech production. To this end, in this work, we utilize two-dimensional spectrograms as a proxy representation, and develop an end-to-end deep learning framework for translating weighting maps to their corresponding audio waveforms. Our proposed plastic light transformer (PLT) framework is based on directional product relative position bias and single-level spatial pyramid pooling, thus enabling flexible processing of weighting maps with variable size to fixed-size spectrograms, without input information loss or dimension expansion. Additionally, our PLT framework efficiently models the global correlation of wide matrix input. To improve the realism of our generated spectrograms with relatively limited training samples, we apply pair-wise utterance consistency with Maximum Mean Discrepancy constraint and adversarial training. Experimental results on a dataset of 29 subjects speaking two utterances demonstrated that our framework is able to synthesize speech audio waveforms from weighting maps, outperforming conventional convolution and transformer models.
Submitted 25 September, 2023;
originally announced September 2023.
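The Maximum Mean Discrepancy constraint mentioned in the abstract above can be illustrated with a minimal NumPy sketch. The abstract does not specify the kernel, the estimator, or the feature space being compared, so the Gaussian kernel, the biased estimator, and the names `gaussian_kernel` and `mmd_squared` are assumptions made purely for illustration.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # Pairwise squared Euclidean distances between rows of a and rows of b.
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-sq_dists / (2.0 * sigma**2))

def mmd_squared(x, y, sigma=1.0):
    # Biased estimate of the squared MMD between two sample sets
    # (rows are samples). The Gaussian kernel is an assumed choice.
    k_xx = gaussian_kernel(x, x, sigma)
    k_yy = gaussian_kernel(y, y, sigma)
    k_xy = gaussian_kernel(x, y, sigma)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()
```

Used as a training penalty, a small value indicates that the two sets of feature vectors are hard to tell apart under the chosen kernel.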
-
Scaling Neural Face Synthesis to High FPS and Low Latency by Neural Caching
Authors:
Frank Yu,
Sid Fels,
Helge Rhodin
Abstract:
Recent neural rendering approaches greatly improve image quality, reaching near photorealism. However, the underlying neural networks have high runtime, precluding telepresence and virtual reality applications that require high resolution at low latency. The sequential dependency of layers in deep networks makes their optimization difficult. We break this dependency by caching information from the previous frame to speed up the processing of the current one with an implicit warp. The warping with a shallow network reduces latency and the caching operations can further be parallelized to improve the frame rate. In contrast to existing temporal neural networks, ours is tailored for the task of rendering novel views of faces by conditioning on the change of the underlying surface mesh. We test the approach on view-dependent rendering of 3D portrait avatars, as needed for telepresence, on established benchmark sequences. Warping reduces latency by 70% (from 49.4ms to 14.9ms on commodity GPUs) and scales frame rates accordingly over multiple GPUs while reducing image quality by only 1%, making it suitable as part of end-to-end view-dependent 3D teleconferencing applications. Our project page can be found at: https://yu-frank.github.io/lowlatency/.
Submitted 10 November, 2022;
originally announced November 2022.
-
A comparative study of two-dimensional vocal tract acoustic modeling based on Finite-Difference Time-Domain methods
Authors:
Debasish Ray Mohapatra,
Victor Zappi,
Sidney Fels
Abstract:
The two-dimensional (2D) numerical approaches for vocal tract (VT) modelling can afford a better balance between the low computational cost and accurate rendering of acoustic wave propagation. However, they require a high spatio-temporal resolution in the numerical scheme for a precise estimation of acoustic formants at the simulation run-time expense. We have recently proposed a new VT acoustic modelling technique, known as the 2.5D Finite-Difference Time-Domain (2.5D FDTD), which extends the existing 2D FDTD approach by adding tube depth to its acoustic wave solver. In this work, first, the simulated acoustic outputs of our new model are shown to be comparable with the 2D FDTD and a realistic 3D FEM VT model at a low spatio-temporal resolution. Next, a radiation model is developed by including a circular baffle around the VT as head geometry. The transfer functions of the radiation model are analyzed using five different vocal tract shapes for vowel sounds /a/, /e/, /i/, /o/ and /u/.
Submitted 8 February, 2021;
originally announced February 2021.
-
SPEAK WITH YOUR HANDS Using Continuous Hand Gestures to control Articulatory Speech Synthesizer
Authors:
Pramit Saha,
Debasish Ray Mohapatra,
Sidney Fels
Abstract:
This work presents our advancements in controlling an articulatory speech synthesis engine, viz., Pink Trombone, with hand gestures. Our interface translates continuous finger movements and wrist flexion into continuous speech using vocal tract area-function based articulatory speech synthesis. We use Cyberglove II with 18 sensors to capture the kinematic information of the wrist and the individual fingers, in order to control a virtual tongue. The coordinates and the bending values of the sensors are then utilized to fit a spline tongue model that smooths out the noisy values and outliers. Considering the upper palate as fixed and the spline model as the dynamically moving lower surface (tongue) of the vocal tract, we compute 1D area functional values that are fed to the Pink Trombone, generating continuous speech sounds. Therefore, by learning to manipulate one's wrist and fingers, one can learn to produce speech sounds just through one's hands, without the need for using the vocal tract.
Submitted 2 February, 2021;
originally announced February 2021.
-
New interfaces for musical expression
Authors:
Ivan Poupyrev,
Michael J. Lyons,
Sidney Fels,
Tina Blaine
Abstract:
The rapid evolution of electronics, digital media, advanced materials, and other areas of technology, is opening up unprecedented opportunities for musical interface inventors and designers. The possibilities afforded by these new technologies carry with them the challenges of a complex and often confusing array of choices for musical composers and performers. New musical technologies are at least partly responsible for the current explosion of new musical forms, some of which are controversial and challenge traditional definitions of music. Alternative musical controllers, currently the leading edge of the ongoing dialogue between technology and musical culture, involve many of the issues covered at past CHI meetings. This workshop brings together interface experts interested in musical controllers and musicians and composers involved in the development of new musical interfaces.
Submitted 27 October, 2020;
originally announced October 2020.
-
Ultra2Speech -- A Deep Learning Framework for Formant Frequency Estimation and Tracking from Ultrasound Tongue Images
Authors:
Pramit Saha,
Yadong Liu,
Bryan Gick,
Sidney Fels
Abstract:
Thousands of individuals need surgical removal of their larynx due to critical diseases every year and therefore require an alternative form of communication to articulate speech sounds after the loss of their voice box. This work addresses the articulatory-to-acoustic mapping problem based on ultrasound (US) tongue images for the development of a silent-speech interface (SSI) that can assist them in their daily interactions. Our approach targets automatically extracting tongue movement information by selecting an optimal feature set from US images and mapping these features to the acoustic space. We use a novel deep learning architecture, which we call Ultrasound2Formant (U2F) Net, to map US tongue images from the US probe placed beneath a subject's chin to formants. It uses hybrid spatio-temporal 3D convolutions followed by feature shuffling, for the estimation and tracking of vowel formants from US images. The formant values are then utilized to synthesize continuous time-varying vowel trajectories via the Klatt Synthesizer. Our best model achieves an R-squared (R^2) measure of 99.96% for the regression task. Our network lays the foundation for an SSI as it successfully tracks the tongue contour automatically as an internal representation without any explicit annotation.
Submitted 29 June, 2020;
originally announced June 2020.
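For readers unfamiliar with the R-squared measure quoted in the abstract above, the following is a minimal sketch of the standard coefficient of determination. How the authors aggregate it across formants and frames is not stated, so the function name `r_squared` and the flat one-dimensional inputs are assumptions for illustration only.

```python
import numpy as np

def r_squared(y_true, y_pred):
    # Coefficient of determination: 1 - (residual sum of squares / total sum of squares).
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

A value of 1.0 means the predictions match the targets exactly, while 0.0 means they do no better than always predicting the mean of the targets.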
-
Learning Joint Articulatory-Acoustic Representations with Normalizing Flows
Authors:
Pramit Saha,
Sidney Fels
Abstract:
The articulatory geometric configurations of the vocal tract and the acoustic properties of the resultant speech sound are considered to have a strong causal relationship. This paper aims at finding a joint latent representation between the articulatory and acoustic domain for vowel sounds via invertible neural network models, while simultaneously preserving the respective domain-specific features. Our model utilizes a convolutional autoencoder architecture and normalizing flow-based models to allow both forward and inverse mappings in a semi-supervised manner, between the mid-sagittal vocal tract geometry of a two degrees-of-freedom articulatory synthesizer with 1D acoustic wave model and the Mel-spectrogram representation of the synthesized speech sounds. Our approach achieves satisfactory performance in achieving both articulatory-to-acoustic as well as acoustic-to-articulatory mapping, thereby demonstrating our success in achieving a joint encoding of both the domains.
Submitted 30 September, 2020; v1 submitted 16 May, 2020;
originally announced May 2020.
-
Variational Learning with Disentanglement-PyTorch
Authors:
Amir H. Abdi,
Purang Abolmaesumi,
Sidney Fels
Abstract:
Unsupervised learning of disentangled representations is an open problem in machine learning. The Disentanglement-PyTorch library is developed to facilitate research, implementation, and testing of new variational algorithms. In this modular library, neural architectures, dimensionality of the latent space, and the training algorithms are fully decoupled, allowing for independent and consistent experiments across variational methods. The library handles the training scheduling, logging, and visualizations of reconstructions and latent space traversals. It also evaluates the encodings based on various disentanglement metrics. The library, so far, includes implementations of the following unsupervised algorithms VAE, Beta-VAE, Factor-VAE, DIP-I-VAE, DIP-II-VAE, Info-VAE, and Beta-TCVAE, as well as conditional approaches such as CVAE and IFCVAE. The library is compatible with the Disentanglement Challenge of NeurIPS 2019, hosted on AICrowd, and achieved the 3rd rank in both the first and second stages of the challenge.
Submitted 11 December, 2019;
originally announced December 2019.
-
A Study into Echocardiography View Conversion
Authors:
Amir H. Abdi,
Mohammad H. Jafari,
Sidney Fels,
Theresa Tsang,
Purang Abolmaesumi
Abstract:
Transthoracic echo is one of the most common means of cardiac studies in clinical routines. During the echo exam, the sonographer captures a set of standard cross sections (echo views) of the heart. Each 2D echo view cuts through the 3D cardiac geometry via a unique plane. Consequently, different views share some limited information. In this work, we investigate the feasibility of generating a 2D echo view using another view based on adversarial generative models. The objective optimized to train the view-conversion model is based on the ideas introduced by LSGAN, PatchGAN and Conditional GAN (cGAN). The size and length of the left ventricle in the generated target echo view are compared against those of the target ground-truth to assess the validity of the echo view conversion. Results show that there is a correlation of 0.50 between the LV areas and 0.49 between the LV lengths of the generated target frames and the real target frames.
Submitted 5 December, 2019;
originally announced December 2019.
-
A Preliminary Study of Disentanglement With Insights on the Inadequacy of Metrics
Authors:
Amir H. Abdi,
Purang Abolmaesumi,
Sidney Fels
Abstract:
Disentangled encoding is an important step towards better representation learning. However, despite the numerous efforts, there still is no clear winner that captures the independent features of the data in an unsupervised fashion. In this work we empirically evaluate the performance of six unsupervised disentanglement approaches on the mpi3d toy dataset curated and released for the NeurIPS 2019 Disentanglement Challenge. The methods investigated in this work are Beta-VAE, Factor-VAE, DIP-I-VAE, DIP-II-VAE, Info-VAE, and Beta-TCVAE. The capacities of all models were progressively increased throughout the training and the hyper-parameters were kept intact across experiments. The methods were evaluated based on five disentanglement metrics, namely, DCI, Factor-VAE, IRS, MIG, and SAP-Score. Within the limitations of this study, the Beta-TCVAE approach was found to outperform its alternatives with respect to the normalized sum of metrics. However, a qualitative study of the encoded latents reveals that there is not a consistent correlation between the reported metrics and the disentanglement potential of the model.
Submitted 26 November, 2019;
originally announced November 2019.
-
EEG-to-F0: Establishing artificial neuro-muscular pathway for kinematics-based fundamental frequency control
Authors:
Himanshu Goyal,
Pramit Saha,
Bryan Gick,
Sidney Fels
Abstract:
The fundamental frequency (F0) of human voice is generally controlled by changing the vocal fold parameters (including tension, length and mass), which in turn are manipulated by the muscle exciters, activated by the neural synergies. In order to begin investigating the neuromuscular F0 control pathway, we simulate a simple biomechanical arm prototype (instead of an artificial vocal tract) that tends to control F0 of an artificial sound synthesiser based on the elbow movements. The intended arm movements are decoded from the EEG signal inputs (collected simultaneously with the kinematic hand data of the participant) through a combined machine learning and biomechanical modeling strategy. The machine learning model is employed to identify the muscle activation of a single-muscle arm model in ArtiSynth (from input brain signal), in order to match the actual kinematic (elbow joint angle) data. The biomechanical model utilises this estimated muscle excitation to produce corresponding changes in elbow angle, which is then linearly mapped to F0 of a vocal sound synthesiser. We use the F0 value mapped from the actual kinematic hand data (via the same function) as the ground truth and compare it against the F0 estimated from the brain signal. A detailed qualitative and quantitative performance comparison shows that the proposed neuromuscular pathway can indeed be utilised to accurately control the vocal fundamental frequency, thereby demonstrating the success of our closed loop neuro-biomechanical control scheme.
Submitted 24 September, 2019;
originally announced October 2019.
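The linear elbow-angle-to-F0 mapping described in the abstract above can be sketched as follows. The abstract gives neither the angle range nor the F0 range, so the default values, the clamping behaviour, and the function name `angle_to_f0` are hypothetical and chosen only to make the idea concrete.

```python
def angle_to_f0(angle_deg, angle_range=(0.0, 150.0), f0_range=(80.0, 300.0)):
    # Hypothetical ranges: elbow angle in degrees, F0 in Hz.
    a_min, a_max = angle_range
    f_min, f_max = f0_range
    # Clamp the angle so out-of-range readings stay inside the F0 range.
    clamped = min(max(angle_deg, a_min), a_max)
    # Normalise to [0, 1], then interpolate linearly between the F0 bounds.
    t = (clamped - a_min) / (a_max - a_min)
    return f_min + t * (f_max - f_min)
```

Applying the same function to both the measured and the EEG-decoded elbow angles yields a ground-truth and an estimated F0 track that can be compared directly.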
-
An extended two-dimensional vocal tract model for fast acoustic simulation of single-axis symmetric three-dimensional tubes
Authors:
Debasish Ray Mohapatra,
Victor Zappi,
Sidney Fels
Abstract:
The simulation of two-dimensional (2D) wave propagation is an affordable computational task and its use can potentially improve time performance in vocal tracts' acoustic analysis. Several models have been designed that rely on 2D wave solvers and include 2D representations of three-dimensional (3D) vocal tract-like geometries. However, until now, only the acoustics of straight 3D tubes with circular cross-sections have been successfully replicated with this approach. Furthermore, the simulation of the resulting 2D shapes requires extremely high spatio-temporal resolutions, dramatically reducing the speed boost deriving from the usage of a 2D wave solver. In this paper, we introduce an in-progress novel vocal tract model that extends the 2D Finite-Difference Time-Domain wave solver (2.5D FDTD) by adding tube depth, derived from the area functions, to the acoustic solver. The model combines the speed of a light 2D numerical scheme with the ability to natively simulate 3D tubes that are symmetric in one dimension, hence relaxing previous resolution requirements. An implementation of the 2.5D FDTD is presented, along with evaluation of its performance in the case of static vowel modeling. The paper discusses the current features and limits of the approach, and the potential impact on computational acoustics applications.
Submitted 18 September, 2019;
originally announced September 2019.
-
Variational Shape Completion for Virtual Planning of Jaw Reconstructive Surgery
Authors:
Amir H. Abdi,
Mehran Pesteie,
Eitan Prisman,
Purang Abolmaesumi,
Sidney Fels
Abstract:
The premorbid geometry of the mandible is of significant relevance in jaw reconstructive surgeries and occasionally unknown to the surgical team. In this paper, an optimization framework is introduced to train deep models for completion (reconstruction) of the missing segments of the bone based on the remaining healthy structure. To leverage the contextual information of the surroundings of the dissected region, the voxel-weighted Dice loss is introduced. To address the non-deterministic nature of the shape completion problem, we leverage a weighted multi-target probabilistic solution which is an extension to the conditional variational autoencoder (CVAE). This approach considers multiple targets as acceptable reconstructions, each weighted according to their conformity with the original shape. We quantify the performance gain of the proposed method against similar algorithms, including CVAE, where we report statistically significant improvements in both deterministic and probabilistic paradigms. The probabilistic model is also evaluated on its ability to generate anatomically relevant variations for the missing bone. As a unique aspect of this work, the model is tested on real surgical cases where the clinical relevancy of its reconstructions and their compliance with surgeon's virtual plan are demonstrated as necessary steps towards clinical adoption.
Submitted 15 July, 2019; v1 submitted 27 June, 2019;
originally announced June 2019.
-
SPEAK YOUR MIND! Towards Imagined Speech Recognition With Hierarchical Deep Learning
Authors:
Pramit Saha,
Muhammad Abdul-Mageed,
Sidney Fels
Abstract:
Speech-related Brain Computer Interface (BCI) technologies provide effective vocal communication strategies for controlling devices through speech commands interpreted from brain signals. In order to infer imagined speech from active thoughts, we propose a novel hierarchical deep learning BCI system for subject-independent classification of 11 speech tokens including phonemes and words. Our novel approach exploits predicted articulatory information of six phonological categories (e.g., nasal, bilabial) as an intermediate step for classifying the phonemes and words, thereby finding discriminative signal responsible for natural speech synthesis. The proposed network is composed of hierarchical combination of spatial and temporal CNN cascaded with a deep autoencoder. Our best models on the KARA database achieve an average accuracy of 83.42% across the six different binary phonological classification tasks, and 53.36% for the individual token identification task, significantly outperforming our baselines. Ultimately, our work suggests the possible existence of a brain imagery footprint for the underlying articulatory movement related to different sounds that can be used to aid imagined speech decoding.
Submitted 8 April, 2019;
originally announced April 2019.
-
Deep Learning the EEG Manifold for Phonological Categorization from Active Thoughts
Authors:
Pramit Saha,
Muhammad Abdul-Mageed,
Sidney Fels
Abstract:
Speech-related Brain Computer Interfaces (BCI) aim primarily at finding an alternative vocal communication pathway for people with speaking disabilities. As a step towards full decoding of imagined speech from active thoughts, we present a BCI system for subject-independent classification of phonological categories exploiting a novel deep learning based hierarchical feature extraction scheme. To better capture the complex representation of high-dimensional electroencephalography (EEG) data, we compute the joint variability of EEG electrodes into a channel cross-covariance matrix. We then extract the spatio-temporal information encoded within the matrix using a mixed deep neural network strategy. Our model framework is composed of a convolutional neural network (CNN), a long-short term network (LSTM), and a deep autoencoder. We train the individual networks hierarchically, feeding their combined outputs in a final gradient boosting classification step. Our best models achieve an average accuracy of 77.9% across five different binary classification tasks, providing a significant 22.5% improvement over previous methods. As we also show visually, our work demonstrates that the speech imagery EEG possesses significant discriminative information about the intended articulatory movements responsible for natural speech synthesis.
Submitted 8 April, 2019;
originally announced April 2019.
-
Hierarchical Deep Feature Learning For Decoding Imagined Speech From EEG
Authors:
Pramit Saha,
Sidney Fels
Abstract:
We propose a mixed deep neural network strategy, incorporating a parallel combination of Convolutional (CNN) and Recurrent Neural Networks (RNN), cascaded with deep autoencoders and fully connected layers towards automatic identification of imagined speech from EEG. Instead of utilizing raw EEG channel data, we compute the joint variability of the channels in the form of a covariance matrix that provides spatio-temporal representations of EEG. The networks are trained hierarchically and the extracted features are passed onto the next network hierarchy until the final classification. Using a publicly available EEG based speech imagery database we demonstrate around 23.45% improvement of accuracy over the baseline method. Our approach demonstrates the promise of a mixed DNN approach for complex spatial-temporal classification problems.
Submitted 8 April, 2019;
originally announced April 2019.
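The channel covariance input described in the abstract above can be sketched with NumPy. The abstract does not give the windowing, normalisation, or array layout, so the `(n_channels, n_samples)` layout, the unbiased `n - 1` normalisation, and the name `channel_covariance` are assumptions for illustration.

```python
import numpy as np

def channel_covariance(eeg):
    # eeg: array of shape (n_channels, n_samples), one row per electrode (assumed layout).
    # Remove each channel's mean so the result measures joint variability.
    centered = eeg - eeg.mean(axis=1, keepdims=True)
    # Unbiased sample covariance between every pair of channels.
    return centered @ centered.T / (eeg.shape[1] - 1)
```

The result is a symmetric `n_channels x n_channels` matrix whose size does not depend on the recording length, which makes it a convenient fixed-size input for a downstream network.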
-
Human Computer Interaction Design for Mobile Devices Based on a Smart Healthcare Architecture
Authors:
Pu Liu,
Sidney Fels,
Nicholas West,
Matthias Görges
Abstract:
Smart and IoT-enabled mobile devices have the potential to enhance healthcare services for both patients and healthcare providers. Human computer interaction design is key to realizing a useful and usable connection between the users and these smart healthcare technologies. Appropriate design of such devices enhances the usability, improves effective operation in an integrated healthcare system, and facilitates the collaboration and information sharing between patients, healthcare providers, and institutions. In this paper, the concept of smart healthcare is introduced, including its four-layer information architecture of sensing, communication, data integration, and application. Human Computer Interaction design principles for smart healthcare mobile devices are outlined, based on user-centered design. These include: ensuring safety, providing error-resistant displays and alarms, supporting the unique relationship between patients and healthcare providers, distinguishing end-user groups, accommodating legacy devices, guaranteeing low latency, allowing for personalization, and ensuring patient privacy. Results are synthesized in design suggestions ranging from personas, scenarios, workflow, and information architecture, to prototyping, testing and iterative development. Finally, future developments in smart healthcare and Human Computer Interaction design for mobile health devices are outlined.
Submitted 10 February, 2019;
originally announced February 2019.
-
Sound-Stream II: Towards Real-Time Gesture Controlled Articulatory Sound Synthesis
Authors:
Pramit Saha,
Debasish Ray Mohapatra,
Praneeth SV,
Sidney Fels
Abstract:
We present an interface involving four degrees-of-freedom (DOF) mechanical control of a two dimensional, mid-sagittal tongue through a biomechanical toolkit called ArtiSynth and a sound synthesis engine called JASS towards articulatory sound synthesis. As a demonstration of the project, the user will learn to produce a range of JASS vocal sounds, by varying the shape and position of the ArtiSynth tongue in 2D space through a set of four force-based sensors. In other words, the user will be able to physically play around with these four sensors, thereby virtually controlling the magnitude of four selected muscle excitations of the tongue to vary articulatory structure. This variation is computed in terms of Area Functions in ArtiSynth environment and communicated to the JASS based audio-synthesizer coupled with two-mass glottal excitation model to complete this end-to-end gesture-to-sound mapping.
Submitted 19 November, 2018;
originally announced November 2018.
-
Limitations of Source-Filter Coupling In Phonation
Authors:
Debasish Ray Mohapatra,
Sidney Fels
Abstract:
The coupling of the vocal folds (source) and the vocal tract (filter) is one of the most critical factors in source-filter articulation theory. The traditional linear source-filter theory has been challenged by current research, which clearly shows the impact of acoustic loading on the dynamic behavior of vocal fold vibration as well as on the shape of the glottal flow pulses. This paper outlines the underlying mechanism of source-filter interactions and demonstrates the design and working principles of coupling for various existing vocal cord and vocal tract biomechanical models. For our study, we have considered self-oscillating lumped-element models of the acoustic source and computational models of the vocal tract as articulators. To understand the limitations of source-filter interactions associated with each of those models, we compare them in terms of their mechanical design, acoustic and physiological characteristics, and aerodynamic simulation.
Submitted 18 November, 2018;
originally announced November 2018.
-
Muscle Excitation Estimation in Biomechanical Simulation Using NAF Reinforcement Learning
Authors:
Amir H. Abdi,
Pramit Saha,
Praneeth Srungarapu,
Sidney Fels
Abstract:
Motor control is a set of time-varying muscle excitations which generate desired motions for a biomechanical system. Muscle excitations cannot be directly measured from live subjects. An alternative approach is to estimate muscle activations using inverse motion-driven simulation. In this article, we propose a deep reinforcement learning method to estimate the muscle excitations in simulated biomechanical systems. Here, we introduce a custom-made reward function that incentivizes faster point-to-point tracking of target motion. Moreover, we deploy two new techniques, namely episode-based hard update and dual buffer experience replay, to avoid feedback training loops. The proposed method is tested in four simulated 2D and 3D environments with 6 to 24 axial muscles. The results show that the models were able to learn muscle excitations for given motions after nearly 100,000 simulated steps. Moreover, the root mean square error in point-to-point reaching of the target across experiments was less than 1% of the length of the domain of motion. Our reinforcement learning method differs markedly from conventional dynamic approaches, as the muscle control is derived functionally by a set of distributed neurons. This can open paths for neural activity interpretation of this phenomenon.
Submitted 3 May, 2019; v1 submitted 17 September, 2018;
originally announced September 2018.
-
Towards Automatic Speech Identification from Vocal Tract Shape Dynamics in Real-time MRI
Authors:
Pramit Saha,
Praneeth Srungarapu,
Sidney Fels
Abstract:
Vocal tract configurations play a vital role in generating distinguishable speech sounds by modulating the airflow and creating different resonant cavities in speech production. They contain abundant information that can be utilized to better understand the underlying speech production mechanism. As a step towards automatic mapping of vocal tract shape geometry to acoustics, this paper employs effective video action recognition techniques, such as Long-term Recurrent Convolutional Network (LRCN) models, to identify different vowel-consonant-vowel (VCV) sequences from dynamic shaping of the vocal tract. Such a model typically combines a CNN-based deep hierarchical visual feature extractor with recurrent networks, which ideally makes the network spatio-temporally deep enough to learn the sequential dynamics of a short video clip for video classification tasks. We use a database consisting of 2D real-time MRI of vocal tract shaping during VCV utterances by 17 speakers. The comparative performance of this class of algorithms under various parameter settings and for various classification tasks is discussed. Interestingly, the results show a marked difference in model performance in the context of speech classification compared with generic sequence or video classification tasks.
Submitted 29 July, 2018;
originally announced July 2018.
-
Spectral Study of the Vocal Tract in Vowel Synthesis: A Comparison between 1D and 3D Acoustic Analysis
Authors:
Negar M. Harandi,
Daniel Aalto,
Antti Hannukainen,
Jarmo Malinen,
Sidney Fels
Abstract:
A state-of-the-art 1D acoustic synthesizer has been previously developed and coupled to speaker-specific biomechanical models of the oropharynx in ArtiSynth. As expected, the formant frequencies of the synthesized vowel sounds were shown to be different from those of the recorded audio. This discrepancy was hypothesized to be due to the simplified geometry of the vocal tract model as well as the one-dimensional implementation of the Navier-Stokes equations. In this paper, we calculate Helmholtz resonances of our vocal tract geometries using the 3D finite element method (FEM), and compare them with the formant frequencies obtained from the 1D method and audio. We hope this comparison helps clarify the limitations of our current models and/or speech synthesizer.
Submitted 17 December, 2015;
originally announced December 2015.