Motivation: The convolutional neural network (CNN) has been applied to the classification problem of DNA sequences, with the additional purpose of motif discovery. The training of CNNs with distributed representations of four nucleotides has successfully derived position weight matrices on the learned kernels that corresponded to sequence motifs such as protein-binding sites. Results: We propose a novel application of CNNs to classification of pairwise alignments of sequences for accurate clustering of sequences and show the benefits of the CNN method of inputting pairwise alignments for clustering of non-coding RNA (ncRNA) sequences and for motif discovery. Classification of a pairwise alignment of two sequences into positive and negative classes corresponds to the clustering of the input sequences. After we combined the distributed representation of RNA nucleotides with the secondary-structure information specific to ncRNAs and furthermore with mapping profiles of next-generation se...
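A minimal sketch of the pairwise-alignment classification idea, assuming a simple one-hot encoding of aligned columns and a small 1-D CNN in PyTorch; the encoding, network sizes and gap handling are illustrative assumptions and omit the secondary-structure and read-mapping channels described above.

```python
# Sketch only (not the authors' implementation): classify a pairwise alignment
# of two RNA sequences with a 1-D CNN; positive class = same cluster.
import torch
import torch.nn as nn

VOCAB = "ACGU-"  # four nucleotides plus the gap symbol (assumed encoding)

def encode_alignment(seq_a: str, seq_b: str) -> torch.Tensor:
    """One-hot encode each aligned column of both sequences and stack them
    channel-wise -> tensor of shape (2 * len(VOCAB), alignment_length)."""
    assert len(seq_a) == len(seq_b)
    x = torch.zeros(2 * len(VOCAB), len(seq_a))
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        x[VOCAB.index(a), i] = 1.0
        x[len(VOCAB) + VOCAB.index(b), i] = 1.0
    return x

class PairwiseAlignmentCNN(nn.Module):
    """Binary classifier over pairwise alignments (illustrative sizes)."""
    def __init__(self, in_channels=2 * len(VOCAB), n_kernels=32, kernel_size=11):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, n_kernels, kernel_size,
                              padding=kernel_size // 2)
        self.pool = nn.AdaptiveMaxPool1d(1)          # global max pooling over positions
        self.fc = nn.Linear(n_kernels, 1)

    def forward(self, x):                            # x: (batch, channels, length)
        h = torch.relu(self.conv(x))
        h = self.pool(h).squeeze(-1)
        return self.fc(h).squeeze(-1)                # logit; > 0 -> same cluster

model = PairwiseAlignmentCNN()
pair = encode_alignment("GGCA-UCGU", "GGCAAU-GU").unsqueeze(0)
print(torch.sigmoid(model(pair)))                    # untrained model, shape check only
```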
The goal of this study is to investigate advanced signal processing approaches [single frequency filtering (SFF) and zero-time windowing (ZTW)] with modern deep neural networks (DNNs) [convolutional neural networks (CNNs), temporal convolutional networks (TCNs), time-delay neural networks (TDNNs), and emphasized channel attention, propagation and aggregation in TDNN (ECAPA-TDNN)] for dialect classification of major dialects of English. Previous studies indicated that the SFF and ZTW methods provide higher spectro-temporal resolution. To capture the intrinsic variations in articulations among dialects, four feature representations [spectrogram (SPEC), cepstral coefficients, mel filter-bank energies, and mel-frequency cepstral coefficients (MFCCs)] are derived from the SFF and ZTW methods. Experiments with and without data augmentation using CNN classifiers revealed that the proposed features performed better than baseline short-time Fourier transform (STFT)-based features on the UT-Podcast d...
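As a concrete illustration of the SFF front end (the ZTW branch is not sketched here), the snippet below computes SFF temporal envelopes by frequency shifting followed by a single-pole filter near the unit circle; the pole radius r and the analysis-frequency grid are assumed values, not necessarily those used in the study.

```python
# Sketch of single frequency filtering (SFF) envelopes under assumed parameters.
import numpy as np
from scipy.signal import lfilter

def sff_envelopes(x, fs, freqs, r=0.99):
    """Return an array of shape (len(freqs), len(x)) with one temporal
    envelope per analysis frequency."""
    n = np.arange(len(x))
    env = np.empty((len(freqs), len(x)))
    for k, f in enumerate(freqs):
        # Shift frequency f to fs/2, where the single-pole filter has its pole.
        w_shift = np.pi - 2 * np.pi * f / fs
        x_shift = x * np.exp(1j * w_shift * n)
        # Single-pole filter: y[n] = -r * y[n-1] + x_shift[n]
        y = lfilter([1.0], [1.0, r], x_shift)
        env[k] = np.abs(y)
    return env

# Example: SFF "spectrogram" of a synthetic two-tone signal
fs = 16000
t = np.arange(0, 0.5, 1 / fs)
x = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 2000 * t)
envelopes = sff_envelopes(x, fs, freqs=np.arange(100, 4000, 100))
print(envelopes.shape)  # (39, 8000)
```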
In low-resource children's automatic speech recognition (ASR), performance is degraded due to the limited acoustic and speaker variability available in small datasets. In this paper, we propose a spectral warping based data augmentation method to capture more acoustic and speaker variability. This is carried out by warping the linear prediction (LP) spectra computed from speech data. The warped LP spectra, computed in a frame-based manner, are used with the corresponding LP residuals to synthesize speech that captures more variability. The proposed augmentation method is shown to improve the ASR system performance over the baseline system. We have compared the proposed method with four well-known data augmentation methods: pitch scaling, speaking rate, SpecAug and vocal tract length perturbation (VTLP), and found that the proposed method performs the best. Further, we have combined the proposed method with these existing data augmentation methods to improve the ASR system performance even m...
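A rough per-frame sketch of the spectral-warping idea, under assumed settings: estimate an LP envelope, warp its frequency axis by a factor alpha, rebuild an all-pole filter from the warped envelope, and re-excite it with the original LP residual. The warping function, LP order, frame handling and gain handling here are illustrative, not the authors' exact recipe.

```python
# Sketch of LP-spectrum warping for one frame (no gain normalization, no overlap-add).
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lp_coefficients(frame, order=16):
    """LP coefficients a (a[0] == 1) via the autocorrelation method."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])
    return np.concatenate(([1.0], -a))

def warp_frame(frame, alpha=1.1, order=16, nfft=512):
    """Warp the LP envelope of one frame by factor alpha and resynthesize it."""
    a = lp_coefficients(frame, order)
    residual = lfilter(a, [1.0], frame)                     # inverse filtering
    # LP magnitude envelope on a linear frequency grid
    w = np.linspace(0, np.pi, nfft // 2 + 1)
    H = 1.0 / np.abs(np.polyval(a[::-1], np.exp(1j * w)))   # |1 / A(e^{jw})|
    # Warp the frequency axis: sample the envelope at w / alpha
    H_warped = np.interp(np.clip(w / alpha, 0, np.pi), w, H)
    # Rebuild an all-pole filter from the warped power spectrum
    power = np.concatenate((H_warped, H_warped[-2:0:-1])) ** 2
    r_w = np.real(np.fft.ifft(power))[:order + 1]
    a_w = solve_toeplitz((r_w[:-1], r_w[:-1]), r_w[1:])
    a_warped = np.concatenate(([1.0], -a_w))
    return lfilter([1.0], a_warped, residual)               # re-excite warped envelope

frame = np.random.randn(400)                                # stand-in for a 25 ms frame
print(warp_frame(frame).shape)                              # (400,)
```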
Understanding the perception of emotions or affective states in humans is important for developing emotion-aware systems that work in realistic scenarios. In this paper, the perception of emotions in naturalistic human interaction (audio–visual data) is studied using perceptual evaluation. For this purpose, a naturalistic audio–visual emotion database collected from TV broadcasts such as soap operas and movies, called the IIIT-H Audio–Visual Emotion (IIIT-H AVE) database, is used. The database consists of audio-alone, video-alone, and audio–visual data in English. Using data of all three modes, perceptual tests are conducted for four basic emotions (angry, happy, neutral, and sad) based on category labeling and for two dimensions, namely arousal (active or passive) and valence (positive or negative), based on dimensional labeling. The results indicated that the participants' perception of emotions was remarkably different between the audio-alone, video-alone, and audio–video data. Th...
End-to-end (E2E) neural network models have shown significant performance benefits on different INTERSPEECH ComParE tasks. Prior work has applied either a single instance of an E2E model for a task or the same E2E architecture for different tasks. However, applying a single model can be unstable, and using the same architecture under-utilizes task-specific information. On the ComParE 2020 tasks, we investigate applying an ensemble of E2E models for robust performance and developing task-specific modifications for each task. ComParE 2020 introduces three sub-challenges: the breathing sub-challenge to predict the output of a respiratory belt worn by a patient while speaking, the elderly sub-challenge to estimate the elderly speaker's arousal and valence levels, and the mask sub-challenge to classify whether the speaker is wearing a mask or not. On each of these tasks, an ensemble outperforms the single E2E model. On the breathing sub-challenge, we study the impact of multi-loss strategies on task...
During the production of emotional speech, there are deviations in the components of the speech production mechanism when compared to normal speech. The objective of this study is to capture the deviations in features related to the excitation source component of speech, and to develop a system for automatic recognition of emotions based on these deviations. The emotions considered for this study are anger, happy, neutral and sad. The study shows that there is useful information in the deviations of the excitation source features at the subsegmental level, and it can be exploited to develop an emotion recognition system. A hierarchical binary decision tree approach is used for classification.
In this paper, we present a multispeaker localization method using time delay estimates obtained from spectral features derived from the single frequency filtering (SFF) representation. The mixture signals are transformed into the SFF domain, from which the temporal envelopes are extracted at each frequency. Subsequently, spectral features such as the mean and variance of the temporal envelopes across frequencies are correlated to extract the time delay estimates. Since these features emphasize the high-SNR regions of the mixtures, correlating the corresponding features across the channels leads to robust delay estimates in real acoustic environments. We study the efficacy of the developed approach by comparing its performance with existing correlation-based time delay estimation techniques. Both a standard dataset recorded in real-room acoustic environments and a simulated dataset are used for evaluation. It is observed that the localization performance of the proposed alg...
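The delay-estimation step can be illustrated with a short cross-correlation sketch: given one feature sequence per microphone channel (for example, the mean of the SFF temporal envelopes across frequencies), the time delay is taken as the lag maximizing their cross-correlation within a plausible range. The feature choice and parameters below are assumptions for illustration.

```python
# Sketch of cross-correlation-based time delay estimation between two channels.
import numpy as np

def estimate_delay(feat_ch1, feat_ch2, max_lag):
    """Lag (in samples) by which feat_ch2 is delayed relative to feat_ch1."""
    f1 = feat_ch1 - feat_ch1.mean()
    f2 = feat_ch2 - feat_ch2.mean()
    xcorr = np.correlate(f2, f1, mode="full")        # peak at the delay of f2 w.r.t. f1
    lags = np.arange(-len(f1) + 1, len(f2))
    keep = np.abs(lags) <= max_lag                   # restrict to physically plausible lags
    return lags[keep][np.argmax(xcorr[keep])]

# Example: channel 2 is channel 1 delayed by 7 samples plus a little noise
rng = np.random.default_rng(0)
feat = rng.random(2000)
delayed = np.roll(feat, 7) + 0.01 * rng.standard_normal(2000)
print(estimate_delay(feat, delayed, max_lag=40))     # expected: 7
```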
In this paper, we address the issue of speaker-specific emotion detection (neutral vs emotion) from speech signals, using models for neutral speech as reference. As emotional speech is produced by the human speech production mechanism, the emotion information is expected to lie in the features of both the excitation source and the vocal tract system. The Linear Prediction residual is used as the excitation source component and Linear Prediction Coefficients as the vocal tract system component. A pitch-synchronous analysis is performed. Separate Autoassociative Neural Network models are developed to capture the information specific to neutral speech from the excitation and the vocal tract system components. Experimental results show that the excitation source carries more information than the vocal tract system. The accuracy of neutral vs emotion classification using excitation source information is 91%, which is 8% higher than the accuracy obtained using vocal tract system information. The Berl...
In this paper, we address the issue of speech polarity detection using the strength of impulse-like excitation around the epoch. The correct detection of speech polarity is a crucial step for many speech processing algorithms to extract suitable information, and errors in polarity detection can degrade the performance of speech systems. Automatic detection of speech polarity has therefore become an important preliminary step for many speech processing algorithms. We propose a method based on the knowledge of the impulse-like excitation of the speech production mechanism. The impulse-like excitation is reflected across all frequencies, including the zero frequency (0 Hz). Using the slope around the zero crossings of the zero frequency filtered signal, an automatic speech polarity detection method is proposed. Performance of the proposed method is demonstrated on 8 different speech corpora. The proposed method is compared with three existing techniques: gradient of the spurious glottal waveforms (GSGW), oscillating moments-based polarity detection (OMPD) and residual excitation skewness (RESKEW). From the experimental results, it is observed that the performance of the proposed method is comparable to or better than that of the existing methods for the experiments considered.
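The sketch below illustrates zero-frequency filtering and a slope-based polarity decision. The ZFF computation follows the commonly used recipe (cascaded zero-frequency resonators plus repeated local-mean subtraction); the final decision rule, which compares slopes at rising versus falling zero crossings, is an assumption made for illustration and not necessarily the exact rule of the paper.

```python
# Sketch of ZFF-based polarity detection under assumed parameters and decision rule.
import numpy as np
from scipy.signal import lfilter

def zff(x, fs, win_ms=10, passes=3):
    """Zero-frequency filtered signal: cascaded 0-Hz resonators followed by
    repeated local-mean subtraction for trend removal."""
    d = np.diff(x, prepend=x[0])                      # difference to remove DC offset
    y = d
    for _ in range(2):                                # two zero-frequency resonators
        y = lfilter([1.0], [1.0, -2.0, 1.0], y)
    n = int(fs * win_ms / 1000)
    kernel = np.ones(2 * n + 1) / (2 * n + 1)
    for _ in range(passes):
        y = y - np.convolve(y, kernel, mode="same")   # trend removal
    return y

def detect_polarity(x, fs):
    z = zff(x, fs)[fs // 20: -fs // 20]               # drop edge artifacts
    slope = np.diff(z)
    rising = (z[:-1] < 0) & (z[1:] >= 0)
    falling = (z[:-1] >= 0) & (z[1:] < 0)
    # Assumed rule: steeper slopes at rising crossings -> positive polarity
    return +1 if np.abs(slope[rising]).mean() > np.abs(slope[falling]).mean() else -1

# Example on a crude synthetic voiced signal: a 120 Hz impulse train exciting
# a single resonance. Inverting the waveform flips the decision.
fs = 8000
excitation = np.zeros(fs)
excitation[::fs // 120] = 1.0
speech = lfilter([1.0], [1.0, -1.8 * np.cos(2 * np.pi * 500 / fs), 0.81], excitation)
print(detect_polarity(speech, fs), detect_polarity(-speech, fs))   # opposite signs
```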
Progress in research areas such as emotion recognition, identification and synthesis relies heavily on the development and structure of databases. This paper addresses some of the key issues in the development of emotion databases. A new audio-visual emotion (AVE) database is developed. The database consists of audio, video and audio-visual clips sourced from TV broadcasts such as movies and soap operas in the English language. The data clips are manually segregated in an emotion- and speaker-specific way. This database is developed to address emotion recognition in actual human interaction. The database is structured in such a way that it can be useful in a variety of applications, such as emotion analysis based on speaker or gender and emotion identification in multiple emotive dialogue scenarios.
In this paper, we propose spectral modification by sharpening formants and by reducing the spectral tilt to recognize children's speech with automatic speech recognition (ASR) systems developed using adult speech. In this type of mismatched condition, ASR performance is degraded due to the acoustic and linguistic mismatch between child and adult speakers. The proposed method improves speech intelligibility in order to enhance children's speech recognition with an acoustic model trained on adult speech. In the experiments, WSJCAM0 and PFSTAR are used as the databases for adults' and children's speech, respectively. The proposed technique gives a significant improvement in the context of DNN-HMM-based ASR. Furthermore, we validate the robustness of the technique by showing that it also performs well in mismatched noise conditions.
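One plausible way to realize the two modifications, sketched under assumed parameters: per-frame LP analysis, moving the LP poles closer to the unit circle to narrow formant bandwidths (formant sharpening), and a first-order high-frequency emphasis to reduce spectral tilt. The pole-radius exponent and tilt filter are illustrative choices, not the paper's exact processing.

```python
# Sketch of LP-based formant sharpening and spectral-tilt reduction for one frame.
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(frame, order=16):
    """LP coefficients a (a[0] == 1) via the autocorrelation method."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])
    return np.concatenate(([1.0], -a))

def sharpen_formants(frame, order=16, gamma=0.7, tilt=0.4):
    a = lpc(frame, order)
    residual = lfilter(a, [1.0], frame)                    # inverse filtering
    poles = np.roots(a)
    poles = poles * (np.abs(poles) ** (gamma - 1.0))       # radius r -> r**gamma (> r)
    a_sharp = np.real(np.poly(poles))                      # sharpened all-pole model
    y = lfilter([1.0], a_sharp, residual)                  # resynthesize the frame
    return lfilter([1.0, -tilt], [1.0], y)                 # mild high-frequency emphasis

frame = np.random.randn(400)                               # stand-in for a 25 ms frame
print(sharpen_formants(frame).shape)                       # (400,)
```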
Glottal source characteristics vary between phonation types due to the tension of the laryngeal muscles together with the respiratory effort. Previous studies on the classification of phonation type have mainly used speech signals recorded by microphone. Recently, two studies were published on the classification of phonation type using neck surface accelerometer (NSA) signals. However, there are no previous studies comparing the use of the acoustic speech signal vs. the NSA signal as input in classifying phonation type. Therefore, the current study investigates simultaneously recorded speech and NSA signals in the classification of three phonation types (breathy, modal, pressed). The general goal is to understand which of the two signals (speech vs. NSA) is more effective in the classification task. We hypothesize that, using the same feature set for both signals, classification accuracy is higher for the NSA signal, which is more closely related to the physical vibration of the vocal folds an...
The speech production mechanism also produces speech with different voice qualities, such as different phonation types, emotions, expressive singing and other paralinguistic sounds. These sounds owe their characteristics mostly to the excitation component (vibration of the vocal folds at the glottis), whereas the dynamic vocal tract system primarily conveys the message. Hence, excitation source processing is especially significant for the analysis, detection and representation of expressive voices. Most of the existing excitation source information extraction methods are not reliable when applied to expressive voices, mainly due to significant source–system coupling. Hence, there is a need for new signal processing methods that can capture the dynamic variations in the excitation source so that different types of sounds can be better analyzed and represented. The objective of this work is to derive new signal processing methods to extract the excitation source...
Most speech applications use mel-frequency spectral coefficients (MFSC) as features because they match the human perceptual mechanism, where the emphasis is on vocal tract characteristics. However, in accent classification, a mel-scale distribution of filters may not always be the best representation, e.g., for pitch-accented languages, where the emphasis should also be on vocal source information. Motivated by this, we use end-to-end classification of accents directly from waveforms, which reduces the effort of designing features specific to each corpus. The convolutional neural network (CNN) model architecture is designed in such a way that the initial layers perform an operation similar to MFSC extraction, by initializing their weights with a time-domain approximation of MFSC. The entire network, along with the initial layers, is trained to learn accent classification. We observed that learning directly from the waveform improved the performance of accent classification when compared to a CNN trained on hand-engine...
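A sketch of the first-layer initialization idea: each 1-D convolution kernel is set to a windowed cosine at the centre frequency of one mel filter, a crude time-domain approximation of a mel filterbank. The kernel length, stride, number of filters and Hamming window are assumptions made for illustration.

```python
# Sketch of initializing a waveform-input Conv1d layer with mel-like kernels.
import numpy as np
import torch
import torch.nn as nn

def mel_center_frequencies(n_filters, fmin, fmax):
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    return inv_mel(np.linspace(mel(fmin), mel(fmax), n_filters))

def mel_initialized_conv(n_filters=40, kernel_size=400, fs=16000):
    conv = nn.Conv1d(1, n_filters, kernel_size, stride=160, bias=False)
    n = np.arange(kernel_size)
    window = np.hamming(kernel_size)
    kernels = np.stack([window * np.cos(2 * np.pi * fc * n / fs)
                        for fc in mel_center_frequencies(n_filters, 50, fs / 2)])
    with torch.no_grad():                              # copy the mel-like init weights
        conv.weight.copy_(torch.tensor(kernels, dtype=torch.float32).unsqueeze(1))
    return conv

conv = mel_initialized_conv()
waveform = torch.randn(1, 1, 16000)                    # 1 s of audio at 16 kHz
print(conv(waveform).shape)                            # torch.Size([1, 40, 98])
```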
In this paper, we propose a new method for the accurate estimation and tracking of formants in speech signals using time-varying quasi-closed-phase (TVQCP) analysis. Conventional formant tracking methods typically adopt a two-stage estimate-and-track strategy wherein an initial set of formant candidates are estimated using short-time analysis (e.g., 10–50 ms), followed by a tracking stage based on dynamic programming or a linear state-space model. One of the main disadvantages of these approaches is that the tracking stage, however good it may be, cannot improve upon the formant estimation accuracy of the first stage. The proposed TVQCP method provides a single-stage formant tracking that combines the estimation and tracking stages into one. TVQCP analysis combines three approaches to improve formant estimation and tracking: (1) it uses temporally weighted quasi-closed-phase analysis to derive closed-phase estimates of the vocal tract with reduced interference from the excitation source, (2) it increases the residual sparsity by using the $L_1$ optimization and (3) it uses time-varying linear prediction analysis over long time windows (e.g., 100–200 ms) to impose a continuity constraint on the vocal tract model and hence on the formant trajectories. Formant tracking experiments with a wide variety of synthetic and natural speech signals show that the proposed TVQCP method performs better than conventional and popular formant tracking tools, such as Wavesurfer and Praat (based on dynamic programming), the KARMA algorithm (based on Kalman filtering), and DeepFormants (based on deep neural networks trained in a supervised manner). Matlab scripts for the proposed method can be found at: https://github.com/njaygowda/ftrack
Studies on the emotion recognition task indicate that there is confusion in discriminating among higher activation states such as 'anger' and 'happy'. In this study, features related to the excitation source of speech are examined for discriminating the 'anger' and 'happy' emotions. The objective is to explore features which are independent of lexical content, language, channel and speaker. Features such as the strength of excitation from the zero frequency filtering method and spectral band magnitude energies from short-time spectral analysis are used. Experimental results show that these features can discriminate the 'anger' and 'happy' emotion states to a good extent. Index Terms: emotion recognition, zero frequency filtering method, KL distance measure.
Current ASR systems show poor performance in recognition of children’s speech in noisy environments because recognizers are typically trained with clean adults’ speech and therefore there are two mismatches between training and testing phases (i.e., clean speech in training vs. noisy speech in testing and adult speech in training vs. child speech in testing). This article studies methods to tackle the effects of these two mismatches in recognition of noisy children’s speech by investigating two techniques: data augmentation and time-scale modification. In the former, clean training data of adult speakers are corrupted with additive noise in order to obtain training data that better correspond to the noisy testing conditions. In the latter, the fundamental frequency (F0) and speaking rate of children’s speech are modified in the testing phase in order to reduce differences in the prosodic characteristics between the testing data of child speakers and the training data of adult speake...
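The additive-noise part of the augmentation can be sketched as mixing noise into the clean adult training data at a chosen SNR; the noise source and SNR value below are assumptions, and the F0 and speaking-rate modifications are not sketched here.

```python
# Sketch of additive-noise data augmentation at a target SNR.
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Mix `noise` into `clean` so the result has the requested SNR (in dB)."""
    if len(noise) < len(clean):                        # loop the noise if it is too short
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise

rng = np.random.default_rng(1)
clean = rng.standard_normal(16000)                     # stand-in for a 1 s utterance
noisy = add_noise_at_snr(clean, rng.standard_normal(8000), snr_db=10)
```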
Emotional speech is produced when a speaker is in a state different from the normal state. The objective of this study is to explore the deviations in the excitation source features of emotional speech compared to normal speech. The features used for analysis are extracted at the subsegmental level (1-3 ms) of speech. A comparative study of these features across different emotions indicates that there are significant deviations in the subsegmental-level features of speech in the emotional state when compared to the normal state.