Skip to main content

nayeemulla khan

VIT University, SCSE, Faculty Member

Followers

23

Following

2

Mentions

1

Public Views

Address: Vellore, Tamil Nadu, India

less

Interests

Uploads

Papers by nayeemulla khan

A new multi-stream approach using acoustic and visual features for robust speech recognition system

Materials Today: Proceedings, 2022

A CNN based Speaker Recognition System using an Alternate Bone Microphone

International Journal of Innovative Technology and Exploring Engineering, 2019

State-of-art speaker recognition system uses acoustic microphone speech to identify/verify a spea... more State-of-art speaker recognition system uses acoustic microphone speech to identify/verify a speaker. The multimodal speaker recognition system includes modality of input data recorded using sources like acoustics mic,array mic ,throat mic, bone mic and video recorder. In this paper we implemented a multi-modal speaker identification system with three modality of speech as input, recorded from different microphones like air mic, throat mic and bone mic . we propose and claim an alternate way of recording the bone speech using a throat microphone and the results of a implemented speaker recognition using CNN and spectrogram is presented. The obtained results supports our claim to use the throat microphone as suitable mic to record the bone conducted speech and the accuracy of the speaker recognition system with signal speech recorded from air microphone get improved about 10% after including the other modality of speech like throat and bone speech along with the air conducted speech.

Improving Recognition of Speech System Using Multimodal Approach

International Conference on Innovative Computing and Communications, 2018

Building an ASR system in adverse conditions is a challenging task. The performance of the ASR sy... more Building an ASR system in adverse conditions is a challenging task. The performance of the ASR system is high in clean environments. However, the variabilities such as speaker effect, transmission effect, and the environmental conditions degrade the recognition performance of the system. One way to enhance the robustness of ASR system is to use multiple sources of information about speech. In this work, two sources of additional information on speech are used to build a multimodal ASR system. A throat microphone speech and visual lip reading which is less susceptible to noise acts as alternate sources of information. Mel-frequency cepstral features are extracted from the throat signal and modeled by HMM. Pixel-based transformation methods (DCT and DWT) are used to extract the features from the viseme of the video data and modeled by HMM. Throat and visual features are combined at the feature level. The proposed system has improved recognition accuracy compared to unimodals. The digit database for the English language is used for the study. The experiments are carried out for both unimodal systems and the combined systems. The combined feature of normal and throat microphone gives 86.5% recognition accuracy. Visual speech features with the normal microphone combination produce 84% accuracy. The proposed work (combines normal, throat, and visual features) shows 94% recognition accuracy which is better compared to unimodal and bimodoal ASR systems.

Enhancing Noisy Speech using WEMD

Understanding Lombard speech: a review of compensation techniques towards improving speech based recognition systems

Artificial Intelligence Review, 2020

Building voice-based Artificial Intelligence (AI) systems that can efficiently interact with huma... more Building voice-based Artificial Intelligence (AI) systems that can efficiently interact with humans through speech has become plausible today due to rapid strides in efficient data-driven AI techniques. Such a human–machine voice interaction in real world would often involve a noisy ambience, where humans tend to speak with additional vocal effort than in a quiet ambience, to mitigate the noise-induced suppression of vocal self-feedback. This noise induced change in the vocal effort is called Lombard speech. In order to build intelligent conversational devices that can operate in a noisy ambience, it is imperative to study the characteristics and processing of Lombard speech. Though the progress of research on Lombard speech started several decades ago, it needs to be explored further in the current scenario which is seeing an explosion of voice-driven applications. The system designed to work with normal speech spoken in a quiet ambience fails to provide the same performance in changing environmental contexts. Different contexts lead to different styles of Lombard speech and hence there arises a need for efficient ways of handling variations in speaking styles in noise. The Lombard speech is also more intelligible than normal speech of a speaker. Applications like public announcement systems with speech output interface should talk with varying degrees of vocal effort to enhance naturalness in a way that humans adapt to speak in noise, in real time. This review article is an attempt to summarize the progress of work on the possible ways of processing Lombard speech to build smart and robust human–machine interactive systems with speech input–output interface, irrespective of operating environmental contexts, for different application needs. This article is a comprehensive review of the studies on Lombard speech, highlighting the key differences observed in acoustic and perceptual analysis of Lombard speech and detailing the Lombard effect compensation methods towards improving the robustness of speech based recognition systems.

Advertisement Recommendation Engine – Improving YouTube Advertisement Services

Regular, 2020

Ever since its early inception in the year 2005, YouTube has been growing exponentially in terms ... more Ever since its early inception in the year 2005, YouTube has been growing exponentially in terms of personnel and popularity, to provide video streaming services that allow users to freely utilize the platform. Initiating an advertisementbased revenue system to monetize the site by the year 2007, the Google Inc. based company has been improving the system to provide the users with advertisements on them. In this article, 7 recommendation engines are developed and compared with each other, to determine the efficiency and the user specificity of each engine. From the experiments and user-based testing conducted, it is observed that the engine that recommends advertisements utilizing the objects and the texts recognized, along with the video watch history, performs the best, by recommending the most relevant advertisements in 90% of the testing scenario.

Visual Speech Recognition using Fusion of Motion and Geometric Features

Procedia Computer Science, 2020

Epileptic Seizure Prediction Through Machine Learning and Spatio-Temporal Features Based Time Series Analysis of Intracranial Electroencephalogram Data

International Journal of Engineering and Advanced Technology, 2019

Epilepsy is a group of neurological disorders identifiable by infrequent but recurrent seizures. ... more Epilepsy is a group of neurological disorders identifiable by infrequent but recurrent seizures. Seizure prediction is widely recognized as a significant problem in the neuroscience domain. Developing a Brain-Computer Interface (BCI) for seizure prediction can provide an alert to the patient, providing a buffer time to get the necessary emergency medication or at least be able to call for help, thus improving the quality of life of the patients. A considerable number of clinical studies presented evidence of symptoms (patterns) before seizure episodes and thus, there is large research on seizure prediction, however, there is very little existing literature that illustrates the use of structured processes in machine learning for predicting seizures. Limited training data and class imbalance (EEG segments corresponding to preictal phase, the duration just before the seizure, to about an hour prior to the episode, are usually in a tiny minority) are a few challenges that need to be add...

An analysis of the effect of combining standard and alternate sensor signals on recognition of syllabic units for multimodal speech recognition

Pattern Recognition Letters, 2017

A person identification system combining recognition of face and lip-read passwords

2015 International Conference on Computing and Network Communications (CoCoNet), 2015

This paper presents a person identification system which combines recognition of facial features ... more This paper presents a person identification system which combines recognition of facial features as well as spoken word using visual features alone. It incorporates a face recognition algorithm to identify the person, followed by spoken word recognition of `lip-read' password. For face recognition, PCA is used for feature extraction, followed by a KNN based classification on the reduced dimensionality features. Spoken word recognition of passwords is performed using a Visual Lip reading (Visual ASR). The visual features corresponding to the spoken word is extracted using DWT, which are then recognized using a HMM based approaches. Since evidences from face recognition and visual lip reading could be complementary in nature, the scores from the two modalities are combined. Based on the combined evidences, decision making is for person identification is carried out. The performance for face identification is 90% while the accuracy for visual speech recognition is 72%. By combining these evidences, an improved accuracy 98% is achieved.

An Improved Visual Speech Recognition of Isolated Words using Combined Pixel and Geometric Features

Indian Journal of Science and Technology, 2016

Automated Health Monitoring Through Emotion Identification

Advances in Intelligent Systems and Computing, 2015

Emotional health refers to the overall psychological well-being of a person. Prolonged disturbanc... more Emotional health refers to the overall psychological well-being of a person. Prolonged disturbances in the emotional state of an individual can affect their health and if left unchecked could lead to serious health disorders. Monitoring the emotional well-being of the individual becomes a vital component in the health administration. Speech and physiological signals like heart rate are affected by emotion and can be used to identify the current emotional state of a person. Combining evidences from these complementary signals would help in better discrimination of emotions. This paper proposes a multimodel approach to identify emotion using a close-talk microphone and a heart rate sensor to record the speech and heart rate parameters, respectively. Feature selection is performed on the feature set comprising features extracted from speech-like pitch, Mel-frequency cepstral coefficients, formants, jitter, shimmer, and heart beat parameters like heart rate, mean, standard deviation, root mean square of interbeat intervals, heart rate variability, etc. Emotion is individually modeled as a weighted combination of speech features and heart rate features. The performance of the models is evaluated. Score-based late fusion is used to combine the two models and to improve recognition accuracy. The combination shows improvement in performance.

Autoassociative neural networks for discrimination of chronic alcoholics using Visual Evoked Potentials

2008 International Conference on Computing, Communication and Networking, 2008

The discrimination of asymptomatic chronic alcoholics from non-alcoholics using the brain activit... more The discrimination of asymptomatic chronic alcoholics from non-alcoholics using the brain activity patterns is studied in this paper. Detection of the abnormalities in the cognitive processing ability of chronic alcoholics is essential for their rehabilitation, and also in screening them for certain jobs. The brain patterns evoked in response to visual stimuli, known as visual evoked potentials (VEP), reflect the

Identification of Duplication in Questions Posed on Knowledge Sharing Platform Quora using Machine Learning Techniques

International Journal of Innovative Technology and Exploring Engineering, 2019

Quora, an online question-answering platform has a lot of duplicate questions i.e. questions that... more Quora, an online question-answering platform has a lot of duplicate questions i.e. questions that convey the same meaning. Since it is open to all users, anyone can pose a question any number of times this increases the count of duplicate questions. This paper uses a dataset comprising of question pairs (taken from the Quora website) in different columns with an indication of whether the pair of questions are duplicates or not. Traditional comparison methods like Sequence matcher perform a letter by letter comparison without understanding the contextual information, hence they give lower accuracy. Machine learning methods predict the similarity using features extracted from the context. Both the traditional methods as well as the machine learning methods were compared in this study. The features for the machine learning methods are extracted using the Bag of Words models- Count-Vectorizer and TFIDF-Vectorizer. Among the traditional comparison methods, Sequence matcher gave the highe...

Latent Semantic Analysis for Speaker Recognition

ISCA, Oct 4, 2004

A Study on Alternative Speech Sensor

2018 International Conference on Computer, Communication, and Signal Processing (ICCCSP)

This paper presents a study on alternative speech sensor for speech processing applications. Nois... more This paper presents a study on alternative speech sensor for speech processing applications. Noise robustness is one of the major considerations in speech processing systems. In presence of noise, speech signal renders unintelligible naturally and thus degrades the performance of automatic speech recognition systems. Close-talk microphone perfums well for clean speech signals. The close-talk microphone based recognition performance fails under real non-stationary conditions and also degraded strongly by the background noise. One way of improving such a system performance is by the use of alternative sensors, which are attached to the speaker's skin and receive the uttered speech through throat or bones. There are two types of sensors namely alternative acoustic and non-acoustic sensors. First, alternative acoustic sensors are more isolated from environmental noise and pick up the speech signal in a robust manner. Second is to develop noisy robust speech recognition system using Multi-sensor approach. This approach combines the information from different sources of acoustic speech sensors. The thirdinvolves non-acoustic speech sensors, which are primarily used for speaker identification task and some speech recognition applications. Fourth, discusses the speech enhancement methods for the noisy speech signal. These approaches help to improve the speech recognition system in noisy conditions and lead to building a robust ASR system.

Exploring Features for Audio Indexing

The objective of this paper is to emphasize the need for identifying and extracting suitable feat... more The objective of this paper is to emphasize the need for identifying and extracting suitable features for storage and efficient retrieval of data in the context of audio indexing. There are a wide variety of audio indexing tasks such as, identifying the speaker, language, music of different classes (Pop, Classical, Jazz, etc.), music from dierent instruments and classification of audio clips (commentary, news, football, advertisement etc.). Each type of task requires features specific to that task. Normally gross and measurable parameters or features based on amplitude, zero-crossing, bandwidth, band energy in the sub-bands, spectrum and periodicity properties are used in audio indexing applications. But it is shown that perceptually significant information of the audio data is present in the form of sequence of events. It is a challenge to extract features from these sequences of events. In this paper we demonstrate through various illustrations, the importance of residual data obt...

Speaker Recognition using Random Forest

ITM Web of Conferences

Speaker identification has become a mainstream technology in the field of machine learning that i... more Speaker identification has become a mainstream technology in the field of machine learning that involves determining the identity of a speaker from his/her speech sample. A person’s speech note contains many features that can be used to discriminate his/her identity. A model that can identify a speaker has wide applications such as biometric authentication, security, forensics and human-machine interaction. This paper implements a speaker identification system based on Random Forest as a classifier to identify the various speakers using MFCC and RPS as feature extraction techniques. The output obtained from the Random Forest classifier shows promising result. It is observed that the accuracy level is significantly higher in MFCC as compared to the RPS technique on the data taken from the well-known TIMIT corpus dataset.

Spectral Transformation of Lombard Speech to Normal Speech for Speaker Recognition Systems

In a noisy environment, the speaker tends to increase his/her vocal effort due to the hindrance i... more In a noisy environment, the speaker tends to increase his/her vocal effort due to the hindrance in the auditory self-feedback, in order to ensure effective communication. This is called Lombard effect. Lombard effect degrades the performance of speech systems that are built using normal speech due to the mismatch between the test (Lombard speech) data and training (normal speech) data. This study proposes a spectral transformation technique that maps the weighted Linear Prediction Cepstral Coefficient (wLPCC) features of the Lombard speech to that of normal speech using Multi-Layer Feed Forward Neural Network (MLFFNN). The efficiency of mapping is objectively tested using the Itakura distance metric. A text independent speaker recognition system is built in the Gaussian Mixture Model (GMM) framework using the normal speech as training data, to test the effectiveness of mapping technique. While the performance of the system when tested with Lombard speech drops to 71%, it improves si...

Exploring Features for Audio

The objective of this paper is to emphasize the need for identifying and extracting suitable feat... more The objective of this paper is to emphasize the need for identifying and extracting suitable features for storage and efficient retrieval of data in the context of audio indexing. There are a wide variety of audio indexing tasks such as, identifying the speaker, language, music of different classes (Pop, Classical, Jazz, etc.), music from dierent instruments and classification of audio clips (commentary, news, football, advertisement etc.). Each type of task requires features specific to that task. Normally gross and measurable parameters or features based on amplitude, zero-crossing, bandwidth, band energy in the sub-bands, spectrum and periodicity properties are used in audio indexing applications. But it is shown that perceptually significant information of the audio data is present in the form of sequence of events. It is a challenge to extract features from these sequences of events. In this paper we demonstrate through various illustrations, the importance of residual data obtained after removing the predictable part in the audio data. This residual data seems to contain perceptually significant information in the audio data. But it is difficult to extract that information using known signal processing algorithms.

A new multi-stream approach using acoustic and visual features for robust speech recognition system

Materials Today: Proceedings, 2022

A CNN based Speaker Recognition System using an Alternate Bone Microphone

International Journal of Innovative Technology and Exploring Engineering, 2019

State-of-art speaker recognition system uses acoustic microphone speech to identify/verify a spea... more State-of-art speaker recognition system uses acoustic microphone speech to identify/verify a speaker. The multimodal speaker recognition system includes modality of input data recorded using sources like acoustics mic,array mic ,throat mic, bone mic and video recorder. In this paper we implemented a multi-modal speaker identification system with three modality of speech as input, recorded from different microphones like air mic, throat mic and bone mic . we propose and claim an alternate way of recording the bone speech using a throat microphone and the results of a implemented speaker recognition using CNN and spectrogram is presented. The obtained results supports our claim to use the throat microphone as suitable mic to record the bone conducted speech and the accuracy of the speaker recognition system with signal speech recorded from air microphone get improved about 10% after including the other modality of speech like throat and bone speech along with the air conducted speech.

Improving Recognition of Speech System Using Multimodal Approach

International Conference on Innovative Computing and Communications, 2018

Building an ASR system in adverse conditions is a challenging task. The performance of the ASR sy... more Building an ASR system in adverse conditions is a challenging task. The performance of the ASR system is high in clean environments. However, the variabilities such as speaker effect, transmission effect, and the environmental conditions degrade the recognition performance of the system. One way to enhance the robustness of ASR system is to use multiple sources of information about speech. In this work, two sources of additional information on speech are used to build a multimodal ASR system. A throat microphone speech and visual lip reading which is less susceptible to noise acts as alternate sources of information. Mel-frequency cepstral features are extracted from the throat signal and modeled by HMM. Pixel-based transformation methods (DCT and DWT) are used to extract the features from the viseme of the video data and modeled by HMM. Throat and visual features are combined at the feature level. The proposed system has improved recognition accuracy compared to unimodals. The digit database for the English language is used for the study. The experiments are carried out for both unimodal systems and the combined systems. The combined feature of normal and throat microphone gives 86.5% recognition accuracy. Visual speech features with the normal microphone combination produce 84% accuracy. The proposed work (combines normal, throat, and visual features) shows 94% recognition accuracy which is better compared to unimodal and bimodoal ASR systems.

Enhancing Noisy Speech using WEMD

Understanding Lombard speech: a review of compensation techniques towards improving speech based recognition systems

Artificial Intelligence Review, 2020

Building voice-based Artificial Intelligence (AI) systems that can efficiently interact with huma... more Building voice-based Artificial Intelligence (AI) systems that can efficiently interact with humans through speech has become plausible today due to rapid strides in efficient data-driven AI techniques. Such a human–machine voice interaction in real world would often involve a noisy ambience, where humans tend to speak with additional vocal effort than in a quiet ambience, to mitigate the noise-induced suppression of vocal self-feedback. This noise induced change in the vocal effort is called Lombard speech. In order to build intelligent conversational devices that can operate in a noisy ambience, it is imperative to study the characteristics and processing of Lombard speech. Though the progress of research on Lombard speech started several decades ago, it needs to be explored further in the current scenario which is seeing an explosion of voice-driven applications. The system designed to work with normal speech spoken in a quiet ambience fails to provide the same performance in changing environmental contexts. Different contexts lead to different styles of Lombard speech and hence there arises a need for efficient ways of handling variations in speaking styles in noise. The Lombard speech is also more intelligible than normal speech of a speaker. Applications like public announcement systems with speech output interface should talk with varying degrees of vocal effort to enhance naturalness in a way that humans adapt to speak in noise, in real time. This review article is an attempt to summarize the progress of work on the possible ways of processing Lombard speech to build smart and robust human–machine interactive systems with speech input–output interface, irrespective of operating environmental contexts, for different application needs. This article is a comprehensive review of the studies on Lombard speech, highlighting the key differences observed in acoustic and perceptual analysis of Lombard speech and detailing the Lombard effect compensation methods towards improving the robustness of speech based recognition systems.

Advertisement Recommendation Engine – Improving YouTube Advertisement Services

Regular, 2020

Ever since its early inception in the year 2005, YouTube has been growing exponentially in terms ... more Ever since its early inception in the year 2005, YouTube has been growing exponentially in terms of personnel and popularity, to provide video streaming services that allow users to freely utilize the platform. Initiating an advertisementbased revenue system to monetize the site by the year 2007, the Google Inc. based company has been improving the system to provide the users with advertisements on them. In this article, 7 recommendation engines are developed and compared with each other, to determine the efficiency and the user specificity of each engine. From the experiments and user-based testing conducted, it is observed that the engine that recommends advertisements utilizing the objects and the texts recognized, along with the video watch history, performs the best, by recommending the most relevant advertisements in 90% of the testing scenario.

Visual Speech Recognition using Fusion of Motion and Geometric Features

Procedia Computer Science, 2020

Epileptic Seizure Prediction Through Machine Learning and Spatio-Temporal Features Based Time Series Analysis of Intracranial Electroencephalogram Data

International Journal of Engineering and Advanced Technology, 2019

Epilepsy is a group of neurological disorders identifiable by infrequent but recurrent seizures. ... more Epilepsy is a group of neurological disorders identifiable by infrequent but recurrent seizures. Seizure prediction is widely recognized as a significant problem in the neuroscience domain. Developing a Brain-Computer Interface (BCI) for seizure prediction can provide an alert to the patient, providing a buffer time to get the necessary emergency medication or at least be able to call for help, thus improving the quality of life of the patients. A considerable number of clinical studies presented evidence of symptoms (patterns) before seizure episodes and thus, there is large research on seizure prediction, however, there is very little existing literature that illustrates the use of structured processes in machine learning for predicting seizures. Limited training data and class imbalance (EEG segments corresponding to preictal phase, the duration just before the seizure, to about an hour prior to the episode, are usually in a tiny minority) are a few challenges that need to be add...

An analysis of the effect of combining standard and alternate sensor signals on recognition of syllabic units for multimodal speech recognition

Pattern Recognition Letters, 2017

A person identification system combining recognition of face and lip-read passwords

2015 International Conference on Computing and Network Communications (CoCoNet), 2015

This paper presents a person identification system which combines recognition of facial features ... more This paper presents a person identification system which combines recognition of facial features as well as spoken word using visual features alone. It incorporates a face recognition algorithm to identify the person, followed by spoken word recognition of `lip-read' password. For face recognition, PCA is used for feature extraction, followed by a KNN based classification on the reduced dimensionality features. Spoken word recognition of passwords is performed using a Visual Lip reading (Visual ASR). The visual features corresponding to the spoken word is extracted using DWT, which are then recognized using a HMM based approaches. Since evidences from face recognition and visual lip reading could be complementary in nature, the scores from the two modalities are combined. Based on the combined evidences, decision making is for person identification is carried out. The performance for face identification is 90% while the accuracy for visual speech recognition is 72%. By combining these evidences, an improved accuracy 98% is achieved.

An Improved Visual Speech Recognition of Isolated Words using Combined Pixel and Geometric Features

Indian Journal of Science and Technology, 2016

Automated Health Monitoring Through Emotion Identification

Advances in Intelligent Systems and Computing, 2015

Emotional health refers to the overall psychological well-being of a person. Prolonged disturbanc... more Emotional health refers to the overall psychological well-being of a person. Prolonged disturbances in the emotional state of an individual can affect their health and if left unchecked could lead to serious health disorders. Monitoring the emotional well-being of the individual becomes a vital component in the health administration. Speech and physiological signals like heart rate are affected by emotion and can be used to identify the current emotional state of a person. Combining evidences from these complementary signals would help in better discrimination of emotions. This paper proposes a multimodel approach to identify emotion using a close-talk microphone and a heart rate sensor to record the speech and heart rate parameters, respectively. Feature selection is performed on the feature set comprising features extracted from speech-like pitch, Mel-frequency cepstral coefficients, formants, jitter, shimmer, and heart beat parameters like heart rate, mean, standard deviation, root mean square of interbeat intervals, heart rate variability, etc. Emotion is individually modeled as a weighted combination of speech features and heart rate features. The performance of the models is evaluated. Score-based late fusion is used to combine the two models and to improve recognition accuracy. The combination shows improvement in performance.

Autoassociative neural networks for discrimination of chronic alcoholics using Visual Evoked Potentials

2008 International Conference on Computing, Communication and Networking, 2008

The discrimination of asymptomatic chronic alcoholics from non-alcoholics using the brain activit... more The discrimination of asymptomatic chronic alcoholics from non-alcoholics using the brain activity patterns is studied in this paper. Detection of the abnormalities in the cognitive processing ability of chronic alcoholics is essential for their rehabilitation, and also in screening them for certain jobs. The brain patterns evoked in response to visual stimuli, known as visual evoked potentials (VEP), reflect the

Identification of Duplication in Questions Posed on Knowledge Sharing Platform Quora using Machine Learning Techniques

International Journal of Innovative Technology and Exploring Engineering, 2019

Quora, an online question-answering platform has a lot of duplicate questions i.e. questions that... more Quora, an online question-answering platform has a lot of duplicate questions i.e. questions that convey the same meaning. Since it is open to all users, anyone can pose a question any number of times this increases the count of duplicate questions. This paper uses a dataset comprising of question pairs (taken from the Quora website) in different columns with an indication of whether the pair of questions are duplicates or not. Traditional comparison methods like Sequence matcher perform a letter by letter comparison without understanding the contextual information, hence they give lower accuracy. Machine learning methods predict the similarity using features extracted from the context. Both the traditional methods as well as the machine learning methods were compared in this study. The features for the machine learning methods are extracted using the Bag of Words models- Count-Vectorizer and TFIDF-Vectorizer. Among the traditional comparison methods, Sequence matcher gave the highe...

Latent Semantic Analysis for Speaker Recognition

ISCA, Oct 4, 2004

A Study on Alternative Speech Sensor

2018 International Conference on Computer, Communication, and Signal Processing (ICCCSP)

This paper presents a study on alternative speech sensor for speech processing applications. Nois... more This paper presents a study on alternative speech sensor for speech processing applications. Noise robustness is one of the major considerations in speech processing systems. In presence of noise, speech signal renders unintelligible naturally and thus degrades the performance of automatic speech recognition systems. Close-talk microphone perfums well for clean speech signals. The close-talk microphone based recognition performance fails under real non-stationary conditions and also degraded strongly by the background noise. One way of improving such a system performance is by the use of alternative sensors, which are attached to the speaker's skin and receive the uttered speech through throat or bones. There are two types of sensors namely alternative acoustic and non-acoustic sensors. First, alternative acoustic sensors are more isolated from environmental noise and pick up the speech signal in a robust manner. Second is to develop noisy robust speech recognition system using Multi-sensor approach. This approach combines the information from different sources of acoustic speech sensors. The thirdinvolves non-acoustic speech sensors, which are primarily used for speaker identification task and some speech recognition applications. Fourth, discusses the speech enhancement methods for the noisy speech signal. These approaches help to improve the speech recognition system in noisy conditions and lead to building a robust ASR system.

Exploring Features for Audio Indexing

The objective of this paper is to emphasize the need for identifying and extracting suitable feat... more The objective of this paper is to emphasize the need for identifying and extracting suitable features for storage and efficient retrieval of data in the context of audio indexing. There are a wide variety of audio indexing tasks such as, identifying the speaker, language, music of different classes (Pop, Classical, Jazz, etc.), music from dierent instruments and classification of audio clips (commentary, news, football, advertisement etc.). Each type of task requires features specific to that task. Normally gross and measurable parameters or features based on amplitude, zero-crossing, bandwidth, band energy in the sub-bands, spectrum and periodicity properties are used in audio indexing applications. But it is shown that perceptually significant information of the audio data is present in the form of sequence of events. It is a challenge to extract features from these sequences of events. In this paper we demonstrate through various illustrations, the importance of residual data obt...

Speaker Recognition using Random Forest

ITM Web of Conferences

Speaker identification has become a mainstream technology in the field of machine learning that i... more Speaker identification has become a mainstream technology in the field of machine learning that involves determining the identity of a speaker from his/her speech sample. A person’s speech note contains many features that can be used to discriminate his/her identity. A model that can identify a speaker has wide applications such as biometric authentication, security, forensics and human-machine interaction. This paper implements a speaker identification system based on Random Forest as a classifier to identify the various speakers using MFCC and RPS as feature extraction techniques. The output obtained from the Random Forest classifier shows promising result. It is observed that the accuracy level is significantly higher in MFCC as compared to the RPS technique on the data taken from the well-known TIMIT corpus dataset.

Spectral Transformation of Lombard Speech to Normal Speech for Speaker Recognition Systems

In a noisy environment, the speaker tends to increase his/her vocal effort due to the hindrance i... more In a noisy environment, the speaker tends to increase his/her vocal effort due to the hindrance in the auditory self-feedback, in order to ensure effective communication. This is called Lombard effect. Lombard effect degrades the performance of speech systems that are built using normal speech due to the mismatch between the test (Lombard speech) data and training (normal speech) data. This study proposes a spectral transformation technique that maps the weighted Linear Prediction Cepstral Coefficient (wLPCC) features of the Lombard speech to that of normal speech using Multi-Layer Feed Forward Neural Network (MLFFNN). The efficiency of mapping is objectively tested using the Itakura distance metric. A text independent speaker recognition system is built in the Gaussian Mixture Model (GMM) framework using the normal speech as training data, to test the effectiveness of mapping technique. While the performance of the system when tested with Lombard speech drops to 71%, it improves si...

Exploring Features for Audio

The objective of this paper is to emphasize the need for identifying and extracting suitable feat... more The objective of this paper is to emphasize the need for identifying and extracting suitable features for storage and efficient retrieval of data in the context of audio indexing. There are a wide variety of audio indexing tasks such as, identifying the speaker, language, music of different classes (Pop, Classical, Jazz, etc.), music from dierent instruments and classification of audio clips (commentary, news, football, advertisement etc.). Each type of task requires features specific to that task. Normally gross and measurable parameters or features based on amplitude, zero-crossing, bandwidth, band energy in the sub-bands, spectrum and periodicity properties are used in audio indexing applications. But it is shown that perceptually significant information of the audio data is present in the form of sequence of events. It is a challenge to extract features from these sequences of events. In this paper we demonstrate through various illustrations, the importance of residual data obtained after removing the predictable part in the audio data. This residual data seems to contain perceptually significant information in the audio data. But it is difficult to extract that information using known signal processing algorithms.