Nayeemulla Khan
Vellore, Tamil Nadu, India
VIT University, SCSE, Faculty Member
State-of-the-art speaker recognition systems use acoustic microphone speech to identify or verify a speaker. A multimodal speaker recognition system includes input data recorded from sources such as an acoustic microphone, array microphone, throat microphone, bone microphone and video recorder. In this paper we implement a multimodal speaker identification system with three modalities of speech as input, recorded from different microphones: an air microphone, a throat microphone and a bone microphone. We propose an alternate way of recording bone-conducted speech using a throat microphone, and present the results of a speaker recognition system implemented using CNNs on spectrograms. The obtained results support our claim that the throat microphone is a suitable sensor for recording bone-conducted speech, and the accuracy of the speaker recognition system built on speech from the air microphone alone improves by about 10% after including the other modalities, namely throat and bone speech, along with the air-conducted speech.
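To make the spectrogram-plus-CNN pipeline concrete, here is a minimal sketch in Python, assuming SciPy and TensorFlow/Keras; the sample rate, window sizes, layer widths and number of speakers are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch: log-spectrogram input to a small CNN speaker classifier.
# Sample rate, window sizes, layer widths and n_speakers are illustrative
# assumptions, not the paper's configuration.
import numpy as np
from scipy.signal import spectrogram
import tensorflow as tf

def log_spectrogram(wav, sr=16000, nperseg=400, noverlap=240):
    """Return a log-magnitude spectrogram (freq x time) of one utterance."""
    _, _, sxx = spectrogram(wav, fs=sr, nperseg=nperseg, noverlap=noverlap)
    return np.log(sxx + 1e-10)

def build_cnn(input_shape, n_speakers):
    """Small 2-D CNN over the spectrogram 'image'."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(n_speakers, activation="softmax"),
    ])

# Example with synthetic data: 8 speakers, 1-second clips from one microphone.
X = np.stack([log_spectrogram(np.random.randn(16000)) for _ in range(32)])
X = X[..., np.newaxis]                      # add channel axis
y = np.random.randint(0, 8, size=32)
model = build_cnn(X.shape[1:], n_speakers=8)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, epochs=1, verbose=0)
```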
Building an ASR system in adverse conditions is a challenging task. The performance of an ASR system is high in clean environments. However, variabilities such as speaker effects, transmission effects, and environmental conditions degrade the recognition performance of the system. One way to enhance the robustness of an ASR system is to use multiple sources of information about speech. In this work, two additional sources of information on speech are used to build a multimodal ASR system: throat microphone speech and visual lip reading, both of which are less susceptible to noise, act as alternate sources of information. Mel-frequency cepstral features are extracted from the throat signal and modeled by HMMs. Pixel-based transformation methods (DCT and DWT) are used to extract features from the visemes of the video data, which are also modeled by HMMs. Throat and visual features are combined at the feature level. The proposed system improves recognition accuracy compared to the unimodal systems. A digit database for the English language is used for the study. Experiments are carried out for both the unimodal systems and the combined systems. The combined features of the normal and throat microphones give 86.5% recognition accuracy. Visual speech features combined with the normal microphone produce 84% accuracy. The proposed system (combining normal, throat, and visual features) shows 94% recognition accuracy, which is better than the unimodal and bimodal ASR systems.
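A minimal sketch of the feature-level fusion step is shown below, assuming librosa for the throat-microphone MFCCs and SciPy for the 2-D DCT of lip-region frames; the feature dimensions and the crude frame alignment are illustrative, not the paper's exact setup.

```python
# Minimal sketch of feature-level fusion: per-frame throat-mic MFCCs
# concatenated with 2-D DCT coefficients of the corresponding lip-region
# frame.  Feature dimensions and frame alignment are illustrative assumptions.
import numpy as np
import librosa
from scipy.fft import dctn

def throat_mfcc(wav, sr=16000, n_mfcc=13):
    """MFCCs of the throat-microphone signal, shape (frames, n_mfcc)."""
    return librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=n_mfcc).T

def lip_dct_features(lip_frame, keep=6):
    """Low-order 2-D DCT coefficients of one grey-scale lip image."""
    coeffs = dctn(lip_frame, norm="ortho")
    return coeffs[:keep, :keep].ravel()

def fuse(audio_feats, visual_feats):
    """Concatenate audio and visual features frame by frame."""
    n = min(len(audio_feats), len(visual_feats))   # crude alignment
    return np.hstack([audio_feats[:n], visual_feats[:n]])

wav = np.random.randn(16000).astype(np.float32)    # 1 s of throat speech
lips = np.random.rand(25, 32, 32)                  # 25 lip-region frames
fused = fuse(throat_mfcc(wav), np.array([lip_dct_features(f) for f in lips]))
print(fused.shape)   # (frames, 13 + 36) combined feature vectors for the HMM
```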
Building voice-based Artificial Intelligence (AI) systems that can efficiently interact with humans through speech has become plausible today due to rapid strides in efficient data-driven AI techniques. Such human-machine voice interaction in the real world would often involve a noisy ambience, where humans tend to speak with additional vocal effort than in a quiet ambience, to mitigate the noise-induced suppression of vocal self-feedback. Speech produced with this noise-induced change in vocal effort is called Lombard speech. In order to build intelligent conversational devices that can operate in a noisy ambience, it is imperative to study the characteristics and processing of Lombard speech. Though research on Lombard speech started several decades ago, it needs to be explored further in the current scenario, which is seeing an explosion of voice-driven applications. Systems designed to work with normal speech spoken in a quiet ambience fail to provide the same performance in changing environmental contexts. Different contexts lead to different styles of Lombard speech, and hence there arises a need for efficient ways of handling variations in speaking styles in noise. Lombard speech is also more intelligible than the normal speech of a speaker. Applications like public announcement systems with a speech output interface should talk with varying degrees of vocal effort to enhance naturalness, in the way that humans adapt their speech in noise, in real time. This review article is an attempt to summarize the progress of work on possible ways of processing Lombard speech to build smart and robust human-machine interactive systems with a speech input-output interface, irrespective of operating environmental contexts, for different application needs. It is a comprehensive review of studies on Lombard speech, highlighting the key differences observed in acoustic and perceptual analyses of Lombard speech and detailing the Lombard effect compensation methods towards improving the robustness of speech-based recognition systems.
Ever since its inception in 2005, YouTube has been growing exponentially in terms of personnel and popularity, providing video streaming services that allow users to freely utilize the platform. After initiating an advertisement-based revenue system to monetize the site in 2007, the Google Inc. based company has been improving the system to serve users relevant advertisements. In this article, 7 recommendation engines are developed and compared with each other to determine the efficiency and the user specificity of each engine. From the experiments and user-based testing conducted, it is observed that the engine that recommends advertisements using the recognized objects and texts, along with the video watch history, performs best, recommending the most relevant advertisements in 90% of the testing scenarios.
Epilepsy is a group of neurological disorders identifiable by infrequent but recurrent seizures. Seizure prediction is widely recognized as a significant problem in the neuroscience domain. Developing a Brain-Computer Interface (BCI) for seizure prediction can provide an alert to the patient, giving a buffer time to get the necessary emergency medication or at least be able to call for help, thus improving the quality of life of patients. A considerable number of clinical studies have presented evidence of symptoms (patterns) before seizure episodes, and thus there is a large body of research on seizure prediction; however, there is very little existing literature that illustrates the use of structured processes in machine learning for predicting seizures. Limited training data and class imbalance (EEG segments corresponding to the preictal phase, the duration just before the seizure up to about an hour prior to the episode, are usually in a tiny minority) are a few challenges that need to be addressed.
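As a sketch of one common way to handle the class imbalance mentioned above, the snippet below weights the minority (preictal) class during training with scikit-learn; the features, labels and classifier choice are placeholders rather than the study's pipeline.

```python
# Sketch: weighting the minority (preictal) class during training to counter
# the imbalance described above.  The feature matrix, labels and the choice
# of a random-forest classifier are placeholders, not the study's pipeline.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))            # e.g. features from EEG segments
y = (rng.random(1000) < 0.05).astype(int)  # ~5% preictal: heavy imbalance

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), zero_division=0))
```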
This paper presents a person identification system which combines recognition of facial features with spoken word recognition using visual features alone. It incorporates a face recognition algorithm to identify the person, followed by spoken word recognition of a 'lip-read' password. For face recognition, PCA is used for feature extraction, followed by KNN-based classification on the reduced-dimensionality features. Spoken word recognition of passwords is performed using visual lip reading (visual ASR). The visual features corresponding to the spoken word are extracted using DWT and are then recognized using an HMM-based approach. Since evidences from face recognition and visual lip reading can be complementary in nature, the scores from the two modalities are combined. Based on the combined evidences, decision making for person identification is carried out. The performance for face identification is 90% while the accuracy for visual speech recognition is 72%. By combining these evidences, an improved accuracy of 98% is achieved.
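A minimal sketch of the face-identification path (PCA followed by KNN) using scikit-learn is given below; the LFW dataset, 50 principal components and k = 3 are illustrative assumptions, not the paper's data or settings.

```python
# Minimal sketch of the face-identification path: PCA for dimensionality
# reduction followed by a KNN classifier, here on scikit-learn's LFW faces.
# The dataset, 50 components and k=3 are illustrative assumptions.
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

faces = fetch_lfw_people(min_faces_per_person=50)   # downloads on first use
X_tr, X_te, y_tr, y_te = train_test_split(
    faces.data, faces.target, stratify=faces.target, random_state=0)

model = make_pipeline(PCA(n_components=50, whiten=True),
                      KNeighborsClassifier(n_neighbors=3))
model.fit(X_tr, y_tr)
print("face identification accuracy:", model.score(X_te, y_te))
```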
Emotional health refers to the overall psychological well-being of a person. Prolonged disturbances in the emotional state of an individual can affect their health and, if left unchecked, could lead to serious health disorders. Monitoring the emotional well-being of the individual therefore becomes a vital component of health administration. Speech and physiological signals like heart rate are affected by emotion and can be used to identify the current emotional state of a person. Combining evidences from these complementary signals helps in better discrimination of emotions. This paper proposes a multimodal approach to identify emotion using a close-talk microphone and a heart rate sensor to record the speech and heart rate parameters, respectively. Feature selection is performed on the feature set comprising features extracted from speech, such as pitch, Mel-frequency cepstral coefficients, formants, jitter and shimmer, and heart beat parameters such as heart rate, the mean, standard deviation and root mean square of interbeat intervals, heart rate variability, etc. Emotion is individually modeled as a weighted combination of speech features and heart rate features. The performance of the models is evaluated. Score-based late fusion is used to combine the two models and improve recognition accuracy. The combination shows an improvement in performance.
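The score-based late fusion step can be sketched as a weighted sum of normalised per-emotion scores from the two models, as below; the emotion set, weight and score values are illustrative assumptions.

```python
# Sketch of score-based late fusion: per-emotion scores from the speech model
# and the heart-rate model are combined as a weighted sum before the final
# decision.  The weight and the score values are illustrative assumptions.
import numpy as np

EMOTIONS = ["neutral", "happy", "angry", "sad"]

def late_fusion(speech_scores, hr_scores, w_speech=0.6):
    """Weighted combination of normalised per-class scores from two models."""
    speech = np.asarray(speech_scores) / np.sum(speech_scores)
    hr = np.asarray(hr_scores) / np.sum(hr_scores)
    fused = w_speech * speech + (1.0 - w_speech) * hr
    return EMOTIONS[int(np.argmax(fused))], fused

# Example: the speech model leans towards "angry", the heart-rate model agrees.
label, fused = late_fusion([0.1, 0.2, 0.6, 0.1], [0.2, 0.1, 0.5, 0.2])
print(label, fused)
```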
The discrimination of asymptomatic chronic alcoholics from non-alcoholics using the brain activity patterns is studied in this paper. Detection of the abnormalities in the cognitive processing ability of chronic alcoholics is essential for their rehabilitation, and also in screening them for certain jobs. The brain patterns evoked in response to visual stimuli, known as visual evoked potentials (VEP), reflect the
Quora, an online question-answering platform, has a lot of duplicate questions, i.e. questions that convey the same meaning. Since it is open to all users, anyone can pose a question any number of times, which increases the count of duplicate questions. This paper uses a dataset comprising question pairs (taken from the Quora website) in different columns, with an indication of whether the pair of questions are duplicates or not. Traditional comparison methods like the sequence matcher perform a letter-by-letter comparison without understanding the contextual information, and hence give lower accuracy. Machine learning methods predict the similarity using features extracted from the context. Both the traditional methods and the machine learning methods are compared in this study. The features for the machine learning methods are extracted using the bag-of-words models CountVectorizer and TfidfVectorizer. Among the traditional comparison methods, the sequence matcher gave the highe...
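A minimal sketch of the machine-learning route, with TF-IDF bag-of-words features for a question pair fed to a linear classifier, is shown below; the toy pairs and the logistic-regression model are illustrative stand-ins for the Quora data and the classifiers compared in the paper.

```python
# Sketch of the machine-learning route: bag-of-words (TF-IDF) features for a
# question pair fed to a linear classifier.  The toy pairs and the logistic-
# regression model are illustrative; the paper's dataset is the Quora corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import scipy.sparse as sp

pairs = [("How do I learn Python?", "What is the best way to learn Python?", 1),
         ("How do I learn Python?", "How old is the Earth?", 0),
         ("What causes rain?", "Why does it rain?", 1),
         ("What causes rain?", "How do planes fly?", 0)]

q1, q2, y = zip(*pairs)
vec = TfidfVectorizer().fit(q1 + q2)               # shared vocabulary
X = sp.hstack([vec.transform(q1), vec.transform(q2)])

clf = LogisticRegression().fit(X, y)
test = sp.hstack([vec.transform(["How can I study Python?"]),
                  vec.transform(["What is a good way to learn Python?"])])
print("duplicate probability:", clf.predict_proba(test)[0, 1])
```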
This paper presents a study on alternative speech sensors for speech processing applications. Noise robustness is one of the major considerations in speech processing systems. In the presence of noise, the speech signal naturally becomes unintelligible and thus degrades the performance of automatic speech recognition systems. A close-talk microphone performs well for clean speech signals, but close-talk microphone based recognition fails under real non-stationary conditions and is also strongly degraded by background noise. One way of improving such a system's performance is to use alternative sensors, which are attached to the speaker's skin and receive the uttered speech through the throat or bones. There are two types of such sensors, namely alternative acoustic and non-acoustic sensors. First, alternative acoustic sensors are more isolated from environmental noise and pick up the speech signal in a robust manner. The second approach is to develop a noise-robust speech recognition system using a multi-sensor approach, which combines the information from different acoustic speech sensors. The third involves non-acoustic speech sensors, which are primarily used for speaker identification tasks and some speech recognition applications. The fourth discusses speech enhancement methods for the noisy speech signal. These approaches help to improve the speech recognition system in noisy conditions and lead to building a robust ASR system.
Speaker identification has become a mainstream technology in the field of machine learning that involves determining the identity of a speaker from his/her speech sample. A person's speech contains many features that can be used to discriminate his/her identity. A model that can identify a speaker has wide applications such as biometric authentication, security, forensics and human-machine interaction. This paper implements a speaker identification system based on a Random Forest classifier to identify the various speakers, using MFCC and RPS as feature extraction techniques. The output obtained from the Random Forest classifier shows promising results. It is observed that the accuracy is significantly higher with MFCC as compared to the RPS technique on data taken from the well-known TIMIT corpus.
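The MFCC-plus-Random-Forest branch can be sketched as below; librosa for MFCC extraction and the synthetic waveforms are assumptions standing in for the TIMIT recordings and whatever front end the paper used.

```python
# Minimal sketch of the MFCC branch: mean/std MFCC vectors per utterance as
# input to a Random Forest speaker classifier.  librosa and the synthetic
# waveforms are assumptions standing in for the TIMIT recordings.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def utterance_features(wav, sr=16000, n_mfcc=13):
    """Mean and std of the MFCC trajectory as a fixed-length utterance vector."""
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

rng = np.random.default_rng(0)
n_speakers, per_speaker = 5, 10
X = np.array([utterance_features(rng.normal(size=16000).astype(np.float32))
              for _ in range(n_speakers * per_speaker)])
y = np.repeat(np.arange(n_speakers), per_speaker)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("predicted speaker:", clf.predict(X[:1])[0])
```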
In a noisy environment, a speaker tends to increase his/her vocal effort due to the hindrance in auditory self-feedback, in order to ensure effective communication. This is called the Lombard effect. The Lombard effect degrades the performance of speech systems that are built using normal speech, due to the mismatch between the test (Lombard speech) data and the training (normal speech) data. This study proposes a spectral transformation technique that maps the weighted Linear Prediction Cepstral Coefficient (wLPCC) features of Lombard speech to those of normal speech using a Multi-Layer Feed-Forward Neural Network (MLFFNN). The efficiency of the mapping is objectively tested using the Itakura distance metric. A text-independent speaker recognition system is built in the Gaussian Mixture Model (GMM) framework using normal speech as training data, to test the effectiveness of the mapping technique. While the performance of the system when tested with Lombard speech drops to 71%, it improves si...
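A sketch of the mapping idea, using scikit-learn's MLPRegressor and random paired feature vectors as stand-ins for the paper's MLFFNN and wLPCC features, is shown below.

```python
# Sketch of the mapping idea: a feed-forward network regresses normal-speech
# cepstral vectors from the corresponding Lombard-speech vectors.  sklearn's
# MLPRegressor and the random paired vectors are stand-ins for the paper's
# MLFFNN and wLPCC features.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
lombard = rng.normal(size=(500, 13))                            # Lombard features
normal = lombard * 0.8 + rng.normal(scale=0.1, size=(500, 13))  # paired targets

mapper = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
mapper.fit(lombard, normal)

mapped = mapper.predict(lombard[:1])           # 'normalised' feature vector
print(np.linalg.norm(mapped - normal[:1]))     # distance to the target after mapping
```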
The objective of this paper is to emphasize the need for identifying and extracting suitable features for storage and efficient retrieval of data in the context of audio indexing. There are a wide variety of audio indexing tasks, such as identifying the speaker, language, music of different classes (pop, classical, jazz, etc.), music from different instruments, and classification of audio clips (commentary, news, football, advertisement, etc.). Each type of task requires features specific to that task. Normally, gross and measurable parameters or features based on amplitude, zero-crossing, bandwidth, band energy in the sub-bands, spectrum and periodicity properties are used in audio indexing applications. But it is shown that perceptually significant information in the audio data is present in the form of a sequence of events, and it is a challenge to extract features from these sequences of events. In this paper we demonstrate, through various illustrations, the importance of the residual data obtained after removing the predictable part of the audio data. This residual data seems to contain perceptually significant information, but it is difficult to extract that information using known signal processing algorithms.
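A minimal sketch of obtaining the residual, by inverse filtering a frame with its linear-prediction coefficients, is given below; librosa's lpc routine and the LP order of 12 are assumptions used only for illustration.

```python
# Sketch of extracting the residual: remove the linear-prediction (predictable)
# part of the signal by inverse filtering with the LP coefficients.  librosa's
# lpc routine and the LP order of 12 are assumptions used for illustration.
import numpy as np
import librosa
from scipy.signal import lfilter

def lp_residual(frame, order=12):
    """Residual of one frame after removing the LP-predictable component."""
    a = librosa.lpc(frame, order=order)     # a[0] == 1, prediction polynomial
    return lfilter(a, [1.0], frame)         # inverse filter -> residual

frame = np.sin(2 * np.pi * 120 * np.arange(400) / 16000).astype(np.float32)
res = lp_residual(frame)
print("frame energy:", float(np.sum(frame**2)),
      "residual energy:", float(np.sum(res**2)))
```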
Research work on the design of robust multimodal speech recognition systems, making use of acoustic and visual cues extracted with relatively noise-robust alternate speech sensors, is gaining interest in recent times among the speech processing research fraternity. The primary objective of this work is to study the exclusive influence of the Lombard effect on the automatic recognition of the confusable syllabic consonant-vowel units of the Hindi language, as a step towards building robust multimodal ASR systems for adverse environments in the context of Indian languages, which are syllabic in nature. The dataset for this work comprises the 145 confusable consonant-vowel (CV) syllabic units of Hindi, recorded simultaneously using three modalities that capture the acoustic and visual speech cues, namely a normal acoustic microphone (NM), a throat microphone (TM) and a camera that captures the associated lip movements. The Lombard effect is induced by feeding crowd noise into the speaker...
Transient Evoked Otoacoustic Emissions (TEOAE) are a class of otoacoustic emissions that are generated by the cochlea in response to an external stimulus. TEOAE signals exhibit characteristics unique to an individual and are therefore considered a potential biometric modality. Unlike conventional modalities, TEOAE is immune to replay and falsification attacks due to its implicit liveness detection feature. In this paper, we propose an efficient deep neural network architecture, EarNet, to learn the appropriate filters for the non-stationary TEOAE signals, which can reveal individual uniqueness and long-term reproducibility. EarNet is inspired by Google's FaceNet. Furthermore, the embeddings generated by EarNet in the Euclidean space are such that they reduce intra-subject variability while capturing inter-subject variability, as visualized using t-SNE. The embeddings from EarNet are used for identification and verification tasks. The K-Nearest Neighbour classifier gives identification accuracies of 99.21% and 99.42% for the left and right ear, respectively, which are the highest among the machine learning algorithms explored in this work. Verification using Pearson correlation on the embeddings performs with an EER of 0.581% and 0.057% for the left and right ear, respectively, scoring better than all other techniques. A fusion strategy yields an improved identification accuracy of 99.92%. The embeddings generalize well to subjects that are not part of the training, and hence EarNet is scalable to any new, larger dataset.
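The verification step can be sketched as a Pearson-correlation comparison of two embeddings against a threshold, as below; the embedding size and threshold are illustrative assumptions, not EarNet's actual outputs.

```python
# Sketch of the verification step: compare two embeddings with the Pearson
# correlation and accept the claim if the score exceeds a threshold.  The
# embedding size and the threshold value are illustrative assumptions.
import numpy as np
from scipy.stats import pearsonr

def verify(enrolled_embedding, probe_embedding, threshold=0.9):
    """Accept/reject a claimed identity from the correlation of embeddings."""
    score, _ = pearsonr(enrolled_embedding, probe_embedding)
    return score >= threshold, score

rng = np.random.default_rng(0)
enrolled = rng.normal(size=128)
genuine = enrolled + rng.normal(scale=0.1, size=128)   # same subject, new session
impostor = rng.normal(size=128)

print(verify(enrolled, genuine))     # expected: (True, high score)
print(verify(enrolled, impostor))    # expected: (False, near-zero score)
```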
The robustness of automatic speech recognition (ASR) systems degrades due to factors such as environmental noise, speaker variability, and channel distortion, among others. Approaches such as speech signal processing, model adaptation, hybrid techniques and the integration of multiple sources are used for ASR system development. This paper focuses on building a robust ASR system by combining the complementary evidence present in the multiple modalities through which speech is expressed. Speech sounds are produced with lip radiation and accompanying lip movements, which can be exploited for Visual Speech Recognition (VSR). A VSR system converts lip movement into spoken words and consists of lip region detection, visual speech feature extraction and modeling techniques. Robust feature extraction from visual lip movement is a challenging task in a VSR system. Hence, this paper reviews the feature extraction methods and existing databases used for VSR systems. The fusion of visual lip movements with the ASR system at different levels is also presented.
The fourth industrial revolution, named Industry 4.0, encapsulates industrial production oriented towards an intelligent and autonomous manufacturing process. This in turn depends upon cyber-physical systems and the design and development of cyber-physical production systems whose operations are monitored, coordinated, controlled and integrated by a computing and communication core. For an effective implementation of Industry 4.0, engineers should have expanded design skills that cover interoperability, virtualisation, decentralisation, real-time capability, service orientation, modularity, etc., along with information technology skills. Therefore, the engineering education that produces engineers for Industry 4.0, referred to as Engineering Education 4.0 (EE 4.0), should be transformed to meet the demands of Industry 4.0, which stresses the integration of all the engineering disciplines. Instead of the present discipline-dependent curriculum, this paper proposes a discipline-independent framework for the EE 4.0 curriculum, in which all the engineering disciplines are amalgamated to generate a unique discipline called Engineering 4.0. Every engineer who completes the proposed curriculum will have the basic skills of all the disciplines, social responsibility and professional ethics. Learning outcomes, measured in the usual way with the standard tools, are mapped with the skill requirements of Industry 4.0.
Epilepsy is a chronic neurological disorder that affects the function of the brain in people of all ages. It manifests in the electroencephalogram (EEG) signal which records the electrical activity of the brain. Various image processing, signal processing, and machine-learning based techniques are employed to analyze epilepsy, using spatial and temporal features. The nervous system that generates the EEG signal is considered nonlinear and the EEG signals exhibit chaotic behavior. In order to capture these nonlinear dynamics, we use reconstructed phase space (RPS) representation of the signal. Earlier studies have primarily addressed seizure detection as a binary classification (normal vs. ictal) problem and rarely as a ternary class (normal vs. interictal vs. ictal) problem. We employ transfer learning on a pre-trained deep neural network model and retrain it using RPS images of the EEG signal. The classification accuracy of the model for the binary classes is (98.5±1.5)% and (95±2)% for the ternary classes. The performance of the convolution neural network (CNN) model is better than the other existing statistical approach for all performance indicators such as accuracy, sensitivity, and specificity. The result of the proposed approach shows the prospect of employing RPS images with CNN for predicting epileptic seizures.
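A sketch of the transfer-learning step, reusing a pre-trained image backbone and training only a new classification head on RPS images, is shown below; MobileNetV2, the 96x96 input size and the three-class head are illustrative assumptions rather than the study's exact model.

```python
# Sketch of the transfer-learning step: a pre-trained image backbone is reused
# and only a new classification head is trained on RPS images of EEG segments.
# MobileNetV2, the 96x96 input size and three classes (normal / interictal /
# ictal) are illustrative assumptions, not the study's exact model.
import numpy as np
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, weights="imagenet", pooling="avg")
base.trainable = False                      # keep the pre-trained filters frozen

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(3, activation="softmax"),  # normal / interictal / ictal
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Placeholder batch of RPS images rendered from EEG segments.
X = np.random.rand(8, 96, 96, 3).astype("float32")
y = np.random.randint(0, 3, size=8)
model.fit(X, y, epochs=1, verbose=0)
```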
In this paper, we analyse segmented speech phonemes with convolutional filters, after embedding them in a Reconstructed Phase Space (RPS). These feature-extracting convolutional filters are trained on the embedded speech data from scratch and are also fine-tuned from networks trained on other data. Reconstruction of the phase space portrays the dynamics of an observed system as a geometric representation. We present a study highlighting the discriminative capacity of the features extracted through a Convolutional Neural Network (CNN) from the textural pattern and shape of this geometric representation. CNNs are heavily used in image-related tasks, but have not seen application on phase space portraits, possibly due to the higher dimensionality of the embedding. However, we find that applying a CNN to a restricted bi-dimensional RPS characterizes the space better than prior methods on high-dimensional embeddings. We show experimental results supporting the use of RPS with CNN (RPS-CNN) for phoneme classification. The results affirm that essential signal characteristics are automatically quantified from the phase portraits of speech and can be used in place of conventional techniques involving frequency domain transformations.
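A minimal sketch of the restricted bi-dimensional RPS, rasterising the delay embedding of a segment into an image a CNN could consume, is given below; the delay of 4 samples and the 64x64 grid are illustrative assumptions.

```python
# Sketch of the two-dimensional reconstructed phase space: the signal is
# plotted against a delayed copy of itself and rasterised into an image that
# a CNN can consume.  The delay of 4 samples and the 64x64 grid are
# illustrative assumptions.
import numpy as np

def rps_image(signal, delay=4, bins=64):
    """Rasterise the 2-D delay embedding (x[n], x[n+delay]) into a grid image."""
    x, y = signal[:-delay], signal[delay:]
    hist, _, _ = np.histogram2d(x, y, bins=bins, range=[[-1, 1], [-1, 1]])
    return (hist > 0).astype(np.float32)        # binary phase-portrait image

t = np.arange(800) / 8000.0
phoneme = 0.8 * np.sin(2 * np.pi * 220 * t)      # toy segmented 'phoneme'
img = rps_image(phoneme / np.max(np.abs(phoneme)))
print(img.shape, img.sum())                      # 64x64 portrait, occupied cells
```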
Speech recognition involves transforming input speech sounds into a sequence of units called symbols and converting the symbol sequence into text corresponding to the message in the speech signal. Knowledge of the number of legal sound units in a language, their relative frequency of occurrence and other statistical aspects of the sound units may be useful to improve the performance of the overall system. In this paper we briefly describe the speech database developed for two Indian languages, Tamil and Telugu. We analyze the relative frequency of the sound units and the grouping of Consonant-Vowel (CV) units into different clusters. The duration of different sound units is also studied. The duration information on the dynamic nature of the sound units plays an important role in providing naturalness in text-to-speech synthesis. Finally, the advantage of these statistics for speech recognition and spoken language identification is highlighted.
The performance of an Automatic Speech Recognition (ASR) system built using close-talk microphones degrades in noisy environments. An ASR system built using Throat Microphone (TM) speech shows relatively better performance under such adverse situations. However, some of the sounds are not well captured by the TM. In this work we explore the combined use of Normal Microphone (NM) and TM features to improve the recognition rate of the ASR. In the proposed work, the combined Mel-Frequency Cepstral Coefficients (MFCC) derived from the two signals are used to build an ASR in the HMM framework to recognize the 145 syllabic units of the Indian language Hindi. The performance of this combined ASR system shows a significant improvement when compared with the individual ASR systems built using NM and TM features, respectively.
Retrieval of relevant information is becoming increasingly difficult owing to the presence of an ocean of information on the World Wide Web. Users in need of quick access to specific information are subjected to a series of web redirections before finally arriving at the page that contains the required information. In this paper, an optimal voice-based web content retrieval system is proposed that makes use of an open source speech recognition engine to deal with voice inputs. The proposed system performs a quicker retrieval of relevant content from Wikipedia and instantly presents the textual information along with the related image to the user. This search is faster than conventional web content retrieval techniques. The current system is built with a limited vocabulary but can be extended to support a larger vocabulary. Additionally, the system is also scalable to retrieve content from a few other sources of information apart from Wikipedia.
The objective of this paper is to improve the performance of a speaker recognition system by combining the speaker-specific evidence present in the spectral characteristics of standard microphone speech and throat microphone speech. Certain vocal tract spectral features extracted from these two speech signals are distinct and could be complementary to one another. These features could be speech-specific as well as speaker-specific. This distinguishing and complementary nature of the spectral features is due to the difference in the placement of the two microphones. Autoassociative neural networks are used to model the speaker characteristics based on the system features represented by weighted linear prediction cepstral coefficients. The speaker recognition system based on Throat Microphone (TM) spectral features is comparable (though slightly less accurate) to that based on standard (or Normal) Microphone (NM) features. By combining the evidence from both the NM and TM based systems using late integration, an improvement in performance is observed, from about 91% (obtained using NM features alone) to 94% (NM and TM combined). This shows the potential of combining various other speaker-specific characteristics of the NM and TM speech signals for further improvement in performance.
[Extraction fragment: co-authored with Chirag (chirag@speech.iitm.ernet.in) and A. Nayeemulla Khan (nayeem@speech.iitm.ernet.in), Dept. of CSE, IIT Madras, Chennai - 600036, Tamil Nadu, India; abstract not recoverable.]
In this paper, we propose a novel system for providing summaries of commercial contracts such as Non-Disclosure Agreements (NDAs), employment agreements, etc., to enable those reviewing the contract to spend less time on such reviews and also improve understanding. Since a majority of such commercial documents are paragraphed and contain headings/topics followed by their respective content along with their context, we extract those topics and summarize them as per the user's need. We propose that summarizing such paragraphs/topics as per requirements is a more viable approach than summarizing the whole document. We use extractive summarization approaches for this task and compare their performance with human-written summaries. We conclude that the results of the extractive techniques are satisfactory and could be improved with a large corpus of data and supervised abstractive summarization methods.
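A simple extractive baseline of the kind discussed here can be sketched as TF-IDF sentence scoring over a clause, as below; the sentence splitter and top-k value are illustrative assumptions, not the specific systems compared in the paper.

```python
# Sketch of a simple extractive approach: score each sentence of a contract
# clause by its summed TF-IDF weight and keep the top-ranked sentences in
# their original order.  The sentence splitter and top_k value are
# illustrative assumptions.
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary(text, top_k=2):
    """Return the top_k highest-scoring sentences in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= top_k:
        return " ".join(sentences)
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = np.asarray(tfidf.sum(axis=1)).ravel()
    keep = sorted(np.argsort(scores)[-top_k:])
    return " ".join(sentences[i] for i in keep)

clause = ("The Receiving Party shall keep the Confidential Information secret. "
          "Disclosure is permitted only to employees who need to know it. "
          "This obligation survives termination of the Agreement for five years. "
          "Notices under this clause must be sent in writing.")
print(extractive_summary(clause))
```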
[Extraction fragment: table of confusable Consonant-Vowel units grouped by vowel subgroup /a/, /i/, /u/, /e/, /o/, e.g. Da Di Du De Do; da di du de do; ba bi bu be bo; gha ghi ghu ghe gho; Dha Dhi Dhu Dhe Dho; dha dhi dhu dhe dho; bha bhi bhu bhe bho.]