
ECE 2006
DIGITAL SIGNAL PROCESSING

Final Report

By:
• AAKASH SARAWGI 17BEC0918
Slot: G1+TG1

Topic:
Speaker Recognition System Based on MFCC
Table of Contents:
1. Abstract
2. Introduction
3. Data Processing or Speech Feature Extraction
4. Feature Matching
5. Result and Conclusion
6. Acknowledgement
7. References
Abstract:
• Speaker recognition is the process of automatically recognizing who
is speaking on the basis of individual information included in speech
waves. This technique makes it possible to use the speaker's voice to
verify their identity and control access to services such as voice
dialing, banking by telephone, telephone shopping, database access
services, information services, voice mail, security control for
confidential information areas, and remote access to computers. This
document describes how to build a simple, yet complete and
representative automatic speaker recognition system. Such a speaker
recognition system has potential in many security applications. For
example, users have to speak a PIN (Personal Identification Number)
in order to gain access to the laboratory door, or users have to speak
their credit card number over the telephone line to verify their
identity. By checking the voice characteristics of the input utterance,
using an automatic speaker recognition system similar to the one that
we will describe, the system is able to add an extra level of security.
Introduction:
• Speaker recognition can be classified into identification and verification.
Speaker identification is the process of determining which registered speaker
provides a given utterance. Speaker verification, on the other hand, is the
process of accepting or rejecting the identity claim of a speaker. Figure 1
shows the basic structures of speaker identification and verification systems.
The system that we will describe is classified as a text-independent speaker
identification system, since its task is to identify the person speaking
regardless of what is being said. At the highest level, all speaker recognition
systems contain two main modules (refer to Figure 1): feature extraction and
feature matching. Feature extraction is the process that extracts a small amount
of data from the voice signal that can later be used to represent each speaker.
Feature matching involves the actual procedure to identify the unknown
speaker by comparing extracted features from his/her voice input with the ones
from a set of known speakers.
All speaker recognition systems have to serve two distinct phases. The
first is referred to as the enrolment or training phase, while the second is
referred to as the operational or testing phase. In the training phase, each
registered speaker has to provide samples of their speech so that the system can
build or train a reference model for that speaker. In the case of speaker
verification systems, a speaker-specific threshold is additionally computed from
the training samples. In the testing phase, the input speech is matched with the
stored reference model(s) and a recognition decision is made.
Speaker recognition is a difficult task. Automatic speaker recognition works
on the premise that a person's speech exhibits characteristics that are
unique to the speaker. However, this task is challenged by the high
variability of input speech signals. The principal source of variance is the
speaker himself/herself. Speech signals in training and testing sessions can be
greatly different due to many factors: a person's voice changes with time, health
conditions (e.g. the speaker has a cold), speaking rates, and so on. There are
also other factors, beyond speaker variability, that present a challenge to
speaker recognition technology. Examples of these are acoustical noise and
variations in recording environments (e.g. speaker uses different telephone
handsets).
Data Processing or Speech Feature Extraction
1. Introduction
The purpose of this module is to convert the speech waveform, using digital
signal processing (DSP) tools, to a set of features (at a considerably lower
information rate) for further analysis. This is often referred to as the signal-
processing front end. The speech signal is a slowly time-varying signal (it is
called quasi-stationary). An example of a speech signal is shown in Figure 2.
When examined over a sufficiently short period of time (between 5 and 100
msec), its characteristics are fairly stationary. However, over longer periods of
time (on the order of 1/5 second or more) the signal characteristics change to
reflect the different speech sounds being spoken. Therefore, short-time spectral
analysis is the most common way to characterize the speech signal.
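As a concrete illustration of short-time analysis, the waveform can be split into short overlapping frames so that each frame is approximately stationary. The sketch below is in Python (the report's own experiments used MATLAB); the frame length and step sizes are illustrative assumptions, not values taken from the report.

```python
import math

def frame_signal(signal, frame_len, frame_step):
    """Split a sampled signal into short overlapping frames.

    frame_len and frame_step are in samples; e.g. at 8 kHz,
    a 25 ms frame with a 10 ms step is frame_len=200, frame_step=80.
    """
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_step):
        frames.append(signal[start:start + frame_len])
    return frames

# toy example: 1 s of a 100 Hz sine sampled at 8 kHz
fs = 8000
signal = [math.sin(2 * math.pi * 100 * n / fs) for n in range(fs)]
frames = frame_signal(signal, frame_len=200, frame_step=80)  # 25 ms / 10 ms
```

Each frame is then analysed independently, which is exactly what "short-time spectral analysis" refers to.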

A wide range of possibilities exists for parametrically representing the speech
signal for the speaker recognition task, such as Linear Prediction Coding
(LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others. MFCC is
perhaps the best known and most popular, and will be described in this paper.
MFCCs are based on the known variation of the human ear's critical
bandwidths with frequency: filters spaced linearly at low frequencies and
logarithmically at high frequencies have been used to capture the phonetically
important characteristics of speech. This is expressed in the mel-frequency
scale, which has linear frequency spacing below 1000 Hz and logarithmic
spacing above 1000 Hz.
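The mel scale described above is commonly approximated by a simple closed-form mapping. The Python sketch below shows one widely used approximation (the report does not specify which formula it used, so this is an assumption):

```python
import math

def hz_to_mel(f_hz):
    """Common mel-scale approximation: roughly linear below ~1 kHz,
    logarithmic above, with mel(1000 Hz) close to 1000 mels."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used when placing mel filterbank centre frequencies."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

With this mapping, filterbank centres chosen uniformly in mels are spaced roughly linearly below 1000 Hz and logarithmically above it, matching the description in the text.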

2. Mel-frequency cepstrum coefficients processor


A block diagram of the structure of an MFCC processor is given in Figure 3.
The speech input is typically recorded at a sampling rate above 10000 Hz. This
sampling frequency is chosen to minimize the effects of aliasing in the
analog-to-digital conversion. Such sampled signals can capture all frequencies
up to 5 kHz, which covers most of the energy of sounds generated by humans.
As discussed previously, the main purpose of the MFCC processor is to
mimic the behaviour of the human ear. In addition, MFCCs have been shown to
be less susceptible than the raw speech waveforms to the variations mentioned
above.
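One standard step inside an MFCC front end, after framing, is to taper each frame with a window (commonly a Hamming window) before computing its spectrum. The Python sketch below illustrates this step; it is a generic MFCC-front-end detail, not code taken from the report:

```python
import math

def hamming(n_samples):
    """Hamming window coefficients for one analysis frame."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def window_frame(frame):
    """Taper a frame before the FFT to reduce spectral leakage
    caused by cutting the signal into finite-length blocks."""
    w = hamming(len(frame))
    return [s * wn for s, wn in zip(frame, w)]
```

The windowed frames would then pass through the FFT, mel filterbank, log, and DCT stages of the processor shown in Figure 3.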
Feature Matching
The problem of speaker recognition belongs to the much broader topic of pattern
recognition in science and engineering. The goal of pattern recognition is to
classify objects of interest into one of a number of categories or classes. The objects
of interest are generically called patterns, and in our case are sequences of acoustic
vectors that are extracted from an input speech signal using the techniques described
in the previous section. The classes here refer to individual speakers. Since the
classification procedure in our case is applied to extracted features, it can also be
referred to as feature matching. Furthermore, if there exists a set of patterns
whose individual classes are already known, then one has a problem in
supervised pattern recognition. These patterns comprise the training set and are used
to derive a classification algorithm. The remaining patterns are then used to test the
classification algorithm; these patterns are collectively referred to as the test set. If
the correct classes of the individual patterns in the test set are also known, then one
can evaluate the performance of the algorithm. State-of-the-art feature
matching techniques used in speaker recognition include Dynamic Time Warping
(DTW), Hidden Markov Modelling (HMM), and Vector Quantization (VQ). In this
project, the VQ approach is used, due to its ease of implementation and high
accuracy. VQ is a process of mapping vectors from a large vector space to a finite
number of regions in that space. Each region is called a cluster and can be
represented by its centre, called a codeword. The collection of all codewords is
called a codebook.

Figure 5 shows a conceptual diagram illustrating this recognition process. In the
figure, only two speakers and two dimensions of the acoustic space are shown. The
circles refer to the acoustic vectors from speaker 1, while the triangles are from
speaker 2. In the training phase, a speaker-specific VQ codebook is generated for
each known speaker by clustering his/her training acoustic vectors. The resulting
codewords (centroids) are shown in Figure 5 as black circles and black triangles
for speakers 1 and 2, respectively. The distance from a vector to the closest
codeword of a codebook is called the VQ distortion. In the recognition phase, an
input utterance from an unknown voice is "vector-quantized" using each trained
codebook and the total VQ distortion is computed. The speaker corresponding to
the VQ codebook with the smallest total distortion is identified as the speaker of
the input utterance.
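The decision rule in the previous paragraph can be sketched in a few lines of Python (the report's implementation was in MATLAB; the speaker names and toy codebooks below are purely illustrative):

```python
def vq_distortion(vectors, codebook):
    """Average squared distance from each acoustic vector to its
    nearest codeword; this is the total VQ distortion, normalized."""
    total = 0.0
    for v in vectors:
        total += min(sum((a - b) ** 2 for a, b in zip(v, cw)) for cw in codebook)
    return total / len(vectors)

def identify_speaker(test_vectors, codebooks):
    """codebooks: dict mapping speaker name -> trained codebook.
    Returns the speaker whose codebook yields the smallest distortion."""
    return min(codebooks, key=lambda s: vq_distortion(test_vectors, codebooks[s]))

# toy 2-D example with hypothetical speakers
codebooks = {"spk1": [(0.0, 0.0), (1.0, 1.0)],
             "spk2": [(5.0, 5.0), (6.0, 6.0)]}
```

An utterance whose vectors lie near a speaker's codewords produces a small distortion against that speaker's codebook and is labelled accordingly.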
RESULT AND CONCLUSION
• We have speech data from 11 speakers, all speaking the same word, "one". We
processed this data with the method described above to obtain a codebook for
each speaker, which we use as a reference for matching. After saving these
codebooks, we took a second set of recordings from the same speakers and ran
them through our MATLAB test function to check whether our code and process
could identify them. Our system was able to detect and identify every speaker
with good accuracy. This method of feature extraction is accurate and useful for
security applications such as PIN verification and the other purposes stated
above. We can therefore build a database of voice data from various users and
use it for identification, which considerably improves security. Hence this
MFCC-based method should be applied in various areas for identification; for
this task it proved a better choice for recognition than an HMM model.
Acknowledgement:
The satisfaction that accompanies the successful completion of any
task would be incomplete without mention of the people whose ceaseless
cooperation made it possible, and whose constant guidance and encouragement
crown all efforts with success.

I am grateful to my DIGITAL SIGNAL PROCESSING faculty, Prof.
ABHIJIT BHOWMIK, for his constant support, guidance, inspiration and
constructive suggestions that helped me in the preparation of this report.
References:
1. Preliminary design of an ASR system, University of Maryland Eastern Shore.
2. Speech coding and recognition, University of Copenhagen.
3. Human-computer interface for the Kinyarwanda language.
4. Hearing-aid systems for hearing-impaired people.
5. Algorithms for speech recognition and simulation in MATLAB, University of Gävle.
6. Control of devices through voice recognition using MATLAB.
7. www.mathwork.com
8. www.cryptography.com
