Audio-visual person verification

Souheil Yacoub

1999, Computer Vision and …

Audio-Visual Person Verification*

S. Ben-Yacoub, J. Luettin
IDIAP, CP 592, 1920 Martigny, Switzerland
{sby,luettin}@idiap.ch

K. Jonsson, J. Matas and J. Kittler
University of Surrey, Guildford, Surrey GU2 5XH, United Kingdom
{eelsjk,eelkj,ees2gm}@ee.surrey.ac.uk

* This work was supported by the ACTS-M2VTS project and the Swiss Federal Office for Education and Science.

0-7695-0149-4/99 $10.00 © 1999 IEEE

Abstract

In this paper we investigate benefits of classifier combination (fusion) for a multimodal system for personal identity verification. The system uses frontal face images and speech. We show that a sophisticated fusion strategy enables the system to outperform its facial and vocal modules when taken separately. We show that both trained linear weighted schemes and fusion by a Support Vector Machine classifier lead to a significant reduction of total error rates. The complete system is tested on data from a publicly available audio-visual database (XM2VTS, 295 subjects) according to a published protocol.

1 Introduction

Recognition systems based on biometric features (face, voice, iris, etc.) have received a lot of attention in recent years. Most of the proposed approaches focus on mono-modal identification. The system uses a single modality to find the closest person to the user in a database. Relatively high recognition rates were obtained for different modalities like face recognition and speaker recognition [21, 8].

Verification of person identity based on biometric information is important for many security applications. Examples include access control to buildings, surveillance and intrusion detection. In person identity verification, the user claims a certain client identity and the system decides to accept or reject the claim. Only very low error rates can be tolerated in many of the above-mentioned applications.

It has been shown that combining different modalities leads to more robust systems with better performance [5]. One of the remaining questions is what strategy should be adopted for combining different modalities. In order to assess the performance of a method and compare it to other approaches, a large database and an evaluation protocol are necessary. Most of the work done in multi-modal verification [7, 12, 14, 6] was tested and evaluated on small databases (less than 40 persons) or medium-sized ones (less than 100 persons in [5]).

We describe and evaluate in this paper a complete multi-modal user verification system based on facial and vocal modalities. Each module of the system (face, voice, fusion) is tested and evaluated on a large database (XM2VTS database¹ with 295 people) according to a published protocol².

The rest of the paper is organised as follows: face and speech verification modules are described in Sections 2 and 3. The multi-modal data fusion issue is presented in Section 4. The XM2VTS database and its evaluation protocol are described in Section 5. The results and different experiments are presented in Section 6.

¹ From the ACTS-M2VTS project, available at http://www.ee.surrey.ac.uk/Research/VSSP/xm2vts
² Available with the XM2VTS database.

2 Face Verification

The face verification method used is based on robust correlation [11]. Registration is achieved by direct minimization of the robustified correlation score over a multi-dimensional search space. The search space is defined by the set of all valid geometric and photometric transformations. In the current implementation the geometric transformations are translation, scaling and rotation. Given a weak affine transformation $T_a$

$$T_a(x, y) = (a_1 x - a_2 y + a_3,\; a_2 x + a_1 y + a_4) \tag{1}$$

the error function expressing the intensity difference between a pixel $s$ in the model image $I_m$ and its projection in the probe image $I_p$ is defined as

[equation (2) missing in source]

The score function used to evaluate a match between the transformed model image and the probe image is

[equation (3) missing in source]

where $\rho$ denotes a robust kernel. The function is the average percentage of the maximum kernel response taken over some set of pixels $R$. Possible kernel functions are the Huber Minimax and the Hampel (1,1,2) [9]. Experiments reported in [4] showed that the choice of kernel is not critical.

In Equation (3), parameters of the score function are purely geometrical and intensity values are not transformed. In our previous work [17], we included parameters for affine compensation of global illumination changes (gain, offset) into the search space. For efficiency reasons, we decided to adopt a less sophisticated approach in which we shift (for each point in the search space) the histogram of residual errors using the median error.

To find the global extremum of the score function we employ a stochastic search technique incorporating gradient information. The gradient-based search is implemented using steepest descent on a discrete grid. Resolution of the grid is changed during the optimization (multi-resolution in the parameter domain) following a predefined schedule. The different components of the gradient (the partial derivatives with respect to the affine coefficients) are

[gradient equations missing in source]

where $\psi$ denotes the influence function of the robust kernel (obtained by differentiating the kernel) and $s' = T_a(s)$. To escape from local maxima, stochastic search is performed by adding a random vector drawn from an exponential distribution (this optimization technique is effectively a special case of simulated annealing [13]).

To meet real-time requirements of the verification scenario, we adopt a multi-resolution scheme in the spatial domain. This is achieved by applying the combined gradient-based and stochastic optimization described above to each level of a Gaussian pyramid. The estimate obtained on one level is used to initialize the search at the next level. In addition to the speed-up, the multi-resolution search also has the benefit of removing local optima from the search space and thus effectively improving the convergence characteristics of the method.

In the training phase we employ a feature selection procedure based on minimizing the intra-class variance and at the same time maximizing the inter-class variance. A feature criterion is evaluated for each pixel, and the subset of pixels that best discriminates a given client from other clients in the database (effectively modeling the impostor distribution) is selected. This feature subset is then used in verification, allowing efficient identification of the probe image. The presented system runs in real-time on a high-end PC.

3 Speaker Verification

Speaker verification methods can be classified into text-independent and text-dependent methods. The latter usually require that the utterances used for verification are the same as for training. These methods can exploit text-dependent voice individuality and therefore often outperform text-independent methods. We propose two different algorithms: a text-independent method based on the sphericity measure [3] and a text-dependent technique using hidden Markov models (HMM) [19].

3.1 Text-independent Speaker Verification

The first processing step aims to remove silent parts from the raw audio signal, as these parts do not convey speaker-dependent information. We use the speech activity detector proposed by Reynolds et al. [18] on the 16 kHz sub-sampled audio signal. The cleaned audio signal is converted to linear prediction cepstral coefficients (LPCC) [1] using the autocorrelation method. We use a pre-emphasis factor of 0.94, a Hamming window of length 25 ms, a frame interval of 10 ms, and an analysis order of 12. We have applied cepstral mean subtraction (CMS), where the mean cepstral parameter is estimated across each speech file and subtracted from each frame. The energy is normalized by mapping it to the interval [0,1] using the hyperbolic tangent function. The normalized energy is included in the feature vector, leading to 13-dimensional vectors.

A client model is represented by the covariance matrix $X$ computed over the feature vectors of the client's training data. Similarly, an accessing person is represented by the covariance matrix $Y$, computed over that person's speech data. We use the arithmetic-harmonic sphericity measure $D_{SPH}(X, Y)$ [3] as similarity measure between the client and the accessing person:

[equation (4) missing in source]

where $m$ denotes the dimension of the feature vector and $\mathrm{tr}(x)$ the trace of $x$. The similarity values were mapped to the interval [0,1] with a sigmoid function

$$f(D_{SPH}) = \left(1 + \exp(-(D_{SPH} - t))\right)^{-1}$$

where $f(t) = 0.5$. A claimed speaker is rejected if the mapped score $S_{SPH} < 0.5$, otherwise she/he is accepted. We have used person-dependent thresholds $t$ which were estimated on the evaluation set. The processing time required by the speech verification module on a Sun UltraSparc 30 is [fraction illegible in source] of the utterance duration.

3.2 Text-dependent Speaker Verification

Hidden Markov models (HMMs) represent a very efficient approach to model the statistical variations of speech in both the spectral domain and the temporal domain. Our HMM-based verification technique makes use of 3 HMM sets: client models, world models, and silence models. Utterances of a client are represented by client HMMs. The world models serve as speaker-independent models to represent speech of an average person. They are trained on the POLYCOST³ database, which represents a distinct set of speakers that includes neither clients nor impostors of the XM2VTS database. Finally, three silence HMMs are used to model the silent parts of the signal.

The same feature extraction as in the previous section is performed. In addition, the first and second order temporal derivatives were included, leading to 42-dimensional feature vectors. All models were trained based on the maximum likelihood criterion using the Baum-Welch (EM) algorithm. The world models were trained on the segmented words of the POLYCOST database, where one HMM per word was trained.

For both training and verification the sentences of the XM2VTSDB are first segmented into words and silence using the world and silence models. This consists in computing the best path between the sentence and the sequence of known HMMs using the Viterbi algorithm. To do this we used an HMM network that allowed optional silence at the beginning of a sentence, between words, and at the end of a sentence. The client models could then be trained on the segmented training words.

For verification, the Viterbi algorithm is used to calculate the likelihood $p(X_j \mid M_{ij})$, where $X_j$ represents the observation of the segmented word $j$ and $M_{ij}$ represents the model of subject $i$ and word $j$. We normalize the log-likelihood of word $j$ by the number of frames $N_j$ and sum over all $W$ words, which leads to the following measure:

$$\log p(X \mid M_i) = \sum_{j=1}^{W} \frac{1}{N_j} \log p(X_j \mid M_{ij}) \tag{5}$$

This measure is calculated for the models $M_c$ of a given client and for the world models $M_w$. The following similarity:

$$D_{HMM} = \log p(X \mid M_c) - \log p(X \mid M_w) \tag{6}$$

is computed and compared to a threshold $t$. The claiming subject is rejected if $D_{HMM} < t$, otherwise she/he is accepted. The quantities $D_{HMM}$ were mapped to the interval [0,1] as described in Section 3.1. The processing time is half the time of the utterance duration.

³ For more information see http://circwww.epfl.ch/polycost

4 Multi-Modal Data Fusion

Combining different experts results in a system which can outperform the experts when taken individually [15, 10]. This is especially true if the different experts are not correlated. We expect the fusion of vision and speech to achieve better results. In the next section, we compare the Support Vector Machine (SVM) with traditional fusion methods to combine different modalities. The use of SVM is motivated by the fact that verification is basically a binary classification problem (i.e. accept or reject the user) [2].

4.1 SVM

The Support Vector Machine is based on the principle of Structural Risk Minimization (SRM) [20]. Classical learning approaches are designed to minimize the empirical risk (i.e. the error on a training set) and therefore follow the Empirical Risk Minimization principle. The SRM principle states that better generalization capabilities are achieved through a minimization of the bound on the generalization error. We assume that we have a data set $\mathcal{D}$ of $M$ points in an $n$-dimensional space belonging to two different classes $+1$ and $-1$:

$$\mathcal{D} = \{(x_i, y_i) \mid i \in \{1..M\},\; x_i \in \mathbb{R}^n,\; y_i \in \{+1, -1\}\}$$

A binary classifier should find a function $f$ that maps the points from their data space to their label space. It has been shown [20] that the optimal separating hyperplane is expressed as:

$$f(x) = \mathrm{sign}\Big(\sum_i \alpha_i y_i K(x_i, x) + b\Big) \tag{7}$$

where $K(x, y)$ is a positive definite symmetric function, $b$ is a bias estimated on the training set, and $\alpha_i$ are the solutions of the following Quadratic Programming (QP) problem:

$$\max_{\alpha}\; W(\alpha) = \alpha^T \mathbf{1} - \frac{1}{2}\, \alpha^T D\, \alpha \tag{8}$$

with the constraints:

$$\sum_i \alpha_i y_i = 0 \quad \text{and} \quad \alpha_i \geq 0$$

where:

$$(i, j) \in [1..M] \times [1..M], \qquad (\alpha)_i = \alpha_i, \qquad (\mathbf{1})_i = 1, \qquad (D)_{ij} = y_i y_j K(x_i, x_j)$$

The computational complexity of the SVM during training depends on the number of data points rather than on their dimensionality. The number of computation steps is $O(n^3)$ where $n$ is the number of data points. At run time the classification step of the SVM is a simple weighted sum. The classification of 112400 claims requires 5.6 sec on an Ultra-Sparc 30.

5 The XM2VTS Database

The XM2VTSDB database contains synchronized image and speech data as well as sequences with views of rotating heads. The database includes four recordings of 295 subjects taken at one month intervals. On each visit (session) two recordings were made: a speech shot and a head rotation shot. The speech shot consisted of a frontal face recording of each subject during the dialogue.

The database was acquired using a Sony VX1000E digital cam-corder and a DHR1000UX digital VCR. Video is captured at a color sampling resolution of 4:2:0 and 16 bit audio at a frequency of 32 kHz. The video data is compressed at a fixed ratio of 5:1 in the proprietary DV format. In total the database contains approximately 4 TBytes (4000 Gbytes) of data.

When capturing the database the camera settings were kept constant across all four sessions. The head was illuminated from both left and right sides, with diffusion gel sheets being used to keep this illumination as uniform as possible. A blue background was used to allow the head to be easily segmented out using a technique such as chromakey. A high-quality clip-on microphone was used to record the speech. The speech sequence consisted of uttered digits from 0 to 9.

5.1 Evaluation Protocol

The database was divided into three sets: training set, evaluation set, and test set (see Fig. 1). The training set is used to build client models. The evaluation set is selected to produce client and impostor access scores which are used to estimate parameters (i.e. thresholds). The estimated threshold is then used on the test set. The test set is selected to simulate real authentication tests. The three sets can also be classified with respect to subject identities into client set, impostor evaluation set, and impostor test set. For this description, each subject appears only in one set. This ensures realistic evaluation of impostor claims whose identity is unknown to the system.

The protocol is based on 295 subjects, 4 recording sessions, and two shots (repetitions) per recording session. The database was randomly divided into 200 clients, 25 evaluation impostors, and 70 test impostors (see [16] for the subjects' IDs of the three groups).

[Figure 1: Diagram showing the partitioning of the XM2VTSDB according to protocol Configuration I. Axes: Session, Shot; subject groups: Clients, Impostors.]
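The random split of the 295 subjects into 200 clients, 25 evaluation impostors, and 70 test impostors can be illustrated with a minimal sketch. It is not the split used by the protocol: the real subject IDs of each group are fixed and listed in [16], whereas the integer IDs 0..294, the seed, and the function names below are hypothetical. The claim count additionally assumes that every shot of every test impostor (4 sessions, 2 shots each) is tested against every client identity.

```python
import random

N_SUBJECTS = 295
# Group sizes taken from the protocol description.
GROUP_SIZES = {"clients": 200, "eval_impostors": 25, "test_impostors": 70}


def partition_subjects(seed=0):
    """Randomly split hypothetical subject IDs 0..294 into three disjoint groups."""
    ids = list(range(N_SUBJECTS))
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    groups, start = {}, 0
    for name, size in GROUP_SIZES.items():
        groups[name] = ids[start:start + size]
        start += size
    return groups


def count_impostor_claims(n_impostors, n_sessions, shots_per_session, n_clients):
    """Assumption: each impostor shot is used to claim each client identity once."""
    return n_impostors * n_sessions * shots_per_session * n_clients


groups = partition_subjects()
n_test_claims = count_impostor_claims(
    len(groups["test_impostors"]), 4, 2, len(groups["clients"])
)
print({name: len(ids) for name, ids in groups.items()})
print(n_test_claims)  # 70 * 4 * 2 * 200 = 112000
```

Because every subject lands in exactly one group, impostors in the test set are never seen while client models are built or thresholds are tuned.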
5.2 Performance Measures

Two error measures of a verification system are the False Acceptance rate (FA) and the False Rejection rate (FR). False acceptance is the case where an impostor, claiming the identity of a client, is accepted. False rejection is the case where a client, claiming his true identity, is rejected. FA and FR are given by

$$FA = \frac{EI}{I} \times 100\% \qquad \text{and} \qquad FR = \frac{EC}{C} \times 100\%$$

where $EI$ is the number of impostor acceptances, $I$ the number of impostor claims, $EC$ the number of client rejections, and $C$ the number of client claims. A trade-off between FA and FR can be controlled by a threshold. For the protocol configurations, $I$ is 112,000 (70 impostors x 8 shots x 200 clients) and $C$ is 400 (200 clients x 2 shots).

6 Experiments and Results

The video and audio streams of each user are processed by the different verification modules. Three different modalities are considered: face verification (Section 2), sphericity-based speaker verification (Section 3.1), and HMM-based speaker verification (Section 3.2). We performed a series of experiments to evaluate different configuration sets of modalities. The sets are defined as follows:

C1: Face and HMM.
C2: Face, Sphericity and HMM.
C3: HMM and Sphericity.
C4: Face and Sphericity.

For the SVM-based fusion, we used polynomial and Gaussian kernels in our experiments. The training set was used as an evaluation set to see how performance changes with different kernel parameters. The main conclusion is that the performance does not change significantly with different polynomial kernel parameters. The conclusion is also valid for the Gaussian kernel. We chose to run the experiments with the following configurations:

Linear: $K(x, y) = x^t y$
Polynomial: $K(x, y) = (x^t y + 1)^3$
Gaussian: $K(x, y) = \exp(-4 \|x - y\|^2)$

The dimensionality of the data corresponds to the number of modalities to combine. Moreover, SVM computes only dot products with the data and therefore the complexity of SVM is independent of the number of modalities to combine.

Table 1: Performance of Modalities on Test Set
[table values missing in source]

As a baseline fusion experiment we combined the output of the HMM, [...] performance speech verification modules and a medium performance vision module, the conditions are violated and none of the above-mentioned fusion schemes performed better than the best individual expert (the HMM).

We then considered linear weighted combination rules (also used in [12]). Optimal weights and acceptance threshold were chosen using the evaluation set. The performance of the scheme on the test set is summarized in Table 2.

Table 2: Linear weighted combination on the test set

Modalities         FA (%)   FR (%)
HMM and Face       0.86     0.25
Spher. and Face    1.37     2.5
HMM and Spher.     0.64     0.25

Weights (as extracted; assignment to rows and modalities not recoverable): 0.1, 0.05, 0.84, 0.16, 0.9, 0.95

The results show that the trained linear classifier outperforms the linear SVM. This is not unexpected, since SVMs maximize the minimum distance from the decision boundary (the margin) whereas the training of the linear classifier minimizes the error rate (over-training is not a problem for a simple 1-parameter linear classifier). Surprisingly, the linear classifier compares well even with non-linear SVMs.

One more interesting observation can be made. A posteriori, a threshold (point on the ROC curve) can be found for the HMM where this expert outperforms the face and HMM combination. However, at the threshold predicted from training and evaluation data the weighted sum of the Face and HMM experts has a lower error. This suggests that a more stable prediction of the operating point can be made for the fused data.

Table 3: SVM Fusion Performance

        Polynomial        Gaussian          Linear
Set     FA      FR        FA      FR        FA      FR
C1      1.07    0.25      1.18    0         1.47    0
C2      0.34    0.50      0.78    0         1.47    0
C3      0.39    0.50      0.38    0.50      1.47    0
C4      0.13    10.0      1.18    0         1.23    1.25

7 Conclusion

We have described a complete multi-modal person identity verification system with very low error rates (less than 1% total error rate). It was evaluated and tested on a large database (295 people) with a published protocol. Combining different modalities increases the performance of the system and yields better results than individual modalities. One of the major problems is how to combine modalities with different skills. We compared two approaches: a linear weighted classifier and SVM. The linear classifier performed well, and even better than the linear SVM, in combining two modalities (face/speech). SVM has the advantage of combining any number of modalities at the same computational cost with very good fusion results.

References

[1] B.S. Atal. Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. JASA, 55(6):1304-1312, 1974.

[2] S. Ben-Yacoub. Multi-Modal Data Fusion for Person Authentication using SVM. In Proc. of AVBPA'99, Washington DC, pages 25-30, 1999.

[3] F. Bimbot, I. Magrin-Chagnolleau, and L. Mathan. Second-order statistical measure for text-independent speaker identification. Speech Communication, 17(1-2):177-192, 1995.

[4] M. Bober and J. Kittler. Robust motion analysis. In CVPR'94, pages 947-952, Washington, DC, Jun 1994. Computer Society Press.

[5] R. Brunelli and D. Falavigna. Person identification using multiple cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(10):955-966, October 1995.

[6] T. Choudhury, B. Clarkson, T. Jebara, and A. Pentland. Multimodal Person Recognition using Unconstrained Audio and Video. In Proc. of AVBPA'99, Washington DC, pages 176-180, 1999.

[7] Benoit Duc, Elizabeth Saers Bigün, Josef Bigün, Gilbert Maitre, and Stefan Fischer. Fusion of audio and video information for multi modal person authentication. Pattern Recognition Letters, 18(9):835-843, 1997.

[8] D. Gibbon, R. Moore, and R. Winski, editors. Handbook of Standards and Resources for Spoken Language Systems. Mouton de Gruyter, 1997.

[9] F. R. Hampel, E. M. Ronchetti, P.J. Rousseeuw, and W.A. Stahel. Robust Statistics. John Wiley, 1986.

[10] J. Kittler and A. Hojjatoleslami. A weighted combination of classifiers employing shared and distinct representations. In Proc. Conference on CVPR, pages 924-929, 1998.

[11] K. Jonsson, J. Matas, and J. Kittler. Fast face localisation and verification by optimised robust correlation. Technical report, U. of Surrey, Guildford, Surrey, United Kingdom, 1997.

[12] P. Jourlin, J. Luettin, D. Genoud, and H. Wassner. Acoustic-labial speaker verification. Pattern Recognition Letters, 18(9):853-858, 1997.

[13] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671-680, May 1983.

[14] J. Kittler, M. Hatef, R.P.W. Duin, and J. Matas. On Combining Classifiers. IEEE PAMI, 20(3):226-239, 1998.

[15] J. Kittler. Combining classifiers: A theoretical framework. Pattern Analysis and Applications, 1:18-27, 1998.

[16] J. Luettin and G. Maitre. Evaluation protocol for the extended M2VTS database (XM2VTSDB). Technical Report IDIAP-COM 98-05, IDIAP, 1998.

[17] J. Matas, K. Jonsson, and J. Kittler. Fast face localisation and verification. In A. Clark, editor, British Machine Vision Conference, pages 152-161. BMVA Press, 1997.

[18] D.A. Reynolds, R.C. Rose, and M.J.T. Smith. PC-based TMS320C30 implementation of the Gaussian mixture model text-independent speaker recognition system. In ICSPAT, DSP Associates, pages 967-973, 1992.

[19] A. E. Rosenberg, C. H. Lee, and S. Gokcen. Connected word talker verification using whole word hidden Markov model. In ICASSP-91, pages 381-384, 1991.

[20] V. Vapnik. Statistical Learning Theory. Wiley Inter-Science, 1998.

[21] J. Zhang, Y. Yan, and M. Lades. Face recognition: Eigenfaces, elastic matching, and neural nets. Proceedings of IEEE, 85:1422-1435, 1997.
