EURASIP Journal on Applied Signal Processing 2002:11, 1248–1259
© 2002 Hindawi Publishing Corporation

A Support Vector Machine-Based Dynamic Network for Visual Speech Recognition Applications

Mihaela Gordan
Department of Informatics, Aristotle University of Thessaloniki, Box 451, Thessaloniki 54006, Greece
Email: mihag@zeus.csd.auth.gr

Constantine Kotropoulos
Department of Informatics, Aristotle University of Thessaloniki, Box 451, Thessaloniki 54006, Greece
Email: costas@zeus.csd.auth.gr

Ioannis Pitas
Department of Informatics, Aristotle University of Thessaloniki, Box 451, Thessaloniki 54006, Greece
Email: pitas@zeus.csd.auth.gr

Received 26 November 2001 and in revised form 26 July 2002

Visual speech recognition is an emerging research field. In this paper, we examine the suitability of support vector machines for visual speech recognition. Each word is modeled as a temporal sequence of visemes corresponding to the different phones realized. One support vector machine is trained to recognize each viseme and its output is converted to a posterior probability through a sigmoidal mapping. To model the temporal character of speech, the support vector machines are integrated as nodes into a Viterbi lattice. We test the performance of the proposed approach on a small visual speech recognition task, namely the recognition of the first four digits in English. The word recognition rate obtained is at the level of the previous best reported rates.

Keywords and phrases: visual speech recognition, mouth shape recognition, visemes, phonemes, support vector machines, Viterbi lattice.

1. INTRODUCTION

Audio-visual speech recognition is an emerging research field where multimodal signal processing is required. The motivation for using visual information in speech recognition lies in the fact that human speech production is bimodal by nature. In particular, human speech is produced by the vibration of the vocal cords and depends on the configuration of the articulatory organs, such as the nasal cavity, the tongue, the teeth, the velum, and the lips. A speaker produces speech using these articulatory organs together with the muscles that generate facial expressions. Because some of the articulators, such as the tongue, the teeth, and the lips, are visible, there is an inherent relationship between the acoustic and the visible speech. As a consequence, speech can be partially recognized from the information carried by the visible articulators involved in its production, and in particular from the image region comprising the mouth [1, 2, 3].

Undoubtedly, the most useful information for speech recognition is carried by the acoustic signal. When the acoustic speech is clean, performing visual speech recognition and integrating the recognition results from both modalities does not bring much improvement, because the recognition rate from the acoustic information alone is very high, if not perfect. However, when the acoustic speech is degraded by noise, adding the visual information to the acoustic one improves the recognition rate significantly. Under noisy conditions, it has been shown that the use of both modalities for speech recognition is equivalent to a gain of 12 dB in the signal-to-noise ratio of the acoustic signal [1]. For large vocabulary speech recognition tasks, the visual signal can also provide a performance gain when it is integrated with the acoustic signal, even in the case of clean acoustic speech [4].
Visual speech recognition refers to the task of recognizing the spoken words based only on the visual examination of the speaker's face. This task is also referred to as lipreading, since the most important visible part of the face examined for information extraction during speech is the mouth area. Different shapes of the mouth (i.e., different mouth openings and different positions of the teeth and tongue) realized during speech cause the production of different sounds. We can establish a correspondence between the mouth shape and the phone produced, even if this correspondence is not one-to-one but one-to-many, due to the involvement of invisible articulatory organs in speech production. For small vocabulary word recognition tasks, we can perform good quality speech recognition using the visual information conveyed by the mouth shape only.

Several methods have been reported in the literature for visual speech recognition. The adopted methods vary widely with respect to: (1) the feature types, (2) the classifier used, and (3) the class definition. For example, Bregler and Omohundro [5] used time delayed neural networks (TDNN) for visual classification and the outer lip contour coordinates as visual features. Luettin and Thacker [6] used active shape models to represent the different mouth shapes and gray level distribution profiles (GLDPs) around the outer and/or inner lip contours as feature vectors, and finally built whole-word hidden Markov model (HMM) classifiers for visual speech recognition. Movellan [7] also employed HMMs to build the visual word models, but he used the gray levels of the mouth images directly as features, after simple preprocessing that exploits the vertical symmetry of the mouth. In recent works, Movellan et al. [8] have reported very good results when partially observable stochastic differential equation (SDE) models are integrated in a network as visual speech classifiers instead of HMMs, and Gray et al. [9] have presented a comparative study of a series of different features based on principal component analysis (PCA) and independent component analysis (ICA) in an HMM-based visual speech recognizer.

Despite the variety of existing strategies for visual speech recognition, there is still ongoing research in this area attempting to: (1) find the most suitable features and classification techniques to discriminate effectively between the different mouth shapes, while keeping in the same class the mouth shapes produced by different individuals that correspond to one phone; (2) require minimal processing of the mouth image to allow for a real-time implementation of the mouth shape classifier; (3) facilitate the easy integration of audio and video speech recognition modules [1].

In this paper, we contribute to the first two of the aforementioned aspects in visual speech recognition by examining the suitability of support vector machines (SVMs) for visual speech recognition tasks. The idea is based on the fact that SVMs have proved to be powerful classifiers in various pattern recognition applications, such as face detection, face verification/recognition, and so forth [10, 11, 12, 13, 14, 15]. Very good results in audio speech recognition using SVMs were recently reported in [16]. No attempts at applying SVMs to visual speech recognition have been reported so far. To the best of the authors' knowledge, the use of SVMs as visual speech classifiers is a novel idea.
One of the reasons that partially explains why SVMs have not been exploited in automatic speech recognition so far is that they are inherently static classifiers, while speech is a dynamic process where the temporal information is essential for recognition. A solution to this problem was presented in [16], where a combination of HMMs with SVMs is proposed. In this paper, a similar strategy is adopted. We will use Viterbi lattices to dynamically create visual word models.

The approaches for building the word models can be classified into those where whole-word models are developed [6, 7, 16] and those where viseme-oriented word models are derived [17, 18, 19]. In this paper, we adopt the latter approach because it is more suitable for an SVM implementation and offers the advantage of an easy generalization to large vocabulary word recognition tasks without a significant increase in storage requirements. It also keeps the dictionary of basic visual models needed for word modeling within a reasonable size. The word recognition rate obtained is at the level of the best previously reported rates in the literature, although we do not attempt to learn the state transition probabilities. When very simple features (i.e., pixels) are used, our word recognition rate is superior to the ones reported in the literature. Accordingly, SVMs are a promising alternative for visual speech recognition, and this observation encourages further research in that direction. It is well known that the Morton-Massaro law (MML) holds when humans integrate audio and visual speech [20]. Experiments have demonstrated that the MML holds also for audio-visual speech recognition systems. That is, the audio and visual speech signals may be treated as if they were conditionally independent without significant loss of information about speech categories [20]. This observation supports the independent treatment of audio and visual speech and yields an easy integration of the visual speech recognition module and the acoustic speech recognition module.

The paper is organized as follows. In Section 2, a short overview of SVM classifiers is presented. We review the concepts of visemes and phonemes in Section 3. We discuss the proposed SVM-based approach to visual speech recognition in Section 4. Experimental results obtained when the proposed system is applied to a small vocabulary visual speech recognition task (i.e., the visual recognition of the first four digits in English) are described in Section 5 and compared to other results published in the literature. Finally, in Section 6, our conclusions are drawn and future research directions are identified.

2. OVERVIEW ON SVMS AND THEIR APPLICATIONS IN PATTERN RECOGNITION

SVMs constitute a principled technique to train classifiers that stems from statistical learning theory [21, 22]. Their root is the optimal hyperplane algorithm. They minimize a bound on the empirical error and the complexity of the classifier at the same time. Accordingly, they are capable of learning in sparse high-dimensional spaces with relatively few training examples. Let {x_i, y_i}, i = 1, 2, ..., N, denote N training examples, where x_i comprises an M-dimensional pattern and y_i is its class label. Without loss of generality, we confine ourselves to the two-class pattern recognition problem, that is, y_i ∈ {−1, +1}. We agree that y_i = +1 is assigned to positive examples, whereas y_i = −1 is assigned to counterexamples.
The data to be classified by the SVM might or might not be linearly separable in their original domain. If they are separable, then a simple linear SVM can be used for their classification. However, the power of SVMs is demonstrated better in the nonseparable case, when the data cannot be separated by a hyperplane in their original domain. In the latter case, we can project the data into a higher-dimensional Hilbert space and attempt to separate them linearly there using kernel functions. Let Φ denote a nonlinear map Φ : R^M → H, where H is a higher-dimensional Hilbert space. SVMs construct the optimal separating hyperplane in H. Therefore, their decision boundary is of the form

\[
f(x) = \operatorname{sign}\Bigl(\sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b\Bigr), \qquad (1)
\]

where K(z_1, z_2) is a kernel function that defines the dot product between Φ(z_1) and Φ(z_2) in H, and α_i are the nonnegative Lagrange multipliers associated with the quadratic optimization problem that aims to maximize the distance between the two classes measured in H subject to the constraints

\[
w^{T}\Phi(x_i) + b \ge +1 \quad \text{for } y_i = +1, \qquad
w^{T}\Phi(x_i) + b \le -1 \quad \text{for } y_i = -1, \qquad (2)
\]

where w and b are the parameters of the optimal separating hyperplane in H. That is, w is the normal vector to the hyperplane, |b|/‖w‖ is the perpendicular distance from the hyperplane to the origin, and ‖w‖ denotes the Euclidean norm of the vector w. The use of kernel functions eliminates the need for an explicit definition of the nonlinear mapping Φ, because the data appear in the training algorithm of the SVM only through dot products of their mappings. Frequently used kernel functions are the polynomial kernel K(x_i, x_j) = (m x_i^T x_j + n)^q and the radial basis function (RBF) kernel K(x_i, x_j) = exp(−γ‖x_i − x_j‖²).
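To make (1) concrete, the following is a minimal Python sketch of the decision function for an already trained SVM, together with the two kernels just mentioned; the names and default parameter values are illustrative and are not those of the system described later.

```python
import numpy as np

def poly_kernel(x, z, m=1.0, n=1.0, q=3):
    # Polynomial kernel K(x, z) = (m * x^T z + n)^q
    return (m * np.dot(x, z) + n) ** q

def rbf_kernel(x, z, gamma=0.5):
    # RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_decision(x, support_vectors, alphas, labels, b, kernel=poly_kernel):
    # f(x) = sum_i alpha_i * y_i * K(x, x_i) + b, cf. (1) without the sign function
    return sum(a * y * kernel(x, xi)
               for a, y, xi in zip(alphas, labels, support_vectors)) + b
```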
In the following, we omit the sign function from the decision boundary (1), which simply turns the optimal separating hyperplane into an indicator function.

To enable the use of SVM classifiers in visual speech recognition, where we model speech as a temporal sequence of symbols corresponding to the different phones produced, we employ the SVMs as nodes in a Viterbi lattice. However, the nodes of such a Viterbi lattice should generate the posterior probabilities of the corresponding symbols being emitted [23], and standard SVMs do not provide such probabilities as output. Several solutions have been proposed in the literature to map the SVM output to probabilities: the cosine decomposition proposed by Vapnik [21], the probabilistic approximation obtained by applying the evidence framework to SVMs [24], and the sigmoidal approximation by Platt [25]. Here we adopt the solution proposed by Platt [25], since it is simple and was already used in a similar application of SVMs to audio speech recognition [16]. Platt showed that, given a trained SVM, we can convert its output to a probability by training the parameters a_1 and a_2 of a sigmoidal mapping function, and that this produces a good mapping from SVM margins to probabilities. In general, the class-conditional densities on either side of the SVM hyperplane are exponential. So, Bayes' rule [26] applied to two exponentials suggests the use of the following parametric form of a sigmoidal function:

\[
P\bigl(y = +1 \mid f(x)\bigr) = \frac{1}{1 + \exp\bigl(a_1 f(x) + a_2\bigr)}, \qquad (3)
\]

where

(i) y is the label for x, given by the sign of f(x) (y = +1 if and only if f(x) > 0),
(ii) f(x) is the output of an SVM classifier for the feature vector x to be classified,
(iii) a_1 and a_2 are the parameters of the sigmoidal mapping to be derived for the currently trained SVM under consideration, with a_1 < 0.

P(y = −1 | f(x)) could be defined similarly. However, since each SVM represents only one data category (i.e., the positive examples), we are interested only in the probability given by (3). The latter equation gives directly the posterior probability to be used in a Viterbi lattice.

The parameters a_1 and a_2 are derived from a training set (f(x_i), y_i) using maximum likelihood estimation. In the adopted approach, we use the training set of the SVM, (x_i, y_i), i = 1, 2, ..., N, to estimate the parameters of the sigmoidal function. The estimation starts with the definition of a new training set, (f(x_i), t_i), i = 1, 2, ..., N, where t_i are the target probabilities. The target probabilities are defined as follows.

(i) When a positive example (i.e., y_i = +1) is observed at a value f(x_i), we assume that this example is probably in the class represented by the SVM, but there is still a small finite probability ε_+ of getting the opposite label at the same f(x_i) for some out-of-sample data. Thus, t_i = t_+ = 1 − ε_+.
(ii) When a negative example (i.e., y_i = −1) is observed at a value f(x_i), we assume that this example is probably not in the class represented by the SVM, but there is still a small finite probability ε_− of getting the opposite label at the same f(x_i) for some out-of-sample data. Thus, t_i = t_− = ε_−.

Denote by N_+ the number of positive examples in the training set (x_i, y_i), i = 1, 2, ..., N. Let N_− be the number of negative examples in the training set. We set t_+ = 1 − ε_+ = (N_+ + 1)/(N_+ + 2) and t_− = ε_− = 1/(N_− + 2). The parameters a_1 and a_2 are found by minimizing the negative log likelihood of the training data, which is a cross-entropy error function given by

\[
\mathcal{E}(a_1, a_2) = -\sum_{i=1}^{N} \bigl[\, t_i \log p_i + (1 - t_i)\log(1 - p_i) \,\bigr], \qquad (4)
\]

where

\[
t_i =
\begin{cases}
t_+ & \text{for } y_i = +1, \\
t_- & \text{for } y_i = -1,
\end{cases}
\qquad (5)
\]

\[
p_i = \frac{1}{1 + \exp\bigl(a_1 f(x_i) + a_2\bigr)}. \qquad (6)
\]

In (4) and (6), p_i, i = 1, 2, ..., N, is the value of the sigmoidal mapping for the training example x_i, where f(x_i) is the real-valued output of the SVM for this example. Due to the negative sign of a_1, p_i tends to 1 if x_i is a positive example (i.e., f(x_i) > 0) and to 0 if x_i is a negative example (i.e., f(x_i) < 0).
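As a rough illustration of how a_1 and a_2 can be estimated, the sketch below minimizes the cross-entropy (4) with the targets t_+ and t_- by plain gradient descent. Platt's original procedure [25] uses a more robust model-trust minimizer, so this is only a simplified stand-in with illustrative names.

```python
import numpy as np

def fit_platt_sigmoid(f_vals, y_labels, lr=1e-3, iters=10000):
    # Fit a1, a2 of P(y = +1 | f) = 1 / (1 + exp(a1*f + a2)) by gradient
    # descent on the cross-entropy error (4), using the targets t+ and t-.
    f_vals = np.asarray(f_vals, dtype=float)
    y_labels = np.asarray(y_labels, dtype=int)
    n_pos = int(np.sum(y_labels == +1))
    n_neg = int(np.sum(y_labels == -1))
    t = np.where(y_labels == +1,
                 (n_pos + 1.0) / (n_pos + 2.0),   # t+ for positive examples
                 1.0 / (n_neg + 2.0))             # t- for negative examples
    a1, a2 = -1.0, 0.0                            # start with a1 < 0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(a1 * f_vals + a2))
        # dE/da1 = sum((t - p) * f),  dE/da2 = sum(t - p)
        a1 -= lr * np.sum((t - p) * f_vals)
        a2 -= lr * np.sum(t - p)
    return a1, a2

def svm_posterior(f_val, a1, a2):
    # Posterior probability of the viseme class given the SVM output, cf. (3).
    return 1.0 / (1.0 + np.exp(a1 * f_val + a2))
```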
3. VISEMES AND PHONEMES

3.1. Phonetic word description

The basic units of acoustic speech are the phones. Roughly speaking, a phone is an acoustic realization of a phoneme, a theoretical unit for describing how speech conveys linguistic meaning. The acoustic realization of a phoneme depends on the speaker's characteristics, the word context, and so forth. The variations in the pronunciation of the same phoneme are called allophones. In the technical literature, a clear distinction between phones and phonemes is seldom made. In this paper, we deal with speech recognition in English, so we focus on this particular case. The number of phones in the English language varies in the literature [27, 28]. Usually, about 10–15 vowels or vowel-like phones and 20–25 consonants are considered. The most commonly used computer-based phonetic alphabet for American English is the ARPABET, which consists of 48 phones [2]. To convert the orthographic transcription of a word in English to its phonetic transcription, we can use the publicly available Carnegie Mellon University (CMU) pronunciation dictionary [29]. The CMU pronunciation dictionary uses a subset of the ARPABET consisting of 39 phones. For example, the CMU phonetic transcription of the word "one" is "W-AH-N".

3.2. The concept of viseme

Similarly to the acoustic domain, we can define the basic unit of speech in the visual domain, the viseme. In general, in the visual domain, we observe the image region of the speaker's face that contains the mouth. Therefore, the concept of viseme is usually defined in relation to the mouth shape and the mouth movements. An example where the concept of viseme is related to the mouth dynamics is the viseme OW, which represents the movement of the mouth from a position close to O to a position close to W [2]. In such a case, to represent a viseme, we need to use a video sequence, a fact that would complicate the processing of the visual speech to some extent. Fortunately, however, most visemes can be approximately represented by stationary mouth images. Two examples of visemes defined in relation to the mouth shape during the production of the corresponding phones are given in Figure 1.

Figure 1: (a) Mouth shape during the realization of phone /O/; (b) mouth shape during the realization of phone /F/, by the subject Anthony in the Tulips1 database [7].

3.3. Phoneme to viseme mappings

To be able to perform visual speech recognition, ideally we would like to define for each phoneme its corresponding viseme. In this way, each word could be unambiguously described according to its pronunciation in the visual domain. Unfortunately, invisible articulatory organs are also involved in speech production, which renders the mapping of phonemes to visemes many-to-one. Thus, there are phonemes that cannot be distinguished in the visual domain. For example, the phonemes /P/, /B/, and /M/ are all produced with a closed mouth and are visually indistinguishable, so they are represented by the same viseme. We also have to consider the dual aspect, corresponding to the concept of allophones in the acoustic domain: the same viseme can have different realizations, represented by different mouth shapes, due to speaker variability and context.

Unlike the phonemes, in the case of visemes there are no viseme tables commonly accepted by all researchers [1], although several attempts in this direction have been undertaken. For example, it is commonly agreed that the visemes of the English consonants can be grouped into 9 distinct groups, as in Table 1 [1].

Table 1: The most used viseme groupings for the English consonants [1].

Viseme group index | Corresponding consonants
1 | /F/; /V/
2 | /TH/; /DH/
3 | /S/; /Z/
4 | /SH/; /ZH/
5 | /P/; /B/; /M/
6 | /W/
7 | /R/
8 | /G/; /K/; /N/; /T/; /D/; /Y/
9 | /L/

To obtain the viseme groupings, the confusions in stimulus-response matrices measured on an experimental basis are analyzed. In such experiments, subjects are asked to visually identify syllables in a given context, such as vowel-consonant-vowel (V-C-V) words. Then, the stimulus-response matrices are tabulated and the visemes are identified as those clusters of phonemes in which at least 75% of all responses occur.
This strategy will lead to a systematic and application-independent mapping of phonemes to visemes. Average linkage hierarchical clustering [18] and self-organizing maps [17] have been employed to group visually similar phonemes based on geometric features. Similar techniques could be applied to raw images of mouth regions as well. However, in this paper, we do not resort to such strategies, because our main goal is the evaluation of the proposed visual speech recognition method. Thus, we define only those visemes that are strictly needed to represent the visual realization of the small vocabulary used in our application and manually classify the training images into a number of predefined visemes, as explained in Section 5.

4. THE PROPOSED APPROACH TO VISUAL SPEECH RECOGNITION

Depending on the approach used to model the spoken words in the visual domain, we can classify the existing visual speech recognition systems into systems using word-oriented models and those using viseme-oriented models [4]. In this paper, we develop viseme-oriented models. Visemic-based lipreading was investigated also in [17, 18]. Each visual word model can then be represented as a temporal sequence of visemes. Thus, the structure of the visual word modeling and recognition system can be regarded as a two-level structure.

(1) At the first level, we build the viseme classes, one class of mouth images for each viseme defined. This implies the formulation of the mouth shape recognition problem as a pattern recognition problem. The patterns to be recognized are the mouth shapes, symbolically represented as visemes. In our approach, the classification of mouth shapes into viseme classes is formulated as a two-class (binary) pattern recognition problem and there is one SVM dedicated to each viseme class.

(2) At the second level, we build the abstract visual word models described as temporal sequences of visemes. The visual word models are implemented by means of Viterbi lattices, where each node generates the emission probability of a certain viseme at one particular time instant.

Notice that the aforementioned two-level approach is very similar to some techniques employed for acoustic speech recognition [16], justifying our expectation that the proposed method will ensure an easy integration of the visual speech recognition subsystem with a similar acoustic speech recognition subsystem. In this section, we focus on the first level of the proposed algorithm for visual speech modeling and recognition. The second level involves the development of the visual symbolic sequential word models using the Viterbi lattices. The latter level is discussed only in principle.

4.1. Formulation of visual speech recognition as a pattern recognition problem

The problem of discriminating between different mouth shapes during speech production can be viewed as a pattern recognition problem. In this case, the set of patterns is a set of feature vectors {x_i}, i = 1, 2, ..., P, each of them describing some mouth shape. The feature vector x_i is a representation of the mouth image. The feature vector x_i can represent the mouth image at a low level (i.e., the gray levels from a rectangular image region containing the mouth). It can comprise geometric parameters (i.e., mouth width, height, perimeter, etc.) or the coefficients of a linear transformation of the mouth image. All the feature vectors from the set have the same number of components M.
Denote the pattern classes by C_j, j = 1, 2, ..., Q, where Q is the total number of classes. Each class C_j is a group of patterns that represent mouth shapes corresponding to one viseme. A network of Q parallel SVMs is designed, where each SVM is trained to classify test patterns into class C_j or its complement. We should slightly deviate from the notation introduced in Section 2, because a test pattern could be assigned to more than one class. It is convenient to represent the class label of a test pattern x_k by a (Q × 1) vector y_k whose jth element, y_kj, admits the value 1 if x_k ∈ C_j and −1 otherwise. More than one element of y_k may take the value 1, since f_j(x_k) > 0 may hold for several j, where f_j(x_k) is the decision function of the jth SVM. To derive an unambiguous classification, we use SVMs with probabilistic outputs, that is, the output of the jth SVM classifier is the posterior probability that the test pattern x_k belongs to the class C_j, P(y_j = 1 | f_j(x_k)), given by (3).

This pattern recognition problem can be applied to visual speech recognition in the following way:
(i) each unknown pattern represents the image of the speaker's face at a certain time instant;
(ii) each class label represents one viseme.

Accordingly, we can determine the probability that each viseme is produced at any time instant in the spoken sequence. This provides the solution required at the first level of the proposed visual speech recognition system, which is then passed to the second level. The network of Q parallel SVMs is shown in Figure 2.

Figure 2: Illustration of the parallel network of binary classifiers for viseme recognition (input: the visual feature vector x_k; outputs: P(y_j = 1 | f_j(x_k)), j = 1, ..., Q).
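A minimal sketch of this first level, reusing the svm_decision and svm_posterior helpers from the earlier sketches (again with illustrative names): the Q per-viseme posteriors for a feature vector are simply collected into a vector that is later fed to the lattice.

```python
import numpy as np

def viseme_posteriors(x, svms):
    """
    svms: list of Q tuples (support_vectors, alphas, labels, b, a1, a2),
          one per viseme class, as produced by the earlier sketches.
    Returns the Q-vector of posteriors P(y_j = 1 | f_j(x)).
    """
    return np.array([svm_posterior(svm_decision(x, sv, al, yl, b), a1, a2)
                     for (sv, al, yl, b, a1, a2) in svms])
```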
4.2. The basic structure of the SVM network for visual speech recognition

The phonetic transcription represents each word by a left-to-right sequence of phonemes. Moreover, the visemic model corresponding to the phonetic model of a word can be easily derived using a phoneme-to-viseme mapping. However, this representation shows only which visemes are present in the pronunciation of the word, not the duration of each viseme. Let T_i, i = 1, 2, ..., S, denote the duration of the ith viseme in a word model of S visemes. Let T be the duration of the video sequence that results from the pronunciation of this word. In order to align the video sequence of duration T with the symbolic visemic model of S visemes, we can create a temporal Viterbi lattice [23] containing as many states as there are frames in the video sequence, that is, T. Such a Viterbi lattice, corresponding to the pronunciation of the word "one," is depicted in Figure 3. In this example, the visemes present in the word pronunciation are denoted with the same symbols as the underlying phones.

Figure 3: A temporal Viterbi lattice for the pronunciation of the word "one" in a video sequence of 5 frames (vertical axis: visemic symbolic model W, AH, N; horizontal axis: temporal frames 1–5).

Let D be the total number of visemic models defined for the words in the vocabulary. Each visemic model w_d, d = 1, 2, ..., D, has its own Viterbi lattice. Each node in the lattice of Figure 3 is responsible for the generation of one observation that belongs to a certain class at each time instant. Let l_k ∈ {1, 2, ..., Q} be the class label of the observation o_k generated at time instant k. Let us denote the emission probability of that observation by b_{l_k}(o_k). Each solid line between any two nodes in the lattice represents a transition probability between two states. Denote by a_{l_k, l_{k+1}} the transition probability from the node corresponding to the class l_k at time instant k to the node corresponding to the class l_{k+1} at time instant k + 1. The class labels l_k and l_{k+1} may or may not be different. Having a video sequence of T frames for a word and a Viterbi lattice for each visemic word model w_d, d = 1, 2, ..., D, we can compute the probability that the visemic word model w_d is realized following a path ℓ in the Viterbi lattice as

\[
p_{d,\ell} = \prod_{k=1}^{T} b_{l_k}(o_k) \cdot \prod_{k=1}^{T-1} a_{l_k, l_{k+1}}. \qquad (7)
\]

The probability that the visemic word model w_d is realized can then be computed by

\[
p_d = \max_{\ell = 1, \dots, \mathcal{L}} p_{d,\ell}, \qquad (8)
\]

where \mathcal{L} is the number of all possible paths in the lattice. Among the words that can be realized following any possible path in any of the D Viterbi lattices, the word described by the model whose probability p_d, d = 1, 2, ..., D, is maximum (i.e., the most probable word) is finally recognized.

In the visual speech recognition approach discussed in this paper, the emission probability b_{l_k}(o_k) is given by the corresponding SVM, SVM_{l_k}. To a first approximation, we assume equal transition probabilities a_{l_k, l_{k+1}} between any two states. Accordingly, it is sufficient to take into account only the probabilities b_{l_k}(o_k), k = 1, 2, ..., T, in the computation of the path probabilities p_{d,ℓ}, which yields the simplified equation

\[
p_{d,\ell} = \prod_{k=1}^{T} b_{l_k}(o_k). \qquad (9)
\]

Of course, learning the probabilities a_{l_k, l_{k+1}} from word models would yield a more refined modeling. This could be a topic of future work.
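Under the equal-transition-probability simplification (9), scoring a visemic word model against a test sequence reduces to a standard Viterbi pass over a left-to-right lattice, and (8) is then taken over a word's alternative models. The sketch below is one way to implement this step; all names are illustrative, log-probabilities are used only to avoid numerical underflow, and viseme indices are assumed to be 0-based column indices into the per-frame posterior array.

```python
import numpy as np

def best_path_score(frame_probs, word_model):
    """
    frame_probs: array of shape (T, Q); frame_probs[k, j] is the posterior
        P(viseme j | frame k) produced by the j-th SVM through (3).
    word_model:  left-to-right sequence of viseme indices for one visemic model.
    Returns the best-path probability of (9), maximised over all alignments
    of the S visemes to the T frames.
    """
    T = frame_probs.shape[0]
    S = len(word_model)
    log_b = np.log(frame_probs[:, word_model] + 1e-12)   # (T, S) emission log-probs
    dp = np.full((T, S), -np.inf)
    dp[0, 0] = log_b[0, 0]                               # the path starts in the first viseme
    for k in range(1, T):
        for j in range(S):
            stay = dp[k - 1, j]
            advance = dp[k - 1, j - 1] if j > 0 else -np.inf
            dp[k, j] = max(stay, advance) + log_b[k, j]
    return np.exp(dp[T - 1, S - 1])                      # and ends in the last viseme

def recognize(frame_probs, word_models):
    # word_models: dict mapping each word to its alternative visemic models.
    # The most probable word is the one maximising p_d of (8).
    scores = {w: max(best_path_score(frame_probs, m) for m in models)
              for w, models in word_models.items()}
    return max(scores, key=scores.get), scores
```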
5. EXPERIMENTAL RESULTS

To evaluate the recognition performance of the proposed SVM-based visual speech recognizer, we chose the task of recognizing the first four digits in English. Towards this end, we used the small audiovisual database Tulips1 [7], frequently used in similar visual speech recognition experiments. While the number of words is small, this database is challenging due to the differences in illumination conditions, ethnicity, and gender of the subjects. We must also mention that, despite the small number of words pronounced in the Tulips1 database compared to vocabularies for real-world applications, the portion of the English phonemes covered by these four words is fairly large: 10 out of the 48 appearing in the ARPABET table, that is, approximately 20%. Since we use viseme-oriented models and the visemes are actually just representations of phonemes in the visual domain, we can consider the results described in this section significant.

Solving the proposed task requires first the design of a particular visual speech recognizer according to the strategy presented in Section 4. The design involves the following steps:

(1) define the phoneme-to-viseme mapping;
(2) build the SVM network;
(3) train the SVMs for viseme classification;
(4) generate and implement the word models as Viterbi lattices.

Then, we use the trained visual speech recognizer to assess its recognition performance on test video sequences.

5.1. Experimental protocol

We start the design of the visual speech recognizer with the definition of the viseme classes for the first four digits in English. We first obtain the phonetic transcriptions of the first four digits in English using the CMU pronunciation dictionary [29]:

"one" → "W-AH-N"
"two" → "T-UW"
"three" → "TH-R-IY"
"four" → "F-AO-R".

We then try to define the viseme classes so that
(i) a viseme class includes as few phonemes as possible;
(ii) we have as few different visual realizations of the same viseme as possible.

The definition of the viseme classes was based on the visual examination of the video part of the Tulips1 database. The clustering of the different mouth images into viseme classes was done manually on the basis of the visual similarity of these images. By this procedure, we obtained the viseme classes described in Table 2 and the phoneme-to-viseme mapping given in Table 3.

Table 2: Viseme classes defined for the four words of the Tulips1 database [7].

Viseme group index | Symbolic notation | Viseme description
1 | (W) | Small-rounded open mouth state
2 | (AO) | Larger-rounded open mouth state
3 | (WAO) | Medium-rounded open mouth state
4 | (AH) | Medium ellipsoidal mouth state
5 | (N) | Medium open, not rounded, mouth state; teeth visible
6 | (T) | Medium open, not rounded, mouth state; teeth and tongue visible
7 | (TH) | Medium open, not rounded
8 | (IY) | Longitudinal open mouth state
9 | (F) | Almost closed mouth position; upper teeth visible, lower lip moved inside

Table 3: Phoneme-to-viseme mapping used in the experiments conducted on the Tulips1 database [7].

Viseme group index | Corresponding phonemes
1, 2, or 3 (depending on speaker's pronunciation) | /W/, /UW/, /AO/
1 or 3 (depending on speaker's pronunciation) | /R/
4 | /AH/
5 | /N/
6 | /T/
7 | /TH/
8 or 4 (depending on speaker's pronunciation) | /IY/
9 | /F/

We have to define and train one SVM for each viseme. To employ SVMs, we should define the features used to represent each mouth image and select the kernel function. Since the recognition and generalization performance of each SVM is strongly influenced by the selection of the kernel function and the kernel parameters, we devoted much attention to these issues. We trained each SVM using the linear, the polynomial, and the RBF kernel functions. In the case of the polynomial kernel, the degree q of the polynomial was varied between 2 and 6. For each trained SVM, we compared the predicted error, precision, and recall on the training set, as computed by SVMLight [30], for the different kernels and kernel parameters. We finally selected the simplest kernel yielding the best values for these estimates. That kernel was the polynomial kernel of degree q = 3. The RBF kernel gave the same performance estimates as the polynomial kernel of degree q = 3 on the training set, but at the cost of a larger number of support vectors.

A simple choice of feature vector, such as the collection of the gray levels from a rectangular region of fixed size containing the mouth, scanned row by row, has proved suitable whenever SVMs have been used for visual classification tasks [15]. More specifically, we used two types of features to conduct the visual speech recognition experiments.

(i) The first type comprised the gray levels of a rectangular region of interest around the mouth, downsampled to the size 16 × 16. Each mouth image is represented by a feature vector of length 256.
(ii) The second type represented each mouth image frame at time T_f by a vector of double size (i.e., 512) that comprised the gray levels of the rectangular region of interest around the mouth, downsampled to the size 16 × 16 as previously, together with the temporal derivatives of the gray levels, normalized to the range [0, L_max − 1], where L_max is the maximum gray level value in the mouth image. The temporal derivatives are simply the pixel-by-pixel gray level differences between the frames T_f and T_f − 1. These differences are the so-called delta features.
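The paper does not spell out the exact resampling scheme or the precise normalization of the temporal derivatives. The sketch below is therefore only one plausible way of assembling the 256- and 512-dimensional feature vectors, assuming a grayscale mouth region of interest of at least 16 × 16 pixels as input; the block-averaging downsampling, the delta rescaling, and the names are illustrative choices, not the authors' implementation.

```python
import numpy as np

def downsample_to_16x16(roi):
    # Block-average the mouth ROI to a 16 x 16 grid (ROI assumed >= 16 x 16).
    h, w = roi.shape
    ys = np.arange(16 + 1) * h // 16
    xs = np.arange(16 + 1) * w // 16
    out = np.empty((16, 16))
    for i in range(16):
        for j in range(16):
            out[i, j] = roi[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
    return out

def frame_features(roi_t, roi_prev=None, l_max=255):
    # 256 gray-level features; optionally 256 extra "delta" features, i.e. the
    # pixelwise difference to the previous frame shifted into [0, l_max - 1]
    # (one plausible reading of the normalization described in the text).
    gray = downsample_to_16x16(roi_t)
    if roi_prev is None:
        return gray.ravel()
    delta = (gray - downsample_to_16x16(roi_prev) + (l_max - 1)) / 2.0
    return np.concatenate([gray.ravel(), delta.ravel()])
```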
Some preprocessing of the mouth images was needed before training and testing the visual speech recognition system. It concerns the normalization of the mouth in scale, rotation, and position inside the image. Such preprocessing is needed because the mouth has a different scale, position in the image, and orientation with respect to the horizontal axis from utterance to utterance, depending on the subject and on his or her position in front of the camera. To compensate for these variations, we applied the normalization procedure of mouth images with respect to scale, translation, and rotation described in [6].

The visual speech recognizer was tested for speaker-independent recognition using the leave-one-out testing strategy over the 12 subjects of the Tulips1 database. This implies training the visual speech recognizer 12 times, each time using only 11 subjects for training and leaving the 12th out for testing. In each case, we first trained the SVMs and then the sigmoidal mappings for converting the SVM outputs to probabilities. The training set for each SVM in each system configuration was defined manually. Only the video sequences from the so-called Set 1 of the Tulips1 database were used for training. The labeling of all the frames from Set 1 (a total of 48 video sequences) was done manually by visual examination of each frame. We labeled all the frames according to Table 3, except the transition frames between two visemes, initially denoting the same viseme class differently for each subject. Finally, we compared the similarity of the frames corresponding to the same viseme and different subjects and decided whether the classes could be merged. The disadvantage of this approach is the large time needed for labeling, which would not be needed if HMMs were used for segmentation. A compromise solution for labeling could be the use of an automatic procedure for phoneme-level segmentation of the audio sequence and the use of this segmentation on the aligned video sequence as well. Once the labeling was done, only the unambiguous positive and negative viseme examples were included in the training sets. The feature vectors used in the training sets of all SVMs were the same; only their labeling as positive or negative examples differs from one SVM to another. This leads to an unbalanced training set, in the sense that the negative examples are frequently more numerous than the positive ones.

The configuration of the Viterbi lattice depends on the length of the test sequence through the number of frames T_test of the sequence (as illustrated in Figure 3). It was generated automatically at runtime for each test sequence. The number of Viterbi lattices can be determined in advance, because it is equal to the total number of visemic word models. Thus, taking into account the phonetic descriptions of the four words of the vocabulary and the phoneme-to-viseme mappings in Table 3, we have 3 visemic word models for the word "one," 3 models for "two," 4 models for "three," and 6 models for "four." The multiple visemic models per word are due to the variability in the speakers' pronunciation.
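As an illustration of how these model counts arise, the sketch below expands a CMU-style phone sequence into all of its alternative visemic word models using the one-to-many mapping of Table 3; the dictionary simply restates that table and the function name is illustrative.

```python
from itertools import product

# Phoneme -> candidate viseme groups, following Table 3 (the alternatives
# reflect speaker-dependent pronunciations).
PHONE_TO_VISEMES = {
    "W": (1, 2, 3), "UW": (1, 2, 3), "AO": (1, 2, 3),
    "R": (1, 3),
    "AH": (4,), "N": (5,), "T": (6,), "TH": (7,),
    "IY": (8, 4), "F": (9,),
}

def visemic_models(phones):
    # Expand a CMU-style phone sequence into all alternative visemic models.
    options = [PHONE_TO_VISEMES[p] for p in phones]
    return [list(m) for m in product(*options)]

# "one" = W-AH-N -> 3 models, "two" = T-UW -> 3, "three" = TH-R-IY -> 4,
# "four" = F-AO-R -> 6, matching the counts quoted in the text.
print(len(visemic_models(["W", "AH", "N"])))   # 3
print(len(visemic_models(["F", "AO", "R"])))   # 6
```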
In each of the 12 leave-one-out tests, the test sequences are the video sequences corresponding to the pronunciations of the four words, and there are two pronunciations available per word and speaker. This leads to a subtotal of 8 test sequences per system configuration, and a total of 12 × 8 = 96 test sequences for the visual speech recognizer.

The complete visual speech recognizer was implemented in C++. We used the publicly available SVMLight toolkit modules for the training of the SVMs [30]. We implemented in C++ the module for learning the sigmoidal mappings of the SVM outputs to probabilities and the module for generating the Viterbi lattice models based on SVMs with probabilistic outputs. All these modules were integrated into the visual speech recognition system, whose architecture is structured into two modules: the training module and the test module. Two visual speech recognizers were implemented, trained, and tested with the aforementioned strategy. They differ in the type of features used. The first system (without delta features) did not include temporal derivatives in the feature vector, while the second (with delta features) also included temporal derivatives between two frames in the feature vector.

5.2. Performance evaluation

In this section, we present the experimental results obtained with the proposed system with and without delta features. Moreover, we compare these results to others reported in the literature for the same experiment on the Tulips1 database. The word recognition rates (WRR) have been averaged over the 96 tests obtained by applying the leave-one-out principle. Five figures of merit are provided.

(1) The WRR per subject, obtained by the proposed method when delta features are used, is measured and compared to that of Luettin and Thacker [6] (Table 4).
(2) The overall WRR for all subjects and pronunciations, with and without delta features, is reported and compared to those obtained by Luettin and Thacker [6], Movellan [7], Gray et al. [9], and Movellan et al. [8] (Table 5).
(3) The confusion matrix between the words actually presented to the classifier and the words recognized is shown in Table 6 and compared to the average human confusion matrix [7] (Table 7), in percentages.
(4) The accuracy of the viseme segmentations resulting from the Viterbi lattices.
(5) The 95% confidence intervals for the WRRs of the several systems included in the comparisons (Table 8), which provide an estimate of the performance of the systems for a much larger number of subjects.

Table 4: WRR for each subject in Tulips1 using: (a) the SVM dynamic network with delta features; (b) active appearance model (AAM) for inner and outer lip contours and HMM with delta features [6].

Subject | Accuracy [%] (SVM-based dynamic network) | Accuracy [%] (AAM&HMM [6])
1 | 100 | 75
2 | 87.5 | 100
3 | 87.5 | 100
4 | 75 | 87.5
5 | 100 | 100
6 | 100 | 87.5
7 | 75 | 100
8 | 100 | 100
9 | 100 | 62.5
10 | 75 | 87.5
11 | 100 | 87.5
12 | 87.5 | 100

Table 5: The overall WRR of the SVM dynamic network compared to that of other techniques.

Method | WRR [%]
SVM-based dynamic network without delta features | 76
SVM-based dynamic network with delta features | 90.6
AAM and HMM, shape + intensity, inner + outer lip contour, without delta features [6] | 87.5
AAM and HMM, shape + intensity, inner + outer lip contour, with delta features [6] | 90.6
HMMs [7] without delta features | 60
HMMs [7] with delta features | 89.93
Global PCA and HMMs [9] | 79.2
Global ICA and HMMs [9] | 74
Blocked filter bank PCA/ICA (local) [9] | 85.4
Unblocked filter bank PCA/ICA (local) [9] | 91.7
Diffusion network, shape + intensity [8] | 91.7

We would like to note that human subjects untrained in lipreading achieved, under similar experimental conditions, a WRR of 89.93%, whereas the hearing impaired had an average performance of 95.49% [7]. From the examination of Table 5, it can be seen that our WRR is equal to the best rate reported in [6] and just 1.1% below the recently reported rates in [8, 9]. However, the features used in the proposed method are simpler than those used with HMMs to obtain the same or higher WRRs. For the shape + intensity models of [6], the gray levels should be sampled in the exact subregion of the mouth image containing the lips and around the inner and outer lip contours, excluding the skin areas. Accordingly, the method reported in [6] requires the tracking of the lip contour in each frame, which increases the processing time of visual speech recognition. The method reported in [9] needs a large amount of local processing, namely the use of a bank of linear shift-invariant filters with unblocked selection whose response filters are ICA or PCA kernels of very small size (12 × 12 pixels). The obtained WRR is also higher than those reported in [7], where similar features are used, namely the gray levels of the region of interest (ROI) comprising the mouth after some simple preprocessing steps. The preprocessing in [7] was vertical symmetry enforcement of the mouth image by averaging, followed by low-pass filtering, subsampling, and thresholding.

Another measure of the performance assessment is given by comparing the confusion matrix of the proposed system (Table 6) with the average human confusion matrix provided in [7] (Table 7).

Table 6: Confusion matrix for visual word recognition by the dynamic network of SVMs with delta features (rows: digit recognized; columns: digit presented).

Digit recognized | One | Two | Three | Four
One | 95.83% | 0.00% | 16.66% | 0.00%
Two | 0.00% | 95.83% | 12.5% | 0.00%
Three | 0.00% | 4.17% | 70.83% | 0.00%
Four | 4.17% | 0.00% | 0.00% | 100%

Table 7: Average human confusion matrix [7] (rows: digit recognized; columns: digit presented).

Digit recognized | One | Two | Three | Four
One | 89.36% | 1.39% | 9.25% | 4.17%
Two | 0.46% | 98.61% | 3.24% | 0.46%
Three | 8.33% | 0.00% | 85.64% | 1.85%
Four | 1.85% | 0.00% | 1.87% | 93.52%

The accuracy of the viseme segmentation that results from the best Viterbi lattices was computed using, as reference, the manually performed segmentation of frames into the viseme classes (Table 3), as the percentage of correctly classified frames. We obtained an accuracy of 89.33%, which is just 1.27% lower than the WRR.

The results obtained demonstrate that the SVM-based dynamic network is a very promising alternative to the existing methods for visual speech recognition. An improvement of the WRR is expected when the training of the transition probabilities is implemented and the trained transition probabilities are incorporated in the Viterbi decoding lattices.
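For reference, the WRR and the confusion matrices of Tables 6 and 7 are straightforward tallies over the 96 test utterances, with percentages normalized per digit presented; a minimal sketch with illustrative names, assuming every digit occurs among the presented words:

```python
from collections import Counter

def confusion_and_wrr(results, words=("one", "two", "three", "four")):
    # results: list of (presented, recognized) pairs over the test utterances.
    counts = Counter(results)
    presented_totals = Counter(p for p, _ in results)
    # Percentages normalized per presented digit, as in Tables 6 and 7
    # (rows: recognized, columns: presented).
    confusion = {r: {p: 100.0 * counts[(p, r)] / presented_totals[p] for p in words}
                 for r in words}
    wrr = 100.0 * sum(counts[(w, w)] for w in words) / len(results)
    return confusion, wrr
```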
To assess the statistical significance of the rates observed, we model the ensemble {test patterns, recognition algorithm} as a source of binary events, 1 for correct recognition and 0 for an error, with a probability p of drawing a 1 and (1 − p) of drawing a 0. These events can be described by Bernoulli trials. We denote by p̂ the estimate of p. The exact ε confidence interval of p is the segment between the two roots of the quadratic equation [31]

\[
(p - \hat{p})^{2} = \frac{z_{(1+\epsilon)/2}^{2}}{K}\, p\,(1 - p), \qquad (10)
\]

where z_u is the u percentile of the standard Gaussian distribution having zero mean and unit variance, and K = 96 is the total number of tests conducted. We computed the 95% confidence intervals (ε = 0.95) for the WRR of the proposed approach and also for the WRRs reported in the literature [6, 7, 8, 9], as summarized in Table 8.

Table 8: 95% confidence interval for the WRR of the proposed system compared to that of other techniques.

Method | Confidence interval [%]
SVM-based dynamic network without delta features | [66.6, 83.5]
SVM-based dynamic network with delta features | [83.1, 94.7]
AAM and HMM, shape + intensity, inner + outer lip contour, without delta features [6] | [79.4, 92.7]
AAM and HMM, shape + intensity, inner + outer lip contour, with delta features [6] | [83.1, 94.7]
HMMs [7] without delta features | [49.9, 69.2]
HMMs [7] with delta features | [82.3, 94.5]
Global PCA & HMMs [9] | [70.0, 86.1]
Global ICA & HMMs [9] | [64.4, 81.7]
Blocked filter bank PCA/ICA (local) [9] | [76.9, 91.1]
Unblocked filter bank PCA/ICA (local) [9] | [84.4, 95.7]
Diffusion network, shape + intensity [8] | [84.4, 95.7]
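The interval endpoints in Table 8 can be obtained by solving the quadratic (10) for p; a minimal sketch (illustrative helper name, Python 3.8+ for statistics.NormalDist):

```python
from math import sqrt
from statistics import NormalDist

def exact_confidence_interval(p_hat, k, eps=0.95):
    # Roots of (p - p_hat)^2 = (z^2 / k) * p * (1 - p), cf. (10),
    # with z the (1 + eps)/2 percentile of the standard Gaussian.
    z = NormalDist().inv_cdf((1.0 + eps) / 2.0)
    c = z * z / k
    a, b, d = 1.0 + c, -(2.0 * p_hat + c), p_hat * p_hat
    disc = sqrt(b * b - 4.0 * a * d)
    return (-b - disc) / (2.0 * a), (-b + disc) / (2.0 * a)

# Example: 87 correct recognitions out of K = 96 tests (WRR about 90.6%).
print(exact_confidence_interval(87 / 96, 96))   # approximately (0.83, 0.95)
```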
5.3. Estimation of the SVM structure complexity

The complexity of the SVM structure can be estimated by the number of SVM evaluations needed for the classification of each word, as a function of the number of frames T in the current word pronunciation. For the experiments reported here, if we take into account the total number of symbolic word models, that is, 16, and the number of possible states as a function of the frame index, we get: 6 SVMs for the classification of the first frame, 7 for the second one, 8 for the frame before the last, 6 for the last one, and 9 SVMs for each of the remaining frames. This leads to a total of 9 × T − 9 SVM evaluations. As we can see, the number of SVM outputs to be estimated at each time instant is not large. Therefore, the recognition could be done in real time, since the number of frames per word is generally small (on the order of 10). Of course, when scaling the system to a large vocabulary continuous speech recognition (LVCSR) application, a significantly larger number of context-dependent viseme SVMs will be required, thus affecting both training and recognition complexity.

6. CONCLUSIONS

In this paper, we proposed a new method for a visual speech recognition task. We employed SVM classifiers and integrated them into a Viterbi decoding lattice. Each SVM output was converted to a posterior probability, and then the SVMs with probabilistic outputs were integrated into Viterbi lattices as nodes. We tested the proposed method on a small visual speech recognition task, namely the recognition of the first four digits in English. The features used were the simplest possible, that is, the raw gray level values of the mouth image and their temporal derivatives. Under these circumstances, we obtained a word recognition rate that competes with that of the state-of-the-art methods. Accordingly, SVMs are found to be promising classifiers for visual speech recognition tasks. The existing relationship between the phonetic and visemic models can also lead to an easy integration of the visual speech recognizer with its audio counterpart.

In our future research, we will try to improve the performance of the visual speech recognizer by training the state transition probabilities of the Viterbi decoding lattice. Another topic of interest for our future research would be the integration of this type of visual recognizer with an SVM-based audio recognizer to perform audio-visual speech recognition.

ACKNOWLEDGMENT

This work was supported by the European Union Research Training Network "Multimodal Human-Computer Interaction," Project No. HPRN-CT-2000-00111. Mihaela Gordan is on leave from the Technical University of Cluj-Napoca, Faculty of Electronics and Telecommunications, Basis of Electronics Department, Cluj-Napoca, Romania.

REFERENCES

[1] T. Chen, "Audiovisual speech processing," IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 9–21, 2001.
[2] T. Chen and R. R. Rao, "Audio-visual integration in multimodal communication," Proceedings of the IEEE, vol. 86, no. 5, pp. 837–852, 1998.
[3] C. Benoît, T. Lallouache, T. Mohamadi, and C. Abry, "A set of French visemes for visual speech synthesis," in Talking Machines: Theories, Models, and Designs, G. Bailly and C. Benoît, Eds., pp. 485–504, Elsevier-North Holland, Amsterdam, 1992.
[4] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, and D. Vergyri, "Large-vocabulary audio-visual speech recognition: a summary of the Johns Hopkins summer 2000 workshop," in Proc. IEEE Workshop Multimedia Signal Processing, pp. 619–624, Cannes, France, 2001.
[5] C. Bregler and S. Omohundro, "Nonlinear manifold learning for visual speech recognition," in Proc. IEEE International Conf. on Computer Vision, pp. 494–499, Cambridge, Mass, USA, 1995.
[6] J. Luettin and N. A. Thacker, "Speechreading using probabilistic models," Computer Vision and Image Understanding, vol. 65, no. 2, pp. 163–178, 1997.
[7] J. R. Movellan, "Visual speech recognition with stochastic networks," in Advances in Neural Information Processing Systems, G. Tesauro, D. Touretzky, and T. Leen, Eds., vol. 7, pp. 851–858, MIT Press, Cambridge, Mass, USA, 1995.
[8] J. R. Movellan, P. Mineiro, and R. J. Williams, "Partially observable SDE models for image sequence recognition tasks," in Advances in Neural Information Processing Systems, T. Leen, T. G. Dietterich, and V. Tresp, Eds., vol. 13, pp. 880–886, MIT Press, Cambridge, Mass, USA, 2001.
[9] M. S. Gray, T. J. Sejnowski, and J. R. Movellan, "A comparison of image processing techniques for visual speech recognition applications," in Advances in Neural Information Processing Systems, T. Leen, T. G. Dietterich, and V. Tresp, Eds., vol. 13, pp. 939–945, MIT Press, Cambridge, Mass, USA, 2001.
[10] Y. Li, S. Gong, and H. Liddell, "Support vector regression and classification based multi-view face detection and recognition," in Proc. 4th IEEE Int. Conf. Automatic Face and Gesture Recognition, pp. 300–305, Grenoble, France, 2000.
[11] T.-J. Terrillon, M. N. Shirazi, M. Sadek, H. Fukamachi, and S. Akamatsu, "Invariant face detection with support vector machines," in Proc. 15th Int. Conf. Pattern Recognition, vol. 4, pp. 210–217, Barcelona, Spain, 2000.
[12] A. Tefas, C. Kotropoulos, and I. Pitas, "Using support vector machines to enhance the performance of elastic graph matching for frontal face authentication," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 23, no. 7, pp. 735–746, 2001.
[13] C. Kotropoulos, N. Bassiou, T. Kosmidis, and I. Pitas, "Frontal face detection using support vector machines and backpropagation neural networks," in Proc. 2001 Scandinavian Conf. Image Analysis (SCIA '01), pp. 199–206, Bergen, Norway, 2001.
[14] A. Fazekas, C. Kotropoulos, I. Buciu, and I. Pitas, "Support vector machines on the space of Walsh functions and their properties," in Proc. 2nd IEEE Int. Symp. Image and Signal Processing and Applications, pp. 43–48, Pula, Croatia, 2001.
[15] I. Buciu, C. Kotropoulos, and I. Pitas, "Combining support vector machines for accurate face detection," in Proc. 2001 IEEE Int. Conf. Image Processing, vol. 1, pp. 1054–1057, Thessaloniki, Greece, October 2001.
[16] A. Ganapathiraju, J. Hamaker, and J. Picone, "Hybrid SVM/HMM architectures for speech recognition," in Proc. Speech Transcription Workshop, College Park, Md, USA, 2000.
[17] A. Rogozan, "Discriminative learning of visual data for audiovisual speech recognition," International Journal on Artificial Intelligence Tools, vol. 8, no. 1, pp. 43–52, 1999.
[18] A. J. Goldschen, Continuous automatic speech recognition by lipreading, Ph.D. thesis, George Washington University, Washington, DC, USA, 1993.
[19] A. J. Goldschen, O. N. Garcia, and E. D. Petajan, "Rationale for phoneme-viseme mapping and feature selection in visual speech recognition," in Speechreading by Humans and Machines: Models, Systems, and Applications, D. G. Stork and M. E. Hennecke, Eds., pp. 505–515, Springer-Verlag, Berlin, Germany, 1996.
[20] J. R. Movellan and J. L. McClelland, "The Morton-Massaro law of information integration: Implications for models of perception," Psychological Review, vol. 108, no. 1, pp. 113–148, 2001.
[21] V. N. Vapnik, Statistical Learning Theory, John Wiley, New York, NY, USA, 1998.
[22] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, Cambridge, UK, 2000.
[23] S. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book, Entropic, Cambridge, UK, 1999, HTK version 2.2.
[24] J. T.-Y. Kwok, "Moderating the outputs of support vector machine classifiers," IEEE Trans. Neural Networks, vol. 10, no. 5, pp. 1018–1031, 1999.
[25] J. Platt, "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods," in Advances in Large Margin Classifiers, A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, Eds., MIT Press, Cambridge, Mass, USA, 2000.
[26] T. Hastie and R. Tibshirani, "Classification by pairwise coupling," The Annals of Statistics, vol. 26, no. 1, pp. 451–471, 1998.
[27] J. R. Deller, J. G. Proakis, and J. H. L. Hansen, Discrete-Time Processing of Speech Signals, Prentice-Hall, Upper Saddle River, NJ, USA, 1993.
[28] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ, USA, 1993.
[29] The Carnegie Mellon University Pronouncing Dictionary, v. 0.6, http://www.speech.cs.cmu.edu/cgi-bin/cmudict.
[30] T. Joachims, "Making large-scale SVM learning practical," in Advances in Kernel Methods—Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola, Eds., MIT Press, Cambridge, Mass, USA, 1999.
[31] A. Papoulis, Probability, Random Variables, and Stochastic Processes, McGraw-Hill, New York, NY, USA, 3rd edition, 1991.

Mihaela Gordan received the Diploma in electronics engineering in 1995 and the M.S. degree in electronics in 1996, both from the Technical University of Cluj-Napoca, Cluj-Napoca, Romania.
Currently, she is working towards her Ph.D. degree in electronics and communications at the Basis of Electronics Department of the Technical University of Cluj-Napoca, where she has served as a Teaching Assistant since 1997. Ms. Gordan has authored 30 conference and journal papers and 1 book in her area of expertise. Her current research interests include applied fuzzy logic in image processing, pattern recognition, human-computer interaction, visual speech recognition, and support vector machines. Ms. Gordan has been a student member of the IEEE and a member of the IEEE Signal Processing Society since 1999.

Constantine Kotropoulos received the Diploma degree with honors in electrical engineering in 1988 and the Ph.D. degree in electrical and computer engineering in 1993, both from the Aristotle University of Thessaloniki. Since 2002, he has been an Assistant Professor in the Department of Informatics at the Aristotle University of Thessaloniki. From 1989 to 1993, he was an assistant researcher and teacher in the Department of Electrical & Computer Engineering at the same university. In 1995, after his military service in the Greek Army, he joined the Department of Informatics at the Aristotle University of Thessaloniki as a senior researcher and then served as a Lecturer from 1997 to 2001. He has also conducted research in the Signal Processing Laboratory at Tampere University of Technology, Finland, during the summer of 1993. He is co-editor of the book "Nonlinear Model-Based Image/Video Processing and Analysis" (J. Wiley and Sons, 2001). His current research interests include multimodal human-computer interaction, pattern recognition, nonlinear digital signal processing, neural networks, and multimedia information retrieval.

Ioannis Pitas received the Diploma of electrical engineering in 1980 and the Ph.D. degree in electrical engineering in 1985, both from the University of Thessaloniki, Greece. Since 1994, he has been a Professor at the Department of Informatics, University of Thessaloniki. From 1980 to 1993, he served as Scientific Assistant, Lecturer, Assistant Professor, and Associate Professor in the Department of Electrical and Computer Engineering at the same university. He has served as a Visiting Research Associate at the University of Toronto, Canada, the University of Erlangen-Nuernberg, Germany, and Tampere University of Technology, Finland, and as a Visiting Assistant Professor at the University of Toronto. His current interests are in the areas of digital image processing, multidimensional signal processing, and computer vision. He was Associate Editor of the IEEE Transactions on Circuits and Systems and the IEEE Transactions on Neural Networks, and co-editor of Multidimensional Systems and Signal Processing, and he is currently an Associate Editor of the IEEE Transactions on Image Processing. He was Chair of the 1995 IEEE Workshop on Nonlinear Signal and Image Processing (NSIP95), Technical Chair of the 1998 European Signal Processing Conference (EUSIPCO 98), and General Chair of the 2001 IEEE International Conference on Image Processing (ICIP 2001).