A Content-Based Image Retrieval Scheme Allowing for
Robust Automatic Personalization
Sotirios Chatzis, Anastasios Doulamis, Theodora Varvarigou
Electrical and Computer Engineering Department
National Technical University of Athens
15772, Zografos, Athens, Greece
stchat@telecom.ntua.gr, adoulam@cs.ntua.gr, dora@telecom.ntua.gr

ABSTRACT
The retrieval performance of content-based image retrieval (CBIR) systems is often disappointingly low, mainly due to the subjectivity of human perception. Relevance feedback (RF) has been widely considered a powerful tool for enhancing CBIR systems by incorporating human perception subjectivity into the retrieval procedure. However, the obtained feedback logs are usually scarce and contain many outliers, undermining the effectiveness of RF adaptation. In this paper, we tackle these shortcomings by exploiting the inherent outlier-downweighting capabilities that mixtures of Student's t distributions offer. Each semantic class is modeled by a mixture of t distributions fitted to data provided by the system operators. The semantic class models are then personalized by application of a novel, efficient RF algorithm allowing for the robust adaptation of the semantic class models to the accumulated feedback of each user. The efficacy of our approach is validated through a series of experiments using objective performance criteria.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—clustering, relevance feedback, retrieval models

General Terms
Algorithms

Keywords
t distributions, mixture models, personalization

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CIVR'07, July 9–11, 2007, Amsterdam, The Netherlands
Copyright 2007 ACM 978-1-59593-733-9/07/0007 ...$5.00.

1. INTRODUCTION

The unprecedented upsurge in multimedia databases has established multimedia information retrieval as an important research topic for many computer science communities [17]. One of the key aspects of multimedia information retrieval is content-based image retrieval (CBIR). CBIR systems face two key challenges stemming from the high-level semantic nature and the subjectivity of the way humans perceive the content of images. The first is the semantic gap between low-level visual features and high-level human perception [17]. Humans perceive the content of images in terms of high-level semantic concepts. Despite extensive efforts, formulating techniques and mathematical models that effectively extract and represent this type of information from the visual attributes of image pixels is extremely laborious, and a general-scope approach has yet to be proposed [20]. The second major challenge concerns the subjectivity of human perception. The way humans determine the content of an image is a rather nebulous procedure. Characteristic of its ill-defined nature is the fact that the same individual might perceive the same semantic entities differently at different times, let alone the case where different individuals are considered [14]. Hence, personalization is one of the most important functions in designing successful CBIR systems, providing the mechanisms to make a system adaptable to the individual perception of its users [3].

Relevance feedback [13] provides a feasible means to mitigate the semantic gap between low-level image features and high-level semantic concepts, exploiting user-provided information to create successful mappings from low-level image features to high-level semantic concepts. Furthermore, relevance feedback allows for the effective resolution of the human perception subjectivity issue, enabling the personalization of CBIR systems. Such personalization can be attained by adapting the retrieval models and criteria individually to the feedback provided by each user. Relevance feedback techniques for CBIR systems have evolved [7] from early heuristic weighting techniques [14], to optimal learning [5], and to the more recent machine learning techniques (e.g., [11, 19]).

The majority of the proposed relevance feedback techniques for CBIR systems regard the problem as a strict two-class classification problem, treating positive and negative examples equally. Although it is reasonable to assume that positive examples of a semantic class follow a
common distribution, this is not the case with negative examples, which usually belong to multiple classes. Hence, it is almost impossible to estimate the real distribution of negative images based on the relevance feedback. Extensive efforts have been made towards the attenuation of this shortcoming (e.g., [21], [4]). One recently proposed, promising alternative is Gaussian mixture modeling of the distribution of relevant images. In [11], each semantic class is represented by a Gaussian mixture model (GMM) fitted using positive examples accumulated through user feedback. In [18], negative examples are also utilized to apply a similarity metric adaptation strategy.

GMMs have been widely used by the pattern recognition and machine learning communities. Their popularity stems from the fact that they provide a sound statistical framework for the approximation of unknown, non-Gaussian distributions, including distributions with multiple modes. However, GMMs suffer from a significant drawback: the estimation of their parameters can be severely affected by the presence of outliers in the data sample used. Providing protection against outliers in multivariate data is a very difficult problem, and its difficulty increases with the dimension of the data [12]. The replacement of Gaussian distributions with the longer-tailed Student's t distributions has recently been proposed as a way to overcome these hurdles [12, 16], providing a mathematically sound mechanism for effective outlier downweighting within a well-founded statistical context.

Relevance feedback algorithms assume that their users are consistent when performing their relevance judgements. However, user consistency is extremely difficult to achieve. In general, users tend to provide limited and conflicting judgements during relevance feedback iterations, resulting in scarce feedback logs containing many outliers, which might severely affect the efficacy of a GMM-based CBIR framework. To effectively address these shortcomings, we propose in this paper the use of mixtures of t densities to represent the distributions of the images corresponding to the considered semantic classes. Further, we apply a novel relevance feedback algorithm to adapt these models to the user-provided feedback. Performing this procedure separately for each user, we achieve the robust adaptation of the retrieval procedure of our CBIR system to the perception of each user, mitigating the semantic gap issue in conjunction with the effective and efficient personalization of our CBIR system. We evaluate the efficacy of our methodology using a real-life database on the basis of objective performance criteria. The remainder of this paper is organized as follows: In Section 2, we briefly provide the theoretical background of tMMs. In Section 3, we provide a comprehensive description of the proposed CBIR framework. In Section 4, the experimental evaluation of our system is conducted and discussed. Section 5 concludes this paper.

2. THEORETICAL BACKGROUND: MIXTURE MODELS OF t DISTRIBUTIONS

2.1 t-distributed Mixture Models (tMMs)

We let X_1, ..., X_n denote a random sample of size n on a p-dimensional random vector. We assume that each observation X_j of the random sample X_1, X_2, ..., X_n is generated by a set (mixture) of t distributions, consisting of g components, with (prior) probabilities c_i (i = 1, ..., g), means \mu_i, positive definite inner product matrices \Sigma_i, and \nu_i degrees of freedom, i.e.,

    X_j \sim t(\mu_i, \Sigma_i, \nu_i) \text{ with probability } c_i    (1)

and hence

    f(x_j; \Theta) = \sum_{i=1}^{g} c_i f(x_j; \mu_i, \Sigma_i, \nu_i)    (2)

where t(\mu, \Sigma, \nu) is a t distribution with mean \mu, inner product matrix \Sigma, and \nu degrees of freedom; the pdf of t(\mu, \Sigma, \nu) is given by

    f(x_j; \mu, \Sigma, \nu) = \frac{\Gamma(\frac{\nu+p}{2}) \, |\Sigma|^{-1/2}}{(\pi\nu)^{p/2} \, \Gamma(\nu/2) \, \{1 + d(x_j, \mu; \Sigma)/\nu\}^{(\nu+p)/2}}    (3)

where d(x_j, \mu; \Sigma) is the squared Mahalanobis distance

    d(x_j, \mu; \Sigma) = (x_j - \mu)^T \Sigma^{-1} (x_j - \mu)

and \Theta comprises the c_i, the \nu_i, and the elements of \mu_i and \Sigma_i. Alternatively, using the properties of t distributions, we obtain that

    X_j | u_j \sim N(\mu_i, \Sigma_i / u_j)    (4)

where the scalar U_j is a random variable such that U_j \sim \Gamma(\nu/2, \nu/2), \Gamma(\alpha, \beta) is the gamma distribution with pdf p(u; \alpha, \beta) = \beta^{\alpha} u^{\alpha-1} e^{-\beta u} / \Gamma(\alpha), and N(\mu, \Sigma) stands for a normal distribution with mean \mu and covariance matrix \Sigma. The value u_j of the random variable U_j is used as the weighting factor of the data during the estimation procedure of the model parameters and is, hence, the factor downweighting the outliers in the model parameter estimation.

2.2 ML Estimation of tMM Parameters

The maximum likelihood (ML) treatment of a tMM comprises the calculation of the ML estimator \hat{\Theta} of the model parameter vector \Theta given a random sample of fitting data. The Expectation-Maximization (EM) algorithm is a powerful iterative procedure for carrying out the ML treatment of statistical models. The ML treatment of tMMs has been conducted by Peel et al. in [12].

The EM algorithm comprises the maximization of an intermediate quantity, the conditional expected value of the complete-data log-likelihood, given the random sample x = (x_1, ..., x_n). Here, for each j = 1, ..., n, the datum x_j is viewed as a partial observation of the "complete" data, and we let the missing data be the scalars u_1, ..., u_n, and the component-indicator vectors z_1, ..., z_n, where z_j = (z_{ij}) and z_{ij} = 1 if X_j is viewed as deriving from the i-th component density of the model, z_{ij} = 0 otherwise. Then, the complete-data log-likelihood is given by

    \log L_c(\Theta) = \sum_{i=1}^{g} \sum_{j=1}^{n} z_{ij} \log\{c_i f(x_j; \mu_i, \Sigma_i, \nu_i)\}    (5)

2.2.1 E-step

The E-step on the (k+1)-th iteration requires the calculation of the conditional expectation Q(\Theta; \Theta^{(k)}) of the complete-data log-likelihood (5) given the random sample
x, using the current estimator \Theta^{(k)} for \Theta:

    Q(\Theta; \Theta^{(k)}) = E(\log L_c(\Theta) \,|\, x; \Theta^{(k)})    (6)

As has been shown in [12], the estimation of this quantity reduces to the computation of the component-distribution membership posterior probabilities of the data

    r_{ij}^{(k)} = E(Z_{ij} \,|\, x_j; \Theta^{(k)}) = \frac{c_i^{(k)} f(x_j; \mu_i^{(k)}, \Sigma_i^{(k)}, \nu_i^{(k)})}{\sum_{h=1}^{g} c_h^{(k)} f(x_j; \mu_h^{(k)}, \Sigma_h^{(k)}, \nu_h^{(k)})}    (7)

and of the posterior expected values of the scalars U_j given the component distributions they derive from

    u_{ij}^{(k)} = E(U_j \,|\, x_j, z_{ij} = 1; \Theta^{(k)}) = \frac{\nu_i^{(k)} + p}{\nu_i^{(k)} + d(x_j, \mu_i^{(k)}; \Sigma_i^{(k)})}    (8)

2.2.2 M-step

On the M-step of the (k+1)-th iteration, the expressions of c_i^{(k+1)}, \mu_i^{(k+1)}, \nu_i^{(k+1)} and \Sigma_i^{(k+1)} are computed by maximizing Q(\Theta; \Theta^{(k)}) over each one of them. As has been shown in [12], this yields

    c_i^{(k+1)} = \sum_{j=1}^{n} r_{ij}^{(k)} / n \quad (i = 1, ..., g)    (9)

    \mu_i^{(k+1)} = \frac{\sum_{j=1}^{n} r_{ij}^{(k)} u_{ij}^{(k)} x_j}{\sum_{j=1}^{n} r_{ij}^{(k)} u_{ij}^{(k)}}    (10)

and

    \Sigma_i^{(k+1)} = \frac{\sum_{j=1}^{n} r_{ij}^{(k)} u_{ij}^{(k)} (x_j - \mu_i^{(k+1)})(x_j - \mu_i^{(k+1)})^T}{\sum_{j=1}^{n} r_{ij}^{(k)}}    (11)

while the estimator of \nu_i^{(k+1)} is the solution of the equation

    -\psi\left(\frac{\nu_i}{2}\right) + \log\left(\frac{\nu_i}{2}\right) + 1 + \frac{1}{\sum_{j=1}^{n} r_{ij}^{(k)}} \sum_{j=1}^{n} r_{ij}^{(k)} \left(\log u_{ij}^{(k)} - u_{ij}^{(k)}\right) + \psi\left(\frac{\nu_i^{(k)} + p}{2}\right) - \log\left(\frac{\nu_i^{(k)} + p}{2}\right) = 0    (12)

where \psi(s) is the digamma function, \psi(s) = \partial \log\Gamma(s) / \partial s. The solution of this equation does not exist in closed form [16]. However, a good closed-form approximation of its solution suffices for the effective and efficient estimation of a tMM. In this work we adopt a successful approximation of (12) presented in [16], which, under the assumption \nu_\eta = \nu_\zeta = \nu \; \forall \zeta \neq \eta = 1, ..., g, gives \nu as

    \nu^{(k+1)} = \frac{2}{\tau + \log\tau - 1} + 0.0416 \left(1 + \mathrm{erf}\left(0.6594 \log\frac{2.1971}{\tau + \log\tau - 1}\right)\right)    (13)

where \tau is defined as

    \tau \triangleq -\frac{1}{n} \sum_{i=1}^{g} \sum_{j=1}^{n} r_{ij}^{(k)} \left[\psi\left(\frac{\nu^{(k)} + p}{2}\right) + \log\frac{2}{\nu^{(k)} + d(x_j, \mu_i^{(k)}; \Sigma_i^{(k)})} - u_{ij}^{(k)}\right]    (14)

3. THE PROPOSED FRAMEWORK

3.1 Semantic Class Modeling and Progressive Learning Process

The images residing in the database of a CBIR system can be viewed as belonging to different semantic classes. Such semantic classes might be, for example, building, sunset, elephant, and so forth. In the proposed framework, the system operators initially define a number of semantic classes into which the images residing in the database shall be classified. Further, for each semantic class of image content, an appropriate training dataset is selected by the system operators and used to fit a tMM model, given by (2). These datasets comprise the feature vectors of images considered characteristic of each semantic class and consist of common image descriptors, such as color, texture and shape. The tMM models fitted to these data shall be referred to as the global models of the corresponding semantic classes. The estimation of the tMM model representing each semantic class is conducted under an ML framework using the EM algorithm presented in Section 2.2.

The trained tMM models are used for the classification of the images residing in our system's database into the considered semantic classes. In detail, each image residing in the database of our system is first processed, under the same procedure as the one applied to the training datasets, to extract its feature vector. Then, using the derived feature vectors, the images are classified into the considered semantic classes represented by the fitted tMMs, on the basis of a maximum a posteriori probability (MAP) classification procedure.

To conduct content-based image retrieval, the user of our system is asked to enter a query image. The system processes the query image in the same way it processes the images residing in its database to extract its feature vector, and classifies it into a semantic image class using the MAP classification methodology, as described above. Finally, the system returns to the user the top M images residing in its database that have been classified into the same semantic class as the query image, ranked in descending order of their likelihood with respect to the model of the class they belong to.

A basic aspect of the proposed CBIR system is the notion of personalization. Personalization is effected in our system by adapting the global semantic class models to the relevance feedback provided by each user. The benefits our system yields by the application of this procedure are twofold. First, the system exploits user interaction to improve its mapping of low-level visual image features to high-level semantic concepts and hence mitigate the semantic gap issue. Second, the application of the relevance feedback procedure on a per-user basis provides the effective means for the robust adaptation of the system's retrieval procedure to the individual perception of each user (system personalization).

The relevance feedback adaptation of the global models, applied to acquire the target distributions of each user, is conducted on the basis of a novel relevance feedback algorithm for tMMs, which we describe in the following subsection. The proposed relevance feedback algorithm, exploiting the merits of t distributions, offers a model-updating mechanism robust against outliers, based on a well-founded statistical concept. This way, we offer a sound mathematical framework for the attenuation of the well-known issues faced
by relevance feedback algorithms, concerning the scarcity of user feedback logs and the inconsistency of the way users provide their feedback.

3.2 Relevance Feedback Algorithm

Given a query image q, our system assigns it to a semantic image class under a MAP classification notion, as described in Section 3.1. The tMM model representing this class is further adapted to the positive examples provided by the user during the relevance feedback iterations, to acquire the user's target distribution of this semantic class. In this paper, we propose a novel algorithm for the relevance feedback adaptation of tMM models. Let us consider a tMM model representing a semantic class, given by equation (2)

    f(x_j; \Theta) = \sum_{i=1}^{g} c_i f(x_j; \mu_i, \Sigma_i, \nu_i)

with priors c_i, \nu_i degrees of freedom, means \mu_i and positive definite inner product matrices \Sigma_i. The proposed algorithm comprises the EM fitting of the model

    f(x_j; \Psi) = \sum_{i=1}^{g} c_i f(x_j; A_i \mu_i + b_i, \Sigma_i, \nu_i)    (15)

where c_i, \mu_i, \Sigma_i and \nu_i, i = 1, ..., g, are the prior probabilities, means, inner product matrices and degrees of freedom of the considered tMM model of the form (2), respectively. Considering as the complete data the observations x_j, the scalars u_1, ..., u_n, and the component-indicator vectors z_1, ..., z_n, we obtain that the complete-data log-likelihood of the model (15) is given by

    \log L_c(\Psi) = \sum_{i=1}^{g} \sum_{j=1}^{n} z_{ij} \log\{c_i f(x_j; A_i \mu_i + b_i, \Sigma_i, \nu_i)\}    (16)

where the parameter vector \Psi comprises the c_i along with the elements of the A_i and b_i, where A_i is a diagonal p x p matrix and b_i is a p x 1 vector. The estimation of \Psi is conducted by fitting the model (15) to the relevance feedback data.

Hence, we introduce an affine probabilistic model for the adaptation of the means and the priors of each component distribution of the initial tMM corresponding to a semantic class, while considering the \Sigma_i and the \nu_i fixed to their initial values. Our approach of updating only the prior probabilities and the means of the component distributions aims to allow for the robust and efficient adaptation of the initial tMMs, given the very limited number of relevance feedback data, avoiding overfitting effects. It is motivated by results from the speech processing research literature, indicating that the major discriminative information of a mixture model is retained by the mean vectors rather than the covariance matrices [9], and also by the consideration that the determination of the exact value of the component-distribution degrees of freedom is not of vital importance for the effective estimation of a tMM, especially in cases of large values for the degrees of freedom [16].

Let us suppose that the relevance feedback data provided by some user regarding a semantic class of image content comprise the feature vectors X_1, ..., X_n. Then, the proposed relevance feedback algorithm for the adaptation of the corresponding tMM model to these data comprises the EM fitting of the model (15) to the provided relevance feedback (data X_1, ..., X_n) and the subsequent replacement of the initial model (2) means and prior probabilities (mixing proportions) by the newly estimated ones.

3.2.1 E-Step

The E-step on the (k+1)-th iteration of the relevance feedback adaptation of the tMM model (2) requires the calculation of the intermediate quantity Q(\Psi; \Psi^{(k)}), where

    Q(\Psi; \Psi^{(k)}) = E(\log L_c(\Psi) \,|\, x; \Psi^{(k)})    (17)

is the conditional expectation of the complete-data log-likelihood (16) given the random sample x, using \Psi^{(k)} for \Psi. From (17) and (16) we obtain (see Appendix A) that the estimation of (17) reduces to the computation of the component-distribution membership posterior probabilities of the relevant data

    r_{ij}^{(k)} = \frac{c_i^{(k)} f(x_j; A_i^{(k)} \mu_i + b_i^{(k)}, \Sigma_i, \nu_i)}{\sum_{h=1}^{g} c_h^{(k)} f(x_j; A_h^{(k)} \mu_h + b_h^{(k)}, \Sigma_h, \nu_h)}    (18)

and of the posterior expected values of the scalars U_j of these data given the component distributions they derive from

    u_{ij}^{(k)} = \frac{\nu_i + p}{\nu_i + d(x_j, A_i^{(k)} \mu_i + b_i^{(k)}; \Sigma_i)}    (19)

3.2.2 M-Step

On the M-step of the (k+1)-th iteration of the relevance feedback adaptation, the expressions of c_i^{(k+1)}, A_i^{(k+1)} and b_i^{(k+1)} are computed by maximizing Q(\Psi; \Psi^{(k)}) over each one of them. It can be shown that this procedure yields (see Appendix B)

    c_i^{(k+1)} = \sum_{j=1}^{n} r_{ij}^{(k)} / n \quad (i = 1, ..., g)    (20)

    b_i^{(k+1)} = \frac{\sum_{j=1}^{n} r_{ij}^{(k)} u_{ij}^{(k)} (x_j - A_i^{(k)} \mu_i)}{\sum_{j=1}^{n} r_{ij}^{(k)} u_{ij}^{(k)}}    (21)

    A_i^{(k+1)} = \mathrm{diag}\left\{\left[\sum_{j=1}^{n} r_{ij}^{(k)} u_{ij}^{(k)} (x_j - b_i^{(k+1)}) \mu_i^T\right] \left[\sum_{j=1}^{n} r_{ij}^{(k)} u_{ij}^{(k)} \mu_i \mu_i^T\right]^{-1}\right\}    (22)

3.2.3 Model Parameters Update

After the convergence of the EM fitting of model (15) to the logged feedback data, the parameters of the initial tMM of type (2), modeling the semantic class under consideration, are updated as follows:

1. The component-distribution priors are updated to the newly computed ones (eq. (20)).

2. The means \mu_i are updated to \hat{A}_i \mu_i + \hat{b}_i, where \hat{A}_i and \hat{b}_i are the estimators obtained by the RF model fitting (eqs. (21), (22)).

3. All the other parameters of the initial model remain fixed to their initial values.
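The adaptation procedure of Sections 3.2.1–3.2.3 can be sketched in code. The following is a minimal illustrative implementation, not the authors' code: the function names are hypothetical, the t density is implemented directly from eq. (3), and eq. (22) is solved per dimension under the diagonal-A_i constraint (which assumes every component of \mu_i is nonzero).

```python
import numpy as np
from math import lgamma

def t_pdf(x, m, Sigma, nu):
    """Multivariate Student's t density of eq. (3), evaluated row-wise on x (n x p)."""
    p = len(m)
    diff = x - m
    d2 = np.einsum('nj,jk,nk->n', diff, np.linalg.inv(Sigma), diff)  # Mahalanobis^2
    log_const = (lgamma((nu + p) / 2) - lgamma(nu / 2)
                 - (p / 2) * np.log(np.pi * nu)
                 - 0.5 * np.log(np.linalg.det(Sigma)))
    return np.exp(log_const) * (1 + d2 / nu) ** (-(nu + p) / 2)

def rf_adapt_tmm(x, c, mu, Sigma, nu, n_iter=100, tol=1e-5):
    """EM fitting of model (15) to feedback data x (n x p): only the priors c_i
    and the affine transform (A_i, b_i) of the means are re-estimated, while
    Sigma_i and nu_i stay fixed. Returns the updated priors and means."""
    n, p = x.shape
    g = len(c)
    c = np.asarray(c, dtype=float)
    A = [np.eye(p) for _ in range(g)]        # diagonal scaling matrices
    b = [np.zeros(p) for _ in range(g)]      # offset vectors
    Sinv = [np.linalg.inv(S) for S in Sigma]
    prev_ll = -np.inf
    for _ in range(n_iter):
        dens = np.empty((n, g))
        u = np.empty((n, g))
        for i in range(g):
            m = A[i] @ mu[i] + b[i]          # adapted component mean
            dens[:, i] = c[i] * t_pdf(x, m, Sigma[i], nu[i])
            diff = x - m
            d2 = np.einsum('nj,jk,nk->n', diff, Sinv[i], diff)
            u[:, i] = (nu[i] + p) / (nu[i] + d2)      # eq. (19)
        r = dens / dens.sum(axis=1, keepdims=True)    # eq. (18)
        c = r.sum(axis=0) / n                         # eq. (20)
        for i in range(g):
            w = r[:, i] * u[:, i]
            b[i] = w @ (x - A[i] @ mu[i]) / w.sum()   # eq. (21)
            # eq. (22), solved per dimension for the diagonal A_i
            s = w @ (x - b[i])
            A[i] = np.diag(s * mu[i] / (w.sum() * mu[i] ** 2))
        ll = np.log(dens.sum(axis=1)).sum()           # observed-data log-likelihood
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return c, [A[i] @ mu[i] + b[i] for i in range(g)]
```

Note how the weights u_ij enter every M-step sum: feedback outliers receive small u_ij and are automatically downweighted, which is the robustness mechanism the t mixture provides over a GMM.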
4. EXPERIMENTAL EVALUATION

4.1 Experimental Setup

A subset of the database of the National Technical University of Athens, comprising still images in JPEG format, is used for conducting our experiments. The overall data set consists of around 5,000 images covering a wide variety of content. All data have been annotated by domain professionals and classified into 12 categories according to their content, namely: spaceships, tigers and lions, cars, dogs, fishes, cats, airplanes, buildings, boats and ships, birds, flags and coins. Fig. 1 illustrates some randomly selected images from the categories "Birds", "Cars" and "Airplanes" of our database.

Two different types of descriptors are used for the representation of the visual content of the images: global-based descriptors, referring to global visual characteristics, and object-based descriptors, exploiting region-based properties obtained by applying a segmentation algorithm. The global-based image descriptors comprise global color and texture. The color features extracted in our experiments are color moments, preferred because they are close to natural human perception; their effectiveness in CBIR has been shown in many previous research studies [18]. Three different color moments are used: color mean, color variance, and color skewness in each color channel (H, S, and V), respectively. Texture information is extracted by employing the wavelet-based texture extraction technique of [10]. The considered object-based image descriptors are extracted by conducting color segmentation using a multiresolution implementation of the Recursive Shortest Spanning Tree (RSST) algorithm, preferred due to its efficiency and low computational complexity [2]. For each color segment, the average color, size and segment location are extracted as descriptors. Each of the considered features characterizes the type of image content in a unique, powerful way. These global-based and object-based features are eventually combined by our system into a feature vector, which is normalized to a standardized normal distribution.

Initially, we use 30% of the available human-annotated images, classified into the considered semantic classes, to fit a tMM per semantic class, obtaining the global models of the considered semantic classes. Further, we classify the rest of the images of our database into the considered semantic classes using the trained tMMs under a MAP classification notion, as explained in Section 3.1. Finally, we conduct a series of relevance feedback iterations regarding each semantic class to evaluate the efficacy of the proposed relevance feedback algorithm. The assessment of the retrieval performance of the proposed system is conducted using two objective evaluation criteria: the precision-recall curve and the Average Normalized Modified Retrieval Rank (ANMRR) measure.

4.2 Objective Evaluation Criteria

4.2.1 Precision-Recall Curve

The retrieval precision Pr(q) of a system with respect to a query q is defined as the ratio of the number of retrieved relevant images, N(q), over the total number of retrieved images, M(q) [15]. Given a set of Q queries, the average retrieval precision of the system is then given by

    \bar{Pr} = \frac{1}{Q} \sum_{q=1}^{Q} \frac{N(q)}{M(q)}    (23)

On the other hand, the retrieval recall Re(q) of a system with respect to a query q is the ratio of the number of retrieved relevant images, N(q), over the total number of relevant images in the database for the respective query, G(q) [15]. Given a set of Q queries, the average retrieval recall of the system is then given by

    \bar{Re} = \frac{1}{Q} \sum_{q=1}^{Q} \frac{N(q)}{G(q)}    (24)

It is a commonplace in the information retrieval literature that, in practical content-based retrieval systems, as the number of images returned to the user increases, precision decreases while recall increases. For this reason, instead of using average precision or recall as separate performance measures for CBIR systems, the precision-recall curve is usually adopted.

4.2.2 Average Normalized Modified Retrieval Rank (ANMRR) Criterion

Another popular quantitative criterion is the ANMRR measure, derived from the MPEG-7 core experiments [1]. ANMRR estimates both the number of relevant images retrieved and their ranking among the retrievals. To define the ANMRR measure, we first have to define the Average Retrieval Rank (ARR). Given a query q, the ARR measure is defined as

    ARR(q) = \sum_{i=1}^{G(q)} \frac{r(i)}{G(q)}    (25)

where r(i), i = 1, ..., G(q), is the rank of each relevant image returned in the top M retrievals, taking the value M + 1 for all missed relevant images. The measure M is defined as M = min{4 G(q), 2 GTM}, where GTM = max{G(q)} over all Q queries submitted to the system. Then, the Modified Retrieval Rank (MRR) is defined as

    MRR(q) = ARR(q) - \frac{1}{2}(G(q) + 1)    (26)

The MRR metric is further normalized to the range [0, 1], yielding the Normalized Modified Retrieval Rank (NMRR)

    NMRR(q) = \frac{MRR(q)}{M + 0.5 - 0.5 G(q)}    (27)

Finally, the Average Normalized Modified Retrieval Rank (ANMRR) is defined as the average NMRR over the set of all Q available queries, yielding an effective overall retrieval performance criterion

    ANMRR = \frac{1}{Q} \sum_{q=1}^{Q} NMRR(q)    (28)

Low values of ANMRR denote a high retrieval rate, with the relevant images ranked at the top. On the other hand, an ANMRR value equal to one represents the worst possible retrieval performance, with none of the relevant items in the database being present in the top retrievals.
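The evaluation measures of Section 4.2 can be computed directly from ranked result lists. The following is an illustrative Python sketch (the function names are ours, not from the paper):

```python
def precision_recall(retrieved, relevant):
    """Pr(q) = N(q)/M(q) and Re(q) = N(q)/G(q) for one query (cf. eqs. (23)-(24))."""
    n_rel_ret = len(set(retrieved) & set(relevant))
    return n_rel_ret / len(retrieved), n_rel_ret / len(relevant)

def anmrr(retrievals, ground_truths):
    """ANMRR over a query set (eqs. (25)-(28)).
    retrievals: ranked result-id lists, one per query;
    ground_truths: sets of relevant ids, one per query."""
    gtm = max(len(gt) for gt in ground_truths)      # GTM = max G(q) over all queries
    nmrrs = []
    for ret, gt in zip(retrievals, ground_truths):
        g_q = len(gt)
        m = min(4 * g_q, 2 * gtm)
        # rank r(i) of each relevant image within the top-M retrievals (1-based);
        # missed relevant images are assigned rank M + 1
        top = list(ret[:m])
        ranks = [top.index(i) + 1 if i in top else m + 1 for i in gt]
        arr = sum(ranks) / g_q                      # eq. (25)
        mrr = arr - 0.5 * (g_q + 1)                 # eq. (26)
        nmrrs.append(mrr / (m + 0.5 - 0.5 * g_q))   # eq. (27)
    return sum(nmrrs) / len(nmrrs)                  # eq. (28)
```

A perfect ranking yields NMRR(q) = 0, and a ranking that misses every relevant item yields NMRR(q) = 1, matching the discussion above.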
Figure 1: Representative images of three different categories of our image database. (a) "Birds" category. (b) "Cars" category. (c) "Airplanes" category.

Figure 2: (a) Precision-Recall curves and (b) Precision values versus the number of feedback iterations, for the methods presented in [6], [8] and the proposed novel method.
tation algorithm for tMMs.
Table 1: ANMRR Measure of the Proposed Scheme
Compared With Other Works for Relevance Feedback
Relevance Feedback Algorithms ANMRR
The proposed method
0.038%
The method of [6]
0.07%
The method of [8]
0.19%
4.3
The experimental evaluation of the proposed framework,
indicates the effectiveness of our method, yielding a notably
higher retrieval performance comparing to competing CBIR
techniques.
6. REFERENCES
[1] MPEG-7 Visual Part of eXperimentation Model
Version 2.0. ISO/MPEG MPEG-7 Output Document,
1999.
[2] Y. Avrithis, N. Doulamis, A. Doulamis, and S. Kollias.
Optimization methods for key-frames and scenes
extraction. Comput. Vis. Image Understanding,
75(1/2):3–24, Jul./Aug. 1999.
[3] N. Babaguchi, K. Ohara, and T. Ogura. Effect of
personalization on retrieval and summarization of
sports video. In Joint Conference of the Fourth
International Conference on Information,
Communications and Signal Processing, and the
Fourth Pacific Rim Conference on Multimedia,
volume 2, pages 940–944, 2003.
[4] Y. Chen, X. S. Zhou, and T. Huang. One-class SVM
for learning in image retrieval. In Proc. IEEE Int’l
Conf. Image Processing, 2001, volume 1, pages 34–37,
2001.
[5] I. Cox, M. L. Miller, S. M. Omohundro, and P. N.
Yianilos. Pichunter: Bayesian relevance feedback for
image retrieval. In Proc. Int. Conf. Pattern
Recognition, 1996, volume 3, pages 362–369, 1996.
[6] A. Doulamis and N. Doulamis. Generalized nonlinear
relevance feedback for interactive content-based
retrieval and organization. IEEE Trans. Circuits Syst.
Video Technol., 14(5):656–671, 2004.
[7] T. S. Huang and X. S. Zhou. Image retrieval by
relevance feedback: from heuristic weight adjustment
to optimal learning methods. In Proc. IEEE Int’l
Conf. Image Processing, 2001, volume 3, pages 2–5,
2001.
[8] Y. Ishikawa, R. Subramanya, and C. Faloutsos.
Mindreader: Query databases through multiple
examples. In Proc. 24th VLDB Conf, pages 218–227,
1998.
[9] C. J. Leggetter and P. C. Woodland. Maximum
likelihood linear regression for speaker adaptation of
continuous density hidden markov models. Computer
Speech & Language, 9(2):171–185, 1995.
[10] B. Manjunath, P. Wu, S. Newsam, and H. Shin. A
texture descriptor for browsing and similarity retrieval.
J. Signal Processing: Image Comm., 16:33–42, 2000.
[11] P. Muneesawang and L. Guan. Image retrieval with
embeded sub-class information using Gaussian
mixture models. In Proc. Int. Conference Multimedia
and Expo, 2003, volume 1, pages 769–772, 2003.
[12] D. Peel and G. J. McLachlan. Robust mixture
modeling using the t distribution. Statistics and
Computing, 10(4):339–348, 2000.
[13] Y. Rui, T. Huang, and S. Mehrotra. Content-based
image retrieval with relevance feedback in MARS. In
Proc. IEEE Int’l Conf. Image Processing, 1997,
volume 2, pages 815–818, 1997.
[14] Y. Rui, T. Huang, M. Ortega, and S. Mehrotra.
Relevance feedback: A power tool for interactive
Performance Assessment
After obtaining the global models of the semantic classes,
we conduct a series of relevance feedback adaptation rounds
to assess the performance of the proposed relevance feedbackenhanced CBIR system, as described in section 4.1. To obtain some comparative results, we conduct the same series
of relevance feedback iterations using the methods proposed
in [6] and [8]. In Fig. 2 we illustrate the obtained precisionrecall curves as well as the precision yielded by the different
methods as a function of the number of the completed relevance feedback iterations, for recall equal to 35%. Table 1
depicts the yielded ANMRR values for the examined methods.
As we we notice, the proposed tMM-based CBIR system offers superior retrieval performance comparing to its
competitors. We also mention that the proposed relevance
feedback adaptation algorithm for tMMs achieves the rapid
enhancement of the retrieval precision of our system while
requiring a minimal user interaction, both in terms of the
number of relevance feedback iterations and the number of
the provided feedback samples, completely outperforming
its competitors.
Finally, the average required number of EM-algorithm iterations, until convergence, per relevance feedback iteration,
using a convergence threshold equal to 10−5 , is 5.03 iterations, yielding a significantly low computational complexity
for the proposed relevance feedback model adaptation algorithm.
5. CONCLUSIONS
In this work we have proposed a novel probabilistic framework for content-based image retrieval. Initially, the considered semantic classes of images are modeled using mixture models of t distributions fitted to data provided by the system operators, deriving the so-called global models of the considered semantic classes. Subsequently, a novel, efficient, and robust relevance feedback algorithm is applied to adapt the global semantic class models to the feedback provided by each user. This way, the representation of the considered semantic classes is adapted to the individual perception of each user, allowing for the effective personalization of our system's retrieval criteria with minimal user interaction.
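To illustrate the robustness mechanism underlying this framework, the following minimal sketch (illustrative only; the data and parameter values are hypothetical) computes the E-step weights E(U_j | x_j) = (ν + d)/(ν + δ_j) of a single t component, where δ_j is the squared Mahalanobis distance of x_j from the component; outlying feedback samples receive small weights and thus barely perturb the adapted model:

```python
import numpy as np

def t_weights(X, mu, Sigma, nu):
    """E-step weights of a d-variate t component with nu degrees of freedom:
    w_j = (nu + d) / (nu + delta_j), delta_j the squared Mahalanobis distance."""
    d = X.shape[1]
    diff = X - mu
    delta = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(Sigma), diff)
    return (nu + d) / (nu + delta)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))        # hypothetical feedback samples
X[0] = [25.0, -25.0]               # a gross outlier
w = t_weights(X, np.zeros(2), np.eye(2), nu=3.0)
print(w)                           # the outlier's weight is far smaller
```

Because each sample's contribution to the parameter updates is scaled by its weight, heavy-tailed t components downweight outliers automatically, with no explicit outlier-rejection step.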
The major contributions of this work are:
1. the introduction of a robust relevance feedback framework for content-based image retrieval (CBIR) which, by exploiting the inherent outlier-downweighting capabilities of mixtures of t distributions, provides an effective means of resolving the outlier-vulnerability problems that usual relevance feedback algorithms suffer from;
2. the provision of an efficient relevance feedback adaptation algorithm for tMMs.
[14] Y. Rui, T. S. Huang, M. Ortega, and S. Mehrotra. Relevance feedback: A power tool for interactive content-based image retrieval. IEEE Trans. Circuits Syst. Video Technol., 8(5):644–655, 1998.
[15] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1982.
[16] S. Shoham. Robust clustering by deterministic agglomeration EM of mixtures of multivariate t distributions. Pattern Recognition, 35(5):1127–1142, 2002.
[17] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(12):1349–1380, 2000.
[18] Z. Su, H. Zhang, S. Li, and S. Ma. Relevance feedback in content-based image retrieval: Bayesian framework, feature subspaces, and progressive learning. IEEE Trans. Image Processing, 12(8):924–937, 2003.
[19] N. Vasconcelos and A. Lippman. Learning from user feedback in image retrieval systems. In Proceedings of Neural Information Processing Systems 12, volume 1, 1999.
[20] N. Vasconcelos and A. Lippman. Statistical models of video structure for content analysis and characterization. IEEE Trans. Image Processing, 9:3–19, 2000.
[21] X. S. Zhou and T. Huang. Small sample learning during multimedia retrieval using BiasMap. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, volume 1, pages 11–17, 2001.
APPENDIX
A. E-STEP
The complete data log-likelihood of model (15) can be written as
\[ \log L_c(\Psi) = \log L_{1c}(c) + \log L_{2c}(\xi) \qquad (29) \]
where
\[ \log L_{1c}(c) = \sum_{i=1}^{g} \sum_{j=1}^{n} z_{ij} \log c_i \qquad (30) \]
and
\[ \log L_{2c}(\xi) = \sum_{i=1}^{g} \sum_{j=1}^{n} z_{ij} \left\{ -\frac{u_j}{2} (x_j - A_i \mu_i - b_i)^T \Sigma_i^{-1} (x_j - A_i \mu_i - b_i) \right\} \qquad (31) \]
with \( c = (c_1, \dots, c_g) \) and \( \xi = (\xi_1, \dots, \xi_g) \), where \( \xi_i \) contains the elements of \( A_i \) and \( b_i \). This result can be easily derived by adapting the expression of the complete data log-likelihood of a mixture of t distributions, given in [12], to the context of model (15).
Thus, the conditional expectation \( Q(\Psi; \Psi^{(k)}) \) in (17) can be written as
\[ Q(\Psi; \Psi^{(k)}) = Q_1(c; \Psi^{(k)}) + Q_2(\xi; \Psi^{(k)}) \qquad (32) \]
where
\[ Q_1(c; \Psi^{(k)}) = \sum_{i=1}^{g} \sum_{j=1}^{n} r_{ij}^{(k)} \log c_i \qquad (33) \]
and, using (16) and ignoring constant terms,
\[ Q_2(\xi; \Psi^{(k)}) = \sum_{i=1}^{g} \sum_{j=1}^{n} r_{ij}^{(k)} Q_{2j}(\xi_i; \Psi^{(k)}) \qquad (34) \]
and where
\[ Q_{2j}(\xi_i; \Psi^{(k)}) = -\frac{1}{2} E\left(U_j | x_j, z_{ij}=1; \Psi^{(k)}\right) (x_j - b_i)^T \Sigma_i^{-1} (x_j - b_i) + (x_j - b_i)^T \Sigma_i^{-1} A_i \, E\left(U_j | x_j, z_{ij}=1; \Psi^{(k)}\right) \mu_i - \frac{1}{2} \operatorname{trace}\left[ A_i^T \Sigma_i^{-1} A_i \, E\left(U_j | x_j, z_{ij}=1; \Psi^{(k)}\right) \mu_i \mu_i^T \right] \qquad (35) \]

B. M-STEP
First of all, let us consider the maximization of \( Q_1(c; \Psi^{(k)}) \) under the constraint \( \sum_{i=1}^{g} c_i = 1 \). Using a Lagrange multiplier \( \lambda \) to enforce the constraint, we have
\[ \frac{\partial}{\partial c_i} \left[ Q_1 - \lambda \left( \sum_{h=1}^{g} c_h - 1 \right) \right] = \sum_{j=1}^{n} \frac{r_{ij}^{(k)}}{c_i} - \lambda = 0 \]
which gives (20).
To derive the expression of the ML estimator of \( b_i \), we have to maximize \( Q_2(\xi; \Psi^{(k)}) \) w.r.t. \( b_i \). From (34) and (35), the expression of \( Q_2(\xi; \Psi^{(k)}) \), ignoring terms not containing \( b_i \), is given by
\[ Q_2^{*}(\xi; \Psi^{(k)}) = -\frac{1}{2} \sum_{i=1}^{g} \sum_{j=1}^{n} r_{ij}^{(k)} u_{ij}^{(k)} \left[ -2 x_j^T \Sigma_i^{-1} b_i + 2 b_i^T \Sigma_i^{-1} A_i \mu_i + b_i^T \Sigma_i^{-1} b_i \right] \]
Since
\[ \frac{\partial x_j^T \Sigma_i^{-1} b_i}{\partial b_i} = \Sigma_i^{-1} x_j, \qquad \frac{\partial b_i^T \Sigma_i^{-1} b_i}{\partial b_i} = 2 \Sigma_i^{-1} b_i, \qquad \frac{\partial b_i^T \Sigma_i^{-1} A_i \mu_i}{\partial b_i} = \Sigma_i^{-1} A_i \mu_i, \]
it is easy to show that the solution of \( \partial Q_2^{*}(\xi; \Psi^{(k)}) / \partial b_i = 0 \) yields the maximizer \( b_i^{(k+1)} \), which is given by (21).
Finally, let us consider the maximization of \( Q_2(\xi; \Psi^{(k)}) \) over \( A_i \). We hence need to compute the partial derivative of \( Q_2(\xi; \Psi^{(k)}) \) with respect to \( A_i \). From (34) and (35) it follows that we have to compute
\[ \frac{\partial \left[ (x_j - b_i)^T \Sigma_i^{-1} A_i \, E\left(U_j | x_j, z_{ij}=1; \Psi^{(k)}\right) \mu_i \right]}{\partial A_i} = \Sigma_i^{-1} (x_j - b_i) \, E\left(U_j | x_j, z_{ij}=1; \Psi^{(k)}\right) \mu_i^T \qquad (36) \]
and
\[ \frac{\partial \operatorname{trace}\left[ A_i^T \Sigma_i^{-1} A_i \, E\left(U_j | x_j, z_{ij}=1; \Psi^{(k)}\right) \mu_i \mu_i^T \right]}{\partial A_i} = 2 \Sigma_i^{-1} A_i \, E\left(U_j | x_j, z_{ij}=1; \Psi^{(k)}\right) \mu_i \mu_i^T \qquad (37) \]
Using (36) and (37) and constraining the matrix \( A_i \) to be diagonal, the equation \( \partial Q_2(\xi; \Psi^{(k)}) / \partial A_i = 0 \) yields the maximizer (22).
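As an illustration of the closed-form M-step updates derived above, the following sketch implements updates in the spirit of (20)-(22) under the simplifying assumption of diagonal \( \Sigma_i \), so the \( \Sigma_i^{-1} \) factors cancel coordinate-wise; the responsibilities r_ij, weights u_ij, data, and means are hypothetical E-step outputs, not values from our experiments:

```python
import numpy as np

rng = np.random.default_rng(1)
n, g, d = 8, 2, 3
X = rng.normal(size=(n, d))        # feature vectors x_j
r = rng.random(size=(n, g))
r /= r.sum(axis=1, keepdims=True)  # responsibilities r_ij (rows sum to 1)
u = rng.random(size=(n, g)) + 0.5  # weights u_ij = E(U_j | x_j, z_ij = 1)
mu = rng.normal(size=(g, d))       # global component means mu_i
A = np.ones((g, d))                # current diagonal of A_i
ru = r * u                         # combined sample weights r_ij * u_ij

# (20): the Lagrange multiplier works out to lambda = n, giving
# c_i = (1/n) sum_j r_ij
c = r.sum(axis=0) / n

# (21): setting dQ2*/db_i = 0 gives
# b_i = sum_j r_ij u_ij (x_j - A_i mu_i) / sum_j r_ij u_ij
b = np.stack([(ru[:, i, None] * (X - A[i] * mu[i])).sum(0) / ru[:, i].sum()
              for i in range(g)])

# (22): for diagonal A_i and diagonal Sigma_i, (36)-(37) give, per coordinate l,
# a_il = sum_j r_ij u_ij (x_jl - b_il) mu_il / (sum_j r_ij u_ij * mu_il^2)
A_new = np.stack([(ru[:, i, None] * (X - b[i]) * mu[i]).sum(0)
                  / (ru[:, i].sum() * mu[i] ** 2) for i in range(g)])

print(c, b.shape, A_new.shape)
```

Note that the updates are conditional: b_i is computed with the current A_i, and A_i is then refreshed using the new b_i, as in a single EM iteration of the adaptation algorithm.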