Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
This article was published in an Elsevier journal. The attached copy is furnished to the author for non-commercial research and education use, including for instruction at the author’s institution, sharing with colleagues and providing to institution administration. Other uses, including reproduction and distribution, or selling or licensing copies, or posting to personal, institutional or third party websites are prohibited. In most cases authors are permitted to post their version of the article (e.g. in Word or Tex form) to their personal website or institutional repository. Authors requiring further information regarding Elsevier’s archiving and manuscript policies are encouraged to visit: http://www.elsevier.com/copyright Author's personal copy Available online at www.sciencedirect.com Vision Research 48 (2008) 235–243 www.elsevier.com/locate/visres A machine learning predictor of facial attractiveness revealing human-like psychophysical biases Amit Kagian a, Gideon Dror b, Tommer Leyvand a, Isaac Meilijson c, Daniel Cohen-Or a, Eytan Ruppin a,d,* b a School of Computer Sciences, Tel-Aviv University, Tel-Aviv 69978, Israel School of Computer Sciences, The Academic College of Tel-Aviv-Yaffo, Tel-Aviv 64044, Israel c School of Mathematical Sciences, Tel-Aviv University, Tel-Aviv 69978, Israel d School of Medicine, Tel-Aviv University, Tel-Aviv 69978, Israel Received 16 June 2007; received in revised form 28 September 2007 Abstract Recent psychological studies have strongly suggested that humans share common visual preferences for facial attractiveness. Here, we present a learning model that automatically extracts measurements of facial features from raw images and obtains human-level performance in predicting facial attractiveness ratings. The machine’s ratings are highly correlated with mean human ratings, markedly improving on recent machine learning studies of this task. Simulated psychophysical experiments with virtually manipulated images reveal preferences in the machine’s judgments that are remarkably similar to those of humans. Thus, a model trained explicitly to capture a specific operational performance criteria, implicitly captures basic human psychophysical characteristics. Ó 2007 Elsevier Ltd. All rights reserved. Keywords: Face perception; Facial attractiveness; Machine learning; Aesthetics; Computational neuroscience 1. Introduction Philosophers, artists and scientists have been trying to capture the nature of beauty since the early days of philosophy. Although in modern days a common layman’s notion is that judgments of beauty are a matter of subjective opinion alone, recent findings suggest that people share a common taste for facial attractiveness and that their preferences may be an innate part of our primary constitution. Several experiments have shown that 2–8 months old infants prefer looking at faces rated by adults as more attractive (Langlois et al., 1987). In addition, attractiveness ratings show very high agreement between groups of raters * Corresponding author. Address: School of Computer Sciences and School of Medicine, Tel-Aviv University, Tel-Aviv 69978, Israel. Fax: +972 3 640 9357. E-mail addresses: amit.kagian@gmail.com (A. Kagian), ruppin@post. tau.ac.il (E. Ruppin). 0042-6989/$ - see front matter Ó 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.visres.2007.11.007 belonging to the same culture and even across cultures (Cunningham, Roberts, Wu, Barbee, & Druen, 1995). Such findings give rise to the quest for common factors which determine human facial attractiveness. Accordingly, various hypotheses, from cognitive, evolutional and social perspectives, have been put forward to describe and interpret the common preferences for facial beauty. Inspired by Sir Francis Galton’s photographic method of composing faces (Galton, 1878), Langlois and Rogmann have created averaged faces by morphing multiple images together. Human judges found these averaged faces to be attractive and rated them with attractiveness ratings higher than the mean rating of the component faces composing them, proposing that averageness is the answer for facial attractiveness (Langlois & Roggman, 1990; Rubenstein, Langlois, & Roggman, 2002). Investigating symmetry and averageness of faces, Grammer and Thornhill concluded that symmetry was more important than averageness in facial attractiveness (Grammer & Thornhill, 1994). Other Author's personal copy 236 A. Kagian et al. / Vision Research 48 (2008) 235–243 studies have agreed that average faces are attractive but claim that faces with certain extreme features, such as extreme sexually dimorphic traits, may be more attractive than average faces (Little, Penton-Voak, Burt, & Perrett, 2002). Yet other researchers have suggested various conditions which may contribute to facial attractiveness such as neonate features, pleasant expressions and familiarity (Zebrowitz & Rhodes, 2002). Finally, Cunningham et al. have suggested a multiple fitness model in which there is no single constructing line that determines attractiveness (e.g. perception of fitness as implying an ideal romantic partner). Instead, different categories of features signal different desirable qualities of the perceived target (Cunningham, Barbee, & Philhower, 2002). Even so, the multiple fitness model agrees that some facial qualities are universally physically attractive to people. Apart from eliciting the facial characteristics which account for attractiveness, modern researchers have aimed to describe the mechanisms underlying these preferences. Many contributors refer to the evolutionary origins of attractiveness preferences (Andersson, 1994; Møller & Swaddle, 1997; Thornhill & Gangsted, 1999). According to this view, facial traits signal mate quality and imply chances for reproductive success and parasite resistance. Some evolutionary theorists suggest that preferred features might not signal mate quality but that the ‘‘good taste’’ by itself is an evolutionary adaptation (individuals with a preference for attractiveness will have attractive offspring that will be favored as mates) (Thornhill & Gangsted, 1999). Another mechanism explains attractiveness’ preferences through a cognitive theory—a preference for attractive faces might be induced as a by-product of general perception or recognition mechanisms (Rubenstein et al., 2002; Zebrowitz & Rhodes, 2002): attractive faces might be pleasant to look at since they are closer to the cognitive representation of the face category in the mind. Halberstadt and Rhodes have further demonstrated that not just average faces are attractive but also birds, fish, and automobiles become more attractive after being averaged with computer manipulation (Halberstadt & Rhodes, 2003). Such findings led researchers to propose that as perceivers can process an object more fluently, aesthetic response becomes more positive (Reber, Schwarz, & Winkielman, 2004). A third view suggests that facial attractiveness originates in a social mechanism, where preferences may be dependent on the learning history of the individual and even on his social goals (Zebrowitz & Rhodes, 2002). Other studies have used computational methods to analyze facial attractiveness. In several cases faces were averaged using morphing tools (e.g. Perrett, May, & Yoshikawa, 1994; Rubenstein et al., 2002). Laser scans of faces were put into complete correspondence with the average face in order to examine the relationship between facial attractiveness, age and averageness (ÓToole, Price, Vetter, Bartlett, & Blanz, 1999). A genetic algorithm, guided by interactive user selections was programmed to evolve a ‘‘most beautiful’’ female face (Johnston & Franklin, 1993). Machine learning methods have been used recently to investigate whether a machine can predict attractiveness ratings by learning a mapping from facial images to their attractiveness scores (Eisenthal, Dror, & Ruppin, 2006). The latter predictor achieved a correlation of 0.6 with average human ratings, demonstrating that facial beauty can be learned by a machine, at least to some moderate extent. However, as human raters significantly outperform the predictor of Eisenthal et al., the challenge of constructing a facial attractiveness machine predictor with human-level accuracy has remained open. A primary goal of this study is to surpass these results by developing a machine which obtains human-level performance in predicting facial attractiveness and, thus, passes what Kurzweil calls a subject matter expert turing test (SME TT) (Kurzweil, 2005). Having accomplished this, our second main goal is to conduct a series of simulated psychophysical experiments and study the resemblance between human and machine judgments. This latter task carries two potential rewards: first, to determine whether the machine can aid in understanding the psychophysics of human facial attractiveness, capitalizing on the ready accessibility of manipulating and studying its performance, and second, to study whether learning an explicit operational ratings prediction task also entails learning implicit human-like biases, at least for the case of facial attractiveness. In the past decades machines have achieved human-level performance in rule-based systems such as playing games (Schaeffer & Herik, 2002) and in various expert systems (Slezak, 1991). Impressive progress has been displayed in simulating various tasks which involve face perception, such as face detection (Hjelmas & Low, 2001), face recognition (Becker, 1999; Zhao, Chellappa, Rosenfeld, & Phillips, 2000) and tasks of facial category learning such as emotion (Dailey, Cottrell, Padgett, & Adolphs, 2002) and gender (Graf, Wichmann, Bülthoff, & Schölkopf, 2006) recognition. The task of evaluating human attractiveness ratings adds the notion of judgment of taste to the previous achievements in machine perception of faces. Learning the concept of facial attractiveness could form an important demonstration of a computer’s ability to learn to master a quantitative, basic, human judgment task. To this end we have collected human scores of facial attractiveness for a given dataset of female facial images. We developed an algorithm for automatic extraction of a very large set of geometric facial features, which, combined with a set of global features, yields a principled representation of each facial image via a set of image-features in an appropriate dimension-reduced space. Using this data of facial representations and their associated rating scores, we have employed standard supervised learning algorithms to construct a facial attractiveness prediction machine. Given a new, unseen face, this machine predicts its human attractiveness score in an accurate manner. We then turned to performing a series of simulated psychophysical experiments, modeled after known experiments in the psycholog- Author's personal copy A. Kagian et al. / Vision Research 48 (2008) 235–243 237 ical literature, to study the resemblance between human and machine preferences. These experiments are particularly interesting since the machine is trained on an explicit operational ratings prediction task with no defined instructions specifying the human-like biases in question. 2. Materials and methods 2.1. Rating the facial database The chosen database was composed of 91 facial images of American females, taken by the Japanese photographer Akira Gomi. All 91 samples were frontal color photographs of young Caucasian females with a neutral expression. All samples were of similar age, skin color and gender. The subjects’ portraits had no accessories or other distracting items such as jewelry. We focused on female faces since experimental results shows that there is a greater agreement on human ratings of female faces while male face preferences are more largely influenced by the menstrual cycle and self-perceived attractiveness of the raters (Little et al., 2002). All 91 facial images in the dataset were rated for attractiveness by 28 human raters (15 males, 13 females) on a 7-point Likert scale (1 = very unattractive, 7 = very attractive). Ratings were collected with a specifically designed html interface. Each rater was asked to view the entire set before rating in order to acquire a notion of attractiveness scale. There was no time limit for judging the attractiveness of each sample and raters could go back and adjust the ratings of previously rated samples. The images were presented to each rater in a random order and each image was presented on a separate page. The final attractiveness rating of each sample was its mean rating across all raters. To validate that the number of ratings collected adequately represented the ‘‘collective attractiveness rating’’ we randomly divided the raters into two disjoint groups of equal size. For each facial image, we calculated the mean rating on each group, and calculated the Pearson correlation between the mean ratings of the two groups. This process was repeated 1000 times. The mean correlation between two groups was 0.92 (r = 0.01). It should be noted that the split-half correlations reported were high in all 1000 trials (as evident from the low standard deviation) and not only over the average. This correlation corresponds well to the known level of consistency among groups of raters reported in the literature (e.g. Cunningham et al., 1995). Hence, the mean ratings collected are stable indicators of attractiveness that can be used for the learning task. The facial set contained faces in all ranges of attractiveness. Final attractiveness ratings range from 1.42 to 5.75 and the mean rating was 3.33 (r = 0.94). 2.2. Data preprocessing and representation Preliminary experimentation with various ways of representing a facial image (e.g. Eisenthal et al., 2006) have systematically shown that features based on measured proportions, distances and angles of faces are most effective in capturing the notion of facial attractiveness. To extract facial features we developed an automatic engine that is capable of identifying eyes, nose, lips, eyebrows and head contour. In total, we measured 84 coordinates describing the locations of those facial features (Fig. 1). Several regions are automatically suggested for sampling mean hair color, mean skin color and skin texture. The feature extraction process was basically automatic but some coordinates needed to be manually adjusted in some of the images. The facial coordinates are used to create a distances-vector of all 3486 distances between all pairs of coordinates in the complete graph created by all coordinates. For each image, all distances are normalized by face length (as measured from the coordinate at the top of the forehead to the coordinate at the bottom of the chin). In a similar manner, a slopes-vector of all the 3486 slopes of the lines connecting the facial coordinates is computed. Central fluctuating asymmetry (CFA) is calculated from the coordinates as well. CFA corresponds to the sum of the absolute values of the differences of the midpoints of adjacent horizontal lines which connect matching bilateral facial coordinates Fig. 1. Facial coordinates with hair and skin sample regions as represented by the facial feature extractor. Coordinates are used for calculating geometric features and asymmetry. Sample regions are used for extracting color values and smoothness. (The sample image, used for illustration only, is of T.G. and is presented with her full consent.) (see Grammer & Thornhill, 1994). The application also provides, for each face, hue, saturation and value (HSV) values of hair color and skin color, and a measurement of skin smoothness. Smoothness of skin was calculated with an edge-detection algorithm in which many detected edges suggest a low level of skin smoothness. Combining the distances-vector and the slopes-vector yields a vector representation of 6972 geometric features for each image. Since strong correlations are expected among the features in such representation, principal component analysis (PCA) was applied to the 6972 geometric features, producing 90 principal components that span the sub-space defined by the 91 image vector representations. The geometric features are projected on those 90 principal components to produce 90 decorrelated eigenfeatures representing the geometric features of the images. Eight non-geometric measured features were not included in the PCA analysis, including CFA, smoothness, hair color coordinates (HSV) and skin color coordinates. These features are assumed to be directly connected to human perception of facial attractiveness and are hence kept at their original values. These 8 features were added to the 90 geometric eigenfeatures, resulting in a total of 98 image-features representing each facial image in the dataset. 2.3. Predictor construction and validation We experimented with several induction algorithms including simple Linear Regression, Least Squares Support Vector Machine (LS-SVM) both linear as well as non-linear (Suykens, Van Gestel, De Brabanter, De Moor, & Vandewalle, 2002) and Gaussian Processes (GP) (Rasmussen & Williams, 2006). However, as the LS-SVM and GP showed no substantial advantage over Linear Regression, the latter was used and is presented in the sequel. A key ingredient in our method is to use a proper image-features selection strategy. To this end we used subset feature selection, implemented by ranking the image-features by their Pearson correlation with the target. Other ranking functions produced no substantial gain. To measure the performance of our method we removed one sample from the whole dataset. This sample served as a test set. We found, for each left out test sample, the optimal number of image-features by performing leave-one-outcross-validation (LOOCV) on the remaining samples and selecting the number of features that minimized the absolute difference between the algorithm’s output and the targets of the training set. Ranking of features was conducted independently for each held out image and performance was measured by aggregating together the scores of all images. In other words, while setting aside a test sample, we used LOOCV on the remaining training samples in order to optimize the number of features to select, mi, and afterwards used this number of features to predict a single, fixed attractiveness score for the left out test sample, that is, the score for a test exam- Author's personal copy 238 A. Kagian et al. / Vision Research 48 (2008) 235–243 ple was predicted using a single model based on the training set only. This process was repeated n = 91 times, once for each image sample, resulting with a vector of attractiveness predictions for all images. In order to avoid overfitting, the entire learning procedure (including feature selection) is repeated from scratch for each data partition, so that a different number of features are selected for each data partition. The number of selected features, mi, ranges between 50 and 77. mi = 67 features were most frequently selected. To examine the influence of resetting the number of selected features at each fold, we tested the predictor in a leave-one-out-cross-validation, on the entire dataset, while keeping the number of selected features constant in all iterations (and not reselecting it each time). The fixed number of selected features ranged between m = 1 (a single feature) and m = 98 (all features). The best Pearson correlation of 0.87 was achieved with m = 64. These 64 features include 7 of the 8 non-geometric features (all but Hair-hue). The remaining 57 geometric eigenfeatures explain 96% of the variance of the geometric features. For all 54 6 m 6 74, Pearson and Spearman correlations were above 0.8. As is evident, a fixed number of selected features can yield better performance than the one our method of reselection produced. Still, we did not use a fixed value for the number of selected features, m, since we did not want to rely on test performance when choosing the constant value of m in order to avoid overfitting. It should be noted that we tried to use the same feature selection and training procedure with the original geometric features (without PCA) instead of using the eigenfeatures. This, however, has failed to produce good predictors due to strong correlations between the original geometric features (the maximal Pearson correlation obtained was 0.26). Once the predictor was constructed and validated, we turn to simulate a number of psychophysical experiments that were previously conducted with human subjects. 2.3.1. Experiment 1 We created virtual face composites for the machine to rate by simulating a morphing technique similar to the one used by Rubenstein et al. (2002). Coordinate values of the original component faces were averaged to create a new set of coordinates for the virtual face composite. These coordinates were used to calculate the geometrical features and CFA of the averaged face. Smoothness and HSV values for the composite face were calculated by averaging the corresponding values of the component faces (HSV values are converted to RGB before averaging). To study the effect of the number of component faces, nc, on the attractiveness score of face composites we produced 1000 virtual morph images for each value of nc between 2 and 50, and used our attractiveness predictor to compute the attractiveness scores of the resulting composites. 2.3.2. Experiment 2 In order to further examine the importance of symmetry on the machine’s attractiveness judgments of averaged composites, we repeated the virtual composites experiment (Experiment 1) using perfectly symmetric faces as image components. Perfectly symmetric virtual versions of the original images were created by a similar technique to the one used by Rhodes, Sumich, and Byatt (1999), that is, each original face was virtually morphed with its mirror image in order to create a perfectly symmetric version of it. Averaging together perfectly symmetric component faces produces perfectly symmetric face composites. In the same manner as in Experiment 1, 1000 virtual composites were created for each number of components, nc, between 2 and 50, and the machine rated them for attractiveness. 2.3.3. Experiment 3 Analogously to Zaidel, Chen, and German (1995) who have created chimeric facial composites by attaching one-half of the face to its mirror image, Right–right and left–left virtual chimeric composites were produced from the extracted coordinates of all original images and the machine was used to predict their attractiveness ratings. Learning was repeated for each chimeric composite with the original image used for each chimeric composition being excluded from the training set, to avoid a misleading positive bias as a consequence of the fact that the original image contains many features which are identical to those of the matching composite. 2.4. Facial features and attractiveness The original measured facial features were ranked according to their correlation to human and machine ratings and the Spearman rank correlation between the two rankings was calculated. This analysis was repeated three times: (a) with 6980 predictor-features, (b) with 28 features investigated in previous studies and (c) with 13 features previously found to be significantly related to facial attractiveness. To determine the P-value of the rank correlation we repeated the features ranking 100 times with shuffled machine and human ratings. In none of the shuffled trials was the rank correlation as high as with the actual ratings. The 28 features we focused on are (features marked with an * were previously found to be significantly correlated with facial attractiveness): (1) forehead height, (2) eye height*, (3) eye width*, (4) separation of eyes*, (5) nose tip width, (6) nostril width*, (7) nose length (to eye top), (8) nose area*, (9) upper lip thickness, (10) lower lip thickness, (11) chin length*, (12) cheekbone width*, (13) jaw (cheek) width*, (14) mid-face length, (15) eyebrow height*, (16) mouth width* (features 1–16 are taken from Cunningham, 1986), (17) outer eye corner width*, (18) inner eye corner width, (19) cheek width, (20) cheekbone prominence*, (21) lower face proportions (features 17–12 are taken from Grammer & Thornhill, 1994), (22) forehead height (to eyes), (23) brow height, (24) brow curvature, (25) lower face length, (26) nose length, (27) mouth height*, (28) cheekbone height (features 22–28 are taken from Grammer, Fink, Juette, Ronzal, & Thornhill, 2002). (See Supporting Information for a detailed description of the calculation of these 28 feature measurements according to the raw coordinate representation.) 3. Results 3.1. Prediction accuracy of facial attractiveness Machine attractiveness ratings of all sample images obtained a high Pearson correlation of 0.82 (Pvalue < 10 23) with the mean ratings of human raters (the learning targets), corresponding to a normalized mean squared error of 0.39. This accuracy is a marked improvement over the recently published performance results of a Pearson correlation of 0.6 on a similar dataset (Eisenthal et al., 2006). The average correlation of an individual human rater to the mean ratings of all other human raters in our dataset is 0.67 and the average correlation between the mean ratings of groups of human raters is 0.92 (see Section 2). The Spearman rank correlation between machine and mean human ratings is 0.83. To further validate the correlation measures we removed the most attractive 12% and the least attractive 12% of the samples from the dataset and recalculated the correlations. Correlation values remained high with 0.80 (Pearson) and 0.81 (Spearman). To get a notion of the contribution of the 8 global, non-geometric features to attractiveness prediction, we have trained the predictor while removing them one at a time. This resulted in correlations of 0.68 when excluding asymmetry, 0.80 when excluding smoothness, 0.77 when excluding hair color (3 attributes) and 0.77 when excluding skin color (3 attributes). Excluding all non-geometric features and using geometric features alone yielded a correlation of 0.74. 3.2. Similarity of machine and human judgments The ratings of each rater (28 human raters and the machine predictor) form a 91 dimensional rating vector Author's personal copy A. Kagian et al. / Vision Research 48 (2008) 235–243 Fig. 2. Distribution of mean Euclidean distance from each human rater to all other raters in the ratings space. The machine’s average distance from all other raters (left bar) is smaller than the average distance of each of the human raters to all others. describing its attractiveness ratings of all 91 images. These vectors can be embedded in a 91 dimensional ratings space. The Euclidean distance between all raters (human and machine) in this space was computed. Compared with each of the human raters, the ratings of the machine were the closest, on average, to the ratings of all other human raters (Fig. 2). Although, by construction, the machine’s rating vector lies near the mean of human ratings, it may still be very different from any individual human rating vector. This may happen, e.g. when the distribution of human ratings forms several clusters or is non-convex. To assure this is not the case, we counted the number of human ratings vectors within small multidimensional spheres around each human rater as well as the rating of the machine. The machine had more human neighbors than the mean number of neighbors that a human rater had, even when the radiuses of the spheres were very small, testifying that it does not fall between clusters. Finally, to visualize the machine ratings among human ratings we applied PCA to machine and human ratings in the rating space and projected all ratings onto the resulting first 2 and 3 principal components. Indeed, the machine is well placed in a mid-zone of human raters (Fig. 3). 3.3. Human-like biases in the machine’s performance 3.3.1. Experiment 1: The averageness hypothesis: A preference for averaged face composites Rubenstein et al. (2002) discuss a morphing technique to create mathematically averaged faces from multiple face images. They report that averaged faces made of 16 and 32 original component images were rated by humans higher in attractiveness than the mean attractiveness rat- 239 ings of their component faces and higher than composites consisting of fewer faces. In their experiment, 32-component composites were found to be the most attractive. In accordance with these experimental results, the predictor manifests a human-like bias for higher scores for averaged composites over their components’ mean score. Fig. 4a shows the percent of components which were rated as less attractive than their corresponding composite, for each number of components nc. As evident, the attractiveness rating of a composite surpasses a larger percent of its components’ ratings as nc increases. Fig. 4a also shows the mean scores of 1000 composites and the mean scores of their components, for each nc (scores are normalized to the range [0, 1]). Their actual attractiveness scores are reported in Table 1. As expected, the mean scores of the component images are independent of nc, while composites’ scores increase with nc (see Supporting Information for a more detailed analysis of the difference between composites and components scores). Recent studies have provided evidence that skin texture influences judgments of facial attractiveness (Fink, Grammer, & Thornhill, 2001). Since blurring and smoothing of faces occur when faces are averaged together (Rubenstein et al., 2002), the smooth complexion of composites may underlie the attractiveness of averaged composites. In our experiment, a preference for averageness is found even though our method of virtual-morphing does not produce the smoothening effect and the mean smoothness value of composites corresponds to the mean smoothness value in the original dataset, for all nc (see Fig. 4b). Researchers have also suggested that averaged faces are attractive since they are exceptionally symmetric (Alley & Cunningham, 1991). Fig. 3a and b shows that the mean level of asymmetry (CFA, see Section 2) is indeed highly correlated with the mean scores of the composites (Pearson correlation of 0.91, P-value < 10 19). However, examining the correlation between the rest of the image-features and the composites’ scores reveals that this high correlation is not at all unique to asymmetry. In fact, as the images are being morphed, the changes in 45 of the 98 image-features are strongly correlated with the changes in attractiveness scores (|Pearson correlation| > 0.9). The high correlation between these numerous features and attractiveness scores of averaged faces indicates that symmetry level is not an exceptional factor in the machine’s preference for averaged faces. Instead, it suggests that averaging causes many features to change in a direction which causes an increase in attractiveness. It has been argued that although averaged faces are found to be attractive, very attractive faces are not average (Alley & Cunningham, 1991). A virtual composite made of the 12 most attractive faces in the set (as rated by humans) was rated by the machine with a high score of 5.6 while 1000 composites made of 50 faces from random levels of attractiveness got a maximum score of only 5.3. (Their mean score was only 3.94 as reported in Table 1.) This type of preference resembles the findings of Perrett et al. (1994) Author's personal copy 240 A. Kagian et al. / Vision Research 48 (2008) 235–243 Fig. 3. Location of machine ratings among the 28 human ratings. Ratings were projected into 2 dimensions (a) and 3 dimensions (b) by performing PCA on all ratings and projecting them on the first principal components. The projected data explain 29.8% of the variance in (a) and 36.6% in (b). Fig. 4. Experiment 1: The averageness hypothesis: (a) percent of components that were rated as less attractive than their corresponding composite, accompanied with mean scores of composites and the mean scores of their components (scores are normalized to the range [0, 1], actual attractiveness scores are reported in Table 1). (b) Mean values of smoothness and asymmetry of 1000 composites for each number of components, nc. Table 1 Mean results over 1000 composites made of varying numbers of component images Number of components in composite Mean composite score Mean components score Components rated lower than composite (%) 2 4 12 25 50 3.46 3.66 3.74 3.82 3.94 3.34 3.33 3.32 3.32 3.33 55 64 70 75 81 in which a highly attractive composite, morphed from only attractive faces, was preferred by humans over a composite made of 60 images of all levels of attractiveness. 3.3.2. Experiment 2: Perfectly symmetric averaged faces Rhodes et al. (1999) inquired whether changes in attractiveness produced by manipulating the averageness of individual faces should disappear when all the images are made perfectly symmetric. They created perfectly symmetric composites by morphing original images with their matching mirror images. In their experiment human subjects showed a preference for averaged face composites even when the effect of symmetry is controlled for. Similarly, in our experiment, the effect of symmetry was neutralized by using only perfectly symmetric component faces which yielded perfectly symmetric composites (see Fig. 5b). It can be seen that the results presented in Fig. 5a are similar to those of Experiment 1 (Fig. 4a). That is, even though the Author's personal copy A. Kagian et al. / Vision Research 48 (2008) 235–243 241 Fig. 5. Mean results over 1000 perfectly symmetric composites made of varying numbers of perfectly symmetric image components: (a) percent of components which were rated as less attractive than their corresponding composite, accompanied with mean scores of composites and the mean scores of their components (scores are normalized to the range [0, 1]). (b) Mean values of smoothness and asymmetry of 1000 composites for each number of components, nc. effect of symmetry is controlled for, attractiveness scores of averaged face composites increases with the number of components, nc. Mean values of smoothness and asymmetry of the composites are presented in Fig. 5b. These results show that the machine’s preference for averaged composites is not dependent on symmetry alone, in accordance with the experimental results of Rhodes et al. (1999) and with our conclusions from Experiment 1 (see Supporting Information for a more detailed analysis of the difference between perfectly symmetric composites and components scores). 3.3.3. Experiment 3: Asymmetry of facial attractiveness perception A recent study examining the asymmetry of attractiveness perception has offered an intriguing relationship between facial attractiveness and hemispheric specialization (Zaidel et al., 1995). In this research, right–right and left–left chimeric composites (where ‘left’ refers to the subject’s side of the face) were created by attaching each half of the face to its mirror image. Human subjects were asked to look at left–left and right–right composites of the same image and judge which one is more attractive. For women’s faces, right–right composites, composed of the right half of the subject’s face, got twice as many ‘more attractive’ responses than left–left composites. Interestingly, similar results to those were found in Experiment 3 in which we simulated this phenomenon by comparing the machine’s rating of facial attractiveness for left–left and right–right composites. The machine gave 63 out of 91 right–right composites a higher rating than their matching left–left composite, while only 28 left–left composites were judged as more attractive. A paired t-test shows these results to be statistically significant with P-value < 10 7 (scores of chimeric composites are approximately normally distributed). When rating composites created from a certain image the machine was trained without the original image in its training set. Since the machine representation of the images is completely symmetric, any asymmetric bias revealed is likely to be an implicit manifestation of a psychophysical bias of the human raters. It is interesting to see that the machine manifests the same kind of asymmetry bias reported by Zaidel et al. (1995), though it has never been explicitly trained for that. 3.4. Facial features and attractiveness After establishing that our machine exhibits human-like biases, we turn to compare its processing with those reported in the pertaining human psychophysics literature. A number of studies have singled out facial features that are especially relevant to facial attractiveness, by identifying significant correlations between facial features measurements and human attractiveness ratings (Cunningham et al., 2002; Grammer & Thornhill, 1994; Little et al., 2002). In analogy, facial features that are significantly correlated with the machine’s ratings may be considered as important in determining the machine’s perception of attractiveness. In order to examine whether the important features according to the machine are similar to those of humans, we calculated the correlation between each of the 6980 features used by our predictor (6972 geometric Author's personal copy 242 A. Kagian et al. / Vision Research 48 (2008) 235–243 features and 8 non-geometric measurements) and the machine and human ratings (separately). The features were ranked according to their absolute correlation to attractiveness ratings which resulted with two feature rankings: human and machine. The Spearman rank correlation between human and machine ranking was 0.57 and significant (P-value < 0.01, see Section 2). To further compare between feature rankings of humans and machine, we repeated the above computation focusing on 28 facial features which were previously studied in the literature of human facial attractiveness (see Section 2). Those 28 features were now ranked according to machine and human ratings and the Spearman rank correlation between the two rankings was 0.68 (P-value < 0.01). Out of those 28 features, only 13 were found to be significantly related to facial attractiveness in the original studies (Cunningham et al., 2002; Grammer & Thornhill, 1994; Little et al., 2002). Ranking these 13 facial features according to machine and human ratings, yields a Spearman rank correlation of 0.75 (P-value < 0.01) between the rankings. These results provide further evidence of the human-like nature of the machine’s perception of attractiveness, as they show that features that were previously related to facial attractiveness are ranked similarly according to the machine and according to human raters. 4. Discussion In this work, we constructed a high quality training set for learning facial attractiveness of human faces. Using a combination of extensive automatic facial feature extraction, dimension reduction and feature selection, and supervised learning methodologies; we created the first accurate facial attractiveness predictor. Our results add the task of facial attractiveness prediction to the collection of abstract tasks that have been successfully accomplished with current machine learning techniques. While previous machines that successfully passed a subject matter expert turing test (SME TT) have dealt with rule-based cognitive systems, such as playing games, or perceptual tasks of category learning, such as emotion recognition, our machine predicts continuous facial attractiveness ratings and passes a perceptual SME TT that concerns simulating judgment of taste. Whether to compare the machine’s performance in the task to the performance of an individual human rater or to a group of raters is an interesting issue: the machine is an ‘individual rater’ which learns ‘group average ratings’ and thus is essentially a hybrid between the two. For that reason we report on both benchmarks, that is, the human individual-to-group mean correlation of 0.67 and the human group-to-group mean correlation of 0.92, and indeed, we find the machine’s performance (correlation of 0.82) between the two. One of the main improvements over previous similar works, such as the work of Eisenthal et al. (2006), is the much richer representation of 84 facial coordinates and 6972 distance and angle features (induced from the full graph on the facial coordinates). This suggests that improving the facial representation might be valuable for future research. One promising suggestion is to employ a non-metric facial representation which may expedite the learning of human facial attractiveness. Examining the machine and human raters’ representations in the ratings space identifies the ratings of the machine near the center of the distribution of human ratings, and closest, on average, to other human raters. The ranking of facial features according to their correlations with machine ratings is correlated significantly with the ranking of those features according to human ratings. The similarity between human and machine preferences has prompted us to further study the machine’s operation. To this end, we have found that the machine favors averaged faces made of several component faces. While this preference is known to be common to humans as well, researchers have previously offered different reasons for favoring averageness. Our analysis has revealed that symmetry is strongly related to the attractiveness of averaged faces, but is definitely not the only factor in the equation, since about half of the image-features relate to the ratings of averaged composites in a similar manner as symmetry and since a preference for averaged faces was found even when the effect of symmetry was neutralized. This suggests that a general movement of features toward attractiveness, rather than a simple increase in symmetry, is responsible for the attractiveness of averaged faces. This movement suggests a convergence towards a prototypical facial representation that matches the cognitive explanations of the averageness hypothesis (Rubenstein et al., 2002). Obviously, this is true only for the machine, but given the human-like biases displayed by our predictor, this may extend also to human perception of facial attractiveness. Overall, it is quite surprising and pleasing to find that a model trained explicitly to capture a specific operational performance criteria such as attractiveness rating (weak AI), implicitly and concomitantly captures basic human psychophysical biases and demonstrates a wide range of human-level characteristics of facial attractiveness judgment (strong AI), as revealed by studying its ‘‘psychophysics’’. Acknowledgments We thank Dr. Bernhard Fink and the Ludwig-Boltzmann Institute for Urban Ethology at the Institute for Anthropology, University of Vienna, Austria, and Prof. Alice J. O’Toole from the University of Texas at Dallas, for kindly letting us use their face databases. This work was supported by the internal research fund of The Academic College of Tel-Aviv-Yaffo. Appendix A. Supplementary data Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.visres. 2007.11.007. Author's personal copy A. Kagian et al. / Vision Research 48 (2008) 235–243 References Alley, T. R., & Cunningham, M. R. (1991). Averaged faces are attractive but very attractive faces are not average. Psychological Science, 2, 123–125. Andersson, M. (1994). Sexual selection. Princeton, NJ: Princeton University Press. Becker, S. (1999). Implicit learning in 3D object recognition: The importance of temporal context. Neural Computation, 11(2), 347–374. Cunningham, M. R. (1986). Measuring the physical attractiveness: Quasiexperiments on the sociobiology of female facial beauty. Journal of Personality and Social Psychology, 50, 925–935. Cunningham, M. R., Barbee, A. P., & Philhower, C. L. (2002). Dimensions of facial physical attractiveness: The intersection of biology and culture. In G. Rhodes & L. A. Zebrowitz (Eds.). Advances in visual cognition, vol. 1: facial attractiveness. Westport, CT: Ablex. Cunningham, M. R., Roberts, A. R., Wu, C.-H., Barbee, A. P., & Druen, P. B. (1995). Their ideas of beauty are, on the whole, the same as ours: Consistency and variability in the cross-cultural perception of female physical attractiveness. Journal of Personality and Social Psychology, 68, 261–279. Dailey, M. N., Cottrell, G. W., Padgett, C., & Adolphs, R. (2002). EMPATH: A neural network that categorizes facial expressions. Journal of Cognitive Neuroscience, 14(8), 1158–1173. Eisenthal, Y., Dror, G., & Ruppin, E. (2006). Facial attractiveness: Beauty and the machine. Neural Computation, 18, 119–142. Fink, B., Grammer, K., & Thornhill, R. (2001). Human (homo sapiens) facial attractiveness in relation to skin texture and color. Journal of Comparative Psychology, 115, 92–99. Galton, F. (1878). Composite portraits. Journal of the Anthropological Institute of Great Britain and Ireland, 8, 132–142. Graf, A. B. A., Wichmann, F. A., Bülthoff, H. H., & Schölkopf, B. (2006). Classification of faces in man and machine. Neural Computation, 18, 143–165. Grammer, K., & Thornhill, R. (1994). Human (Homo sapiens) facial attractiveness and sexual selection: The role of symmetry and averageness. Journal of Comparative Psychology, 108, 233–242. Grammer, K., Fink, B., Juette, A., Ronzal, G., & Thornhill, R. (2002). Female faces and bodies: N-dimensional feature space and attractiveness. In G. Rhodes & L. A. Zebrowitz (Eds.), Advances in visual cognition, vol. 1: facial attractiveness. Westport, CT: Ablex. Halberstadt, J. B., & Rhodes, G. (2003). It’s not just average faces that are attractive: Computer-manipulated averageness makes birds, fish, and automobiles attractive. Psychonomic Bulletin and Review, 10, 149–156. Hjelmas, E., & Low, B. K. (2001). Face detection: A survey. Computer Vision and Image Understanding, 83, 236–274. Johnston, V. S., & Franklin, M. (1993). Is beauty in the eye of the beholder? Ethology and Sociobiology, 14, 183–199. Kurzweil, R. (2005). The singularity is near: When humans transcend biology. Viking Penguin. 243 Langlois, J. H., & Roggman, L. A. (1990). Attractive faces are only average. Psychological Science, 1, 115–121. Langlois, J. H., Roggman, L. A., Casey, R. J., Ritter, J. M., RieserDanner, L. A., & Jenkins, V. Y. (1987). Infant preferences for attractive faces: Rudiments of a stereotype? Developmental Psychology, 23, 363–369. Little, A. C., Penton-Voak, I. S., Burt, D. M., & Perrett, D. I. (2002). Evolution and individual differences in the perception of attractiveness: How cyclic hormonal changes and self-perceived attractiveness influence female preferences for male faces. In G. Rhodes & L. A. Zebrowitz (Eds.), Advances in visual cognition, vol. 1: facial attractiveness. Westport, CT: Ablex. Møller, A. P., & Swaddle, J. P. (1997). Asymmetry, developmental stability, and evolution. Oxford: Oxford University Press. ÓToole, A. J., Price, T., Vetter, T., Bartlett, J. C., & Blanz, V. (1999). 3D shape and 2D surface textures of human faces: The role of averages in attractiveness and age. Image and Vision Computing, 18, 9–19. Perrett, D. I., May, K. A., & Yoshikawa, S. (1994). Facial shape and judgments of female attractiveness. Nature, 368, 239–242. Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. The MIT Press. ISBN 0-262-18253-X. Reber, R., Schwarz, N., & Winkielman, P. (2004). Processing fluency and aesthetic pleasure: Is beauty in the perceiver’s processing experience? Personality and Social Psychology Review, 8, 364–382. Rhodes, G., Sumich, A., & Byatt, G. (1999). Are average facial configurations attractive only because of their symmetry? Psychological Science, 10, 52–58. Rubenstein, A. J., Langlois, J. H., & Roggman, L. A. (2002). What makes a face attractive and why: The role of averageness in defining facial beauty. In G. Rhodes & L. A. Zebrowitz (Eds.), Advances in visual cognition, vol. 1: facial attractiveness. Westport, CT: Ablex. Schaeffer, J., & Herik, H. J. (2002). Games, computers, and artificial intelligence. Artificial Intelligence, 134, 1–7. Slezak, P. (1991). Artificial experts: Essay review. Social Studies of Science, 22(1), 175–201. Suykens, J. A. K., Van Gestel, T., De Brabanter, J., De Moor, B., & Vandewalle, J. (2002). Least squares support vector machines. World Scientific, Singapore. ISBN 981-238-151-1. Thornhill, R., & Gangsted, S. W. (1999). Facial attractiveness. Trends in Cognitive Sciences, 3, 452–460. Zaidel, D. W., Chen, A. C., & German, C. (1995). She is not a beauty even when she smiles: Possible evolutionary basis for a relationship between facial attractiveness and hemispheric specialization. Neuropsychologia, 33(5), 649–655. Zebrowitz, L. A., & Rhodes, G. (2002). Nature let a hundred flowers bloom: The multiple ways and wherefores of attractiveness. In G. Rhodes & L. A. Zebrowitz (Eds.), Advances in visual cognition, vol. 1: facial attractiveness. Westport, CT: Ablex. Zhao, W. Y., Chellappa, R., Rosenfeld, A., & Phillips, P. J. (2000). Face recognition: A literature survey. UMD CfAR Technical Report CARTR-948.