The explosive growth of multimedia data on the Internet creates huge opportunities for online video advertising. In this paper, we propose a novel advertising system called SalAds, which utilizes textual information, visual content and webpage saliency to automatically associate the most appropriate companion ads with online videos. Unlike most existing approaches that only focus on selecting the most relevant ads, SalAds further considers the saliency of the selected ads to reduce the likelihood that viewers intentionally ignore them. SalAds consists of three basic steps. Given an online video and a set of advertisements, we first roughly identify a set of relevant advertisements based on textual information matching. We then carefully select a set of candidates based on visual content matching. In this regard, our selected ads are contextually relevant to the online video content in terms of both textual information and visual content. We finally select the most salient ad among the relevant ads as the most appropriate one. To demonstrate the effectiveness of our method, we conduct a rigorous eye-tracking experiment on two ad datasets. Our experimental results show that, compared with existing approaches, our method enhances user engagement with the ad content while maintaining users' video viewing experience.
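The three-step pipeline can be illustrated with a minimal sketch. The shortlist sizes, the TF-IDF text matching, the pooled visual features, and the precomputed per-ad saliency score are illustrative assumptions, not the paper's actual implementation:

```python
# Minimal sketch of a three-step ad selection pipeline in the spirit of SalAds.
# Thresholds, feature extractors, and the 'saliency' field are assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_ad(video_text, video_visual_feat, ads, text_top_k=20, visual_top_k=5):
    """ads: list of dicts with 'text', 'visual_feat', 'saliency' fields."""
    # Step 1: rough filtering by textual relevance (TF-IDF cosine similarity).
    docs = [video_text] + [ad["text"] for ad in ads]
    tfidf = TfidfVectorizer().fit_transform(docs)
    text_sim = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
    candidates = np.argsort(-text_sim)[:text_top_k]

    # Step 2: refine by visual-content similarity between pooled video features
    # and each candidate ad's visual features.
    vis_sim = np.array([
        cosine_similarity(video_visual_feat.reshape(1, -1),
                          ads[i]["visual_feat"].reshape(1, -1))[0, 0]
        for i in candidates
    ])
    shortlist = candidates[np.argsort(-vis_sim)[:visual_top_k]]

    # Step 3: among the contextually relevant ads, pick the most salient one.
    return max(shortlist, key=lambda i: ads[i]["saliency"])
```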
In this paper, we propose using augmented hypotheses which consider objectness, foreground and compactness for salient object detection. Our algorithm consists of four basic steps. First, our method generates the objectness map via objectness hypotheses. Based on the objectness map, we estimate the foreground margin and compute the corresponding foreground map, which favors foreground objects. From the objectness map and the foreground map, the compactness map is formed to favor compact objects. We then derive a saliency measure that produces a pixel-accurate saliency map which uniformly covers the objects of interest and consistently separates foreground and background. We finally evaluate the proposed framework on two challenging datasets, MSRA-1000 and iCoSeg. Our extensive experimental results show that our method outperforms state-of-the-art approaches.
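A simple way to picture the fusion of the three cues is sketched below. The multiplicative combination is an assumption for exposition; the paper derives its own saliency measure from these hypotheses:

```python
# Illustrative fusion of objectness, foreground, and compactness maps into one
# pixel-wise saliency map. The product rule is a stand-in for the paper's measure.
import numpy as np

def normalize(m):
    m = m.astype(np.float64)
    rng = m.max() - m.min()
    return (m - m.min()) / rng if rng > 0 else np.zeros_like(m)

def fuse_saliency(objectness, foreground, compactness):
    """All inputs are HxW maps in arbitrary ranges; output is in [0, 1]."""
    o, f, c = map(normalize, (objectness, foreground, compactness))
    saliency = o * f * c  # a pixel must score high on all three cues
    return normalize(saliency)
```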
In this work, we propose an alternative ground truth to the eye fixation map in visual attention study, called touch saliency. Because it can be collected directly from the recorded data of users' daily browsing behavior on widely used touch-screen smartphones, touch saliency data is easy to obtain. Due to the limited screen size, smartphone users usually pan and zoom images and fix the region of interest on the screen when browsing them. Our studies are two-fold. First, we collect and study the characteristics of these touch-screen fixation maps (termed touch saliency) through comprehensive comparisons with their counterpart, the eye-fixation maps (namely, visual saliency). The comparisons show that touch saliency is highly correlated with eye fixations for the same stimuli, which indicates its utility in data collection for visual attention study. Based on the consistency between touch saliency and visual saliency, our second task is to propose a unified saliency prediction model for both visual and touch saliency detection. This model utilizes middle-level object category features extracted from pre-segmented image superpixels as input to the recently proposed multitask sparsity pursuit (MTSP) framework for saliency prediction. Extensive evaluations show that the proposed middle-level category features can considerably improve saliency prediction performance when taking both touch saliency and visual saliency as ground truth.
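The kind of consistency analysis described above can be sketched with a standard saliency comparison metric. The snippet below uses Pearson's linear correlation coefficient (CC) between a touch saliency map and its eye-fixation counterpart; it is an illustration, not the paper's full evaluation protocol:

```python
# Compare a touch-saliency map against an eye-fixation map using Pearson's CC,
# one standard saliency evaluation metric.
import numpy as np

def correlation_coefficient(touch_map, fixation_map):
    t = (touch_map - touch_map.mean()) / (touch_map.std() + 1e-12)
    f = (fixation_map - fixation_map.mean()) / (fixation_map.std() + 1e-12)
    return float((t * f).mean())
```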
There is a dearth of information on how perceived auditory information guides image-viewing behavior. To investigate auditory-driven visual attention, we first generated a human eye-fixation database from a pool of 200 static images and 400 image-audio pairs viewed by 48 subjects. The eye-tracking data for the image-audio pairs were captured while participants viewed images immediately after exposure to coherent or incoherent audio samples. The database was analyzed in terms of time to first fixation, fixation durations on the target object, entropy, AUC, and saliency ratio. It was found that coherent audio information is an important cue for enhancing the feature-specific response to the target object, whereas incoherent audio information attenuates this response. Finally, a system was developed to predict image-viewing behavior under the influence of different audio sources. Its top-down module, which we discuss in detail, combines auditory estimation based on a GMM-MAP-UBM structure with visual estimation based on a CRF model and sparse latent variables. Evaluation experiments show that the proposed models in the system exhibit strong consistency with eye fixations.
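One of the listed measures, AUC, scores how well a predicted saliency map separates fixated from non-fixated pixels. A minimal sketch is given below; the array shapes and helper name are assumptions, not the authors' evaluation code:

```python
# AUC for a predicted saliency map against binary fixation locations.
import numpy as np
from sklearn.metrics import roc_auc_score

def saliency_auc(saliency_map, fixation_map):
    """saliency_map: HxW floats; fixation_map: HxW binary (1 = fixated)."""
    labels = (fixation_map.ravel() > 0).astype(int)
    scores = saliency_map.ravel()
    return roc_auc_score(labels, scores)
```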
Decrypting the secret of beauty or attractiveness has been the pursuit of artists and philosophers for centuries. To date, computational models for attractiveness estimation have been actively explored in the computer vision and multimedia communities, yet with the focus mainly on facial features. In this article, we conduct a comprehensive study of female attractiveness conveyed by single or multiple modalities of cues, that is, face, dressing and/or voice, and aim to discover how different modalities individually and collectively affect the human sense of beauty. To investigate the problem extensively, we collect the Multi-Modality Beauty (M2B) dataset, which is annotated with attractiveness levels converted from manual k-wise ratings and with semantic attributes for each modality. Inspired by the common consensus that middle-level attribute prediction can assist higher-level computer vision tasks, we manually labeled many attributes for each modality. Next, a tri-layer Dual-supervised Feature-Attribute-Task (DFAT) network is proposed to jointly learn the attribute model and the attractiveness model of single/multiple modalities. To remedy the possible loss of information caused by incomplete manual attributes, we further propose a novel Latent Dual-supervised Feature-Attribute-Task (LDFAT) network, in which latent attributes are combined with manual attributes to contribute to the final attractiveness estimation. Extensive experimental evaluations on the collected M2B dataset demonstrate the effectiveness of the proposed DFAT and LDFAT networks for female attractiveness prediction.
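The feature-attribute-task idea can be pictured as a small multi-task network: a shared feature layer feeds an attribute head, and both feed an attractiveness regressor. The layer sizes and wiring below are assumptions for illustration only; this is not the published DFAT/LDFAT architecture:

```python
# Minimal PyTorch sketch of a dual-supervised feature -> attribute -> task net.
import torch
import torch.nn as nn

class DualSupervisedNet(nn.Module):
    def __init__(self, in_dim, feat_dim=256, n_attributes=40):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.attribute_head = nn.Linear(feat_dim, n_attributes)  # supervised by manual attributes
        self.attractiveness_head = nn.Linear(feat_dim + n_attributes, 1)

    def forward(self, x):
        feat = self.feature(x)
        attrs = torch.sigmoid(self.attribute_head(feat))
        score = self.attractiveness_head(torch.cat([feat, attrs], dim=1))
        return attrs, score

# Training would combine an attribute loss (e.g., BCE) with an attractiveness
# loss (e.g., MSE), reflecting the dual supervision described above.
```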
In this paper, we present an adaptive nonparametric solution to the image parsing task, namely annotating each image pixel with its corresponding category label. For a given test image, a locality-aware retrieval set is first extracted from the training data based on superpixel matching similarities, which are augmented with feature extraction for better differentiation of local superpixels. Then, the category of each superpixel is initialized by the majority vote of the k-nearest-neighbor superpixels in the retrieval set. Instead of fixing k as in traditional nonparametric approaches, here we propose a novel adaptive nonparametric approach that determines a sample-specific k for each test image. In particular, k is adaptively set to the smallest number of nearest superpixels with which the images in the retrieval set obtain the best category prediction. Finally, the initial superpixel labels are further refined by contextual smoothing. Extensive experiments on challenging datasets demonstrate the superiority of the new solution over other state-of-the-art nonparametric solutions.
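The adaptive choice of k can be sketched as follows: score candidate k values on images from the retrieval set (whose labels are known) and keep the smallest k achieving the best accuracy. The scoring routine is a stand-in for the paper's actual labeling pipeline:

```python
# Sketch of sample-specific k selection for k-NN superpixel voting.
import numpy as np

def choose_adaptive_k(retrieval_images, candidate_ks, labeling_accuracy):
    """labeling_accuracy(image, k) -> accuracy of k-NN superpixel voting."""
    mean_acc = [np.mean([labeling_accuracy(img, k) for img in retrieval_images])
                for k in candidate_ks]
    best = max(mean_acc)
    # smallest k whose accuracy matches the best observed accuracy
    return min(k for k, acc in zip(candidate_ks, mean_acc) if acc >= best)

# Example usage: k = choose_adaptive_k(retrieval_set, [1, 3, 5, 9, 15], score_fn)
```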
Human weight estimation is useful in a variety of potential applications, e.g., targeted advertising, entertainment scenarios and forensic science. However, estimating weight only from color cues is particularly challenging, since these cues are quite sensitive to lighting and imaging conditions. In this article, we propose a novel weight estimator based on a single RGB-D image, which utilizes both visual color cues and depth information. Our main contributions are three-fold. First, we construct the W8-400 dataset, which includes RGB-D images of different people with ground-truth weight. Second, a novel side-view shape feature and a feature fusion model are proposed to facilitate weight estimation; we also consider gender as another important factor for human weight estimation. Third, we conduct comprehensive experiments using various regression models and feature fusion models on the new weight dataset, and encouraging results are obtained based on the proposed features and models.
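The estimation setup can be sketched as feature fusion followed by regression. The feature extractors and the choice of regressor below are illustrative assumptions, not the models evaluated in the paper:

```python
# Weight estimation as feature fusion + regression: concatenate per-image color
# and depth (e.g., side-view shape) features, append gender, fit a regressor.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_weight_estimator(color_feats, depth_feats, gender, weights_kg):
    X = np.hstack([color_feats, depth_feats, gender.reshape(-1, 1)])
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X, weights_kg)
    return model
```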
In this paper, we propose a computational framework, called Image Re-Attentionizing, to endow a target region in an image with the ability to attract human visual attention. In particular, the objective is to recolor the target patches via color transfer so that visual attention is augmented while naturalness and smoothness are preserved. We approach this objective within the Markov Random Field (MRF) framework and develop an extended graph cuts method to pursue the solution. The input image is first over-segmented into patches, and the patches within the target region as well as their neighbors are used to construct the consistency graphs. Within the MRF framework, the unary potentials are defined to encourage each target patch to match patches with similar shapes and textures from a large salient-patch database, each of which corresponds to a high-saliency region in one image, while spatial and color coherence is reinforced through pairwise potentials. We evaluate the proposed method on direct human fixation data. The results demonstrate that the target regions successfully attract human attention while both spatial and color coherence are well preserved.
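The energy being minimized can be sketched as a sum of unary patch-matching costs and pairwise coherence terms over the label assignment (each label selects one database patch used to recolor a target patch). The cost functions below are placeholders, and the paper minimizes such an energy with an extended graph cuts method rather than by direct evaluation:

```python
# Sketch of an MRF energy over candidate patch-label assignments.
def mrf_energy(labels, unary_cost, pairwise_cost, edges, lam=1.0):
    """labels[i] = database patch assigned to target patch i;
    edges = list of (i, j) index pairs of neighboring patches."""
    data_term = sum(unary_cost(i, labels[i]) for i in range(len(labels)))
    smooth_term = sum(pairwise_cost(labels[i], labels[j]) for i, j in edges)
    return data_term + lam * smooth_term
```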
Recently, visual saliency has attracted wide attention from researchers in the computer vision and multimedia fields. However, most visual saliency research has been conducted on still images to study static saliency. In this paper, we present, for the first time, a comprehensive comparative study of dynamic saliency (video shots) and static saliency (key frames of the corresponding video shots), and obtain two key observations: 1) video saliency is often different from, yet closely related to, image saliency, and 2) camera motions, such as tilting, panning or zooming, affect dynamic saliency significantly. Motivated by these observations, we propose a novel camera-motion and image-saliency aware model for dynamic saliency prediction. Extensive experiments on two static-vs-dynamic saliency datasets collected by us show that our proposed method outperforms state-of-the-art methods for dynamic saliency prediction. Finally, we introduce the application of dynamic saliency prediction to dynamic video captioning, helping people with hearing impairments better enjoy videos that contain only off-screen voices, e.g., documentary films, news videos and sports videos.
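The interplay of camera motion and static saliency can be illustrated with a simple stand-in: estimate global camera motion between consecutive frames, compensate for it, and boost a static saliency map where residual (object) motion remains. This sketch uses dense optical flow and a median-flow approximation of camera motion; it is not the proposed model:

```python
# Illustrative camera-motion-aware combination of static and motion saliency.
import cv2
import numpy as np

def dynamic_saliency(prev_gray, curr_gray, static_saliency):
    # Dense optical flow contains both camera and object motion.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Approximate global camera motion by the median flow vector.
    camera_motion = np.median(flow.reshape(-1, 2), axis=0)
    residual = np.linalg.norm(flow - camera_motion, axis=2)
    residual = residual / (residual.max() + 1e-12)
    # Blend static saliency with camera-compensated motion saliency.
    return 0.5 * static_saliency + 0.5 * residual
```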