1. Introduction
The human hand plays a central role in communication when people interact with each other and with the environment in everyday life. The recognition of hand gestures and human activities is closely related to the locations and orientations of human hands. Therefore, reliably detecting human hands [1] from single color images or videos captured by common image sensors is essential for many computer vision applications, such as human–computer interaction [2,3], human hand pose estimation [4,5,6], human gesture recognition [7,8], human activity analysis [9], and so on.
In the computer vision field, the pipeline of hand-related applications usually contains three steps: (1) hand detection, (2) hand pose estimation, and (3) static gesture recognition or dynamic action recognition. The second step is optional, because gesture/action recognition can be performed with or without hand pose estimation. About five years ago, hand pose estimation and action recognition were the most challenging steps (or bottlenecks) in the pipeline, even in constrained environments (typically a single hand against a simple background), where the hand can easily be detected or is assumed to be already cropped. Therefore, the hand-related community primarily focused on the second and third steps in the past. Nowadays, however, hand pose estimation and gesture/action recognition in constrained environments are reaching maturity, and hand-related applications in unconstrained environments (complex backgrounds and an unknown number of hands per image) will be an important trend in the future. Under these conditions, hand detection in an unconstrained environment becomes the new bottleneck in hand-related work, and high-precision hand detection will be a crucial step in the pipeline of hand-related applications in unconstrained environments. In this paper, we focus on the hand detection algorithm.
Traditional hand detection methods primarily utilize low-level image features such as skin color [10] and shape [11,12] for hand region detection. Nowadays, convolutional neural network (CNN) based detection approaches [13,14,15,16,17,18,19] have proved to be more robust and accurate [20,21,22] thanks to the discriminative deep features they learn. However, compared to common objects, human hands are highly articulated and appear in various orientations, scales, shapes, and skin colors, sometimes with partial occlusions; therefore, reliably detecting multiple human hands in unconstrained, cluttered scenes remains a challenging problem.
To tackle this problem, we propose an approach that accurately detects human hands from single color images by reconstructing the hand appearances. It can also be applied to video clips, since a video clip can be regarded as a sequence of single images. Our approach is primarily inspired by multitask learning [23], which improves the generalization of a network by learning tasks in parallel over a shared representation. The quality of the hand detection task is closely related to the diversity of hand appearances in terms of hand shape, skin color, orientation, scale, partial occlusion, etc. As a result, the shared information contained in the training signal of the hand appearance reconstruction task can be exploited as an inductive bias to improve the performance of the hand detection task.
The general process of the proposed approach is illustrated in Figure 1. Firstly, feature maps of the whole input image are extracted by shared convolutional layers. Then, they are fed into a region proposal network (RPN) to generate possible region proposals, a.k.a. regions of interest (RoIs). Finally, the feature maps of the RoIs are used to classify the corresponding labels (hand/background), to refine the locations of the detected hands, and to reconstruct the appearances of the hands simultaneously.
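This three-stage flow can be sketched with placeholder arrays. Everything below (the tensor shapes, the fixed proposals, the toy heads) is illustrative only and not the trained network described in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the shared convolutional feature maps of the whole image,
# shaped (channels, height, width); a real backbone CNN produces these.
feat = rng.standard_normal((8, 16, 16))

# Stand-in for the RPN output: region proposals as (x0, y0, x1, y1) boxes
# on the feature-map grid. A real RPN predicts these; here they are fixed.
proposals = [(2, 3, 6, 7), (9, 4, 14, 10)]

def roi_features(feat, box, out_size=4):
    """Crop and resize the features of one RoI to a fixed out_size x out_size
    grid (nearest-neighbour here for brevity; the paper aggregates RoI
    features with bilinear interpolation)."""
    x0, y0, x1, y1 = box
    ys = np.linspace(y0, y1 - 1, out_size).round().astype(int)
    xs = np.linspace(x0, x1 - 1, out_size).round().astype(int)
    return feat[:, ys][:, :, xs]

# Each RoI representation feeds three heads at once: classification,
# bounding-box refinement, and appearance reconstruction (the "heads"
# below are placeholders, not trained layers).
for box in proposals:
    r = roi_features(feat, box)
    cls_score = r.mean()                 # hand / background head
    loc_refine = r.max(axis=(1, 2))      # location refinement head
    recon_code = r.reshape(-1)           # input to the reconstruction head
```

The key structural point is that all three heads consume the same per-RoI features, which is what lets the reconstruction task inject shared information into detection.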
In this paper, a new hybrid detection/reconstruction CNN framework is presented. The detection branch and the reconstruction branch share the same feature representation, so that the shared information contained in the hand appearance reconstruction branch can be utilized to boost the performance of the hand detection branch. We adopt the idea of the variational autoencoder (VAE) [24,25] for hand appearance reconstruction. On the one hand, to the best of our knowledge, no previous work has adopted a VAE in a hand or object detection framework. On the other hand, our approach is not a simple combination of Faster R-CNN, VAE, and GAN. Different from the traditional VAE model, our VAE model is asymmetric. During detection, the shared information contained in the shared CNN and RPN layers can be effectively exploited by the asymmetric detection-VAE structure. Besides, we find that a GAN [26] can further improve the performance of the model by generating more realistic hand appearances.
To evaluate the proposed approach, we compare it with existing state-of-the-art methods on public hand detection benchmarks, i.e., the Oxford hand dataset [27] and the EgoHand dataset [28]. Experimental results show that our approach achieves the highest detection accuracy among the state of the art.
It is noted that some related works also utilize generative methods for detection, but their contributions are very different from ours. In [29], a VAE is used to reconstruct traffic trajectories so that anomalies can be detected in videos. In [30], a GAN is used to super-resolve small objects so that traffic signs can be detected reliably, and in [31] hard positive examples are generated by a GAN to train a classifier that recognizes hard anomalous examples better. In contrast, we target the challenging multiple-hand detection problem, which is very different from the detection tasks mentioned above. In our approach, the detection and reconstruction branches share the same features, and the detection performance is improved by introducing the shared information contained in the hand appearance reconstruction branch.
The rest of the paper is organized as follows. In Section 2, related works on hand detection and image reconstruction are reviewed. Details of the proposed approach are provided in Section 3. Experimental results are presented in Section 4. The details of our motivation and the asymmetric VAE are discussed in Section 5. A brief conclusion is given in Section 6.
2. Related Works
Robust human hand detection in an unconstrained complex environment is one of the most challenging tasks in computer vision. The existing hand detection methods can be classified into two categories as follows.
Traditional hand detection methods use hand-crafted weak features, such as skin color and shape features. These hand detection algorithms require less computational resources. To address the skin color diversity caused by different human races and illumination changes, Girondel et al. [32] tried several color spaces and found that the Cb and Cr channels of the YCbCr color space worked well for skin detection. Sigal et al. [10] proposed a Gaussian mixture model that performed well under different illumination conditions. However, skin-color-based methods are susceptible to backgrounds with colors similar to skin. There are also approaches that use shape features to detect hands. Guo et al. [12] trained an SVM classifier on HOG features for detection. Based on [12], Mittal et al. [27] developed a mixture of deformable parts model to detect hands precisely. Additionally, Karlinsky et al. [33] proposed an approach that locates hands by detecting the relative positions between the hands and other human body parts. However, because hands appear at multiple scales and in various rotations within a single picture, it is difficult to train a model suitable for unconstrained complex environments.
CNN-based detection methods have recently become a popular research topic in the computer vision field, because higher-level deep features can be learned by the networks. The multi-scale and rotation problems mentioned above can be well addressed using CNNs. Recent research has focused on three principal directions for developing better object detection systems, and these directions are also applicable to CNN-based hand detection. The three principal directions are introduced as follows.
The first principal direction is to change the base architecture of the networks. The main idea is to adopt state-of-the-art convolutional networks to extract robust features, which benefits not only classification but also localization. Recent examples include ResNet [34], ResNet-Inception [35], and FPN [36] for object detection. Le et al. [21] proposed fusing multi-scale feature maps directly for detection and classification to avoid missing small hands. Qing et al. [37] proposed a feature-map-fused SSD that uses deconvolution to combine deep layers with shallow layers for hand detection. However, the best detection precision in this direction remains relatively low, i.e., 75.1% [21] on the Oxford dataset, so it is necessary to explore other directions to further improve the detection performance.
The second principal direction is to exploit the data itself by augmenting the diversity of the training data. Traditionally, the training data can be efficiently augmented by randomly generated spatial transformations [14,15], such as random translation, rotation, scaling, and cropping, which is a common technique widely used by the computer vision community. Besides, generative methods can also be utilized for data augmentation in hand-related applications. For example, in [38] a GAN is adopted to augment training data for hand gesture recognition, and in [39,40] a VAE and a GAN are utilized to augment datasets for hand pose estimation. Generative methods can be employed to augment cropped image patches containing a single hand, but no previous work has augmented data with generative methods for the hand detection task. The reason may be that an unconstrained scene usually contains multiple hands, and the contextual information among the hands and the background is very hard to model and randomize.
The third principal direction is to utilize contextual reasoning, proxy tasks, and other top-down mechanisms to improve the representations used for object detection. He et al. [16] proposed using segmentation to contextually prime object detectors and to provide feedback to the initial layers. In [20,22,41], hand rotation information was introduced to improve detection precision. Deng et al. [20] adopted a CNN to learn the rotation angles of hands, so that the orientations of human hands could be regularized by a spatial transformer network. Based on the idea of [20], Li et al. [22] proposed an embedded implementation of an SSD (Single Shot MultiBox Detector) based hand detection and orientation estimation algorithm that detects hands more efficiently. Narasimhaswamy et al. [41] introduced a contextual attention module for hand location and orientation estimation in unconstrained images. Most of the related works in this direction improve the representation for hand detection by utilizing hand rotation information, and the best detection precision in this direction is 83.2% [22] on the Oxford dataset. However, we argue that the hand detection representation can be further improved by exploiting more comprehensive information.
Our work follows this third principal direction. Hand appearance reconstruction is utilized to introduce shared information into our detection framework. Different from previous hand detection works, which only introduce rotation or orientation related information, reconstruction can handle much more complex information about the hand, such as scale, shape, skin color, and sometimes partial occlusion.
To the best of our knowledge, there is no VAE based method for hand detection. VAEs [24,42,43,44] can reconstruct images effectively, but they only minimize the mean square error (MSE) between the reconstructed and original images, so the reconstructed images may look blurry. Therefore, we utilize a GAN to improve the reconstructed images.
There are some related works that apply generative methods to detection tasks, but their ideas are very different from ours. Kelathodi et al. [29] proposed a video object trajectory classification and anomaly detection method, in which a VAE was used to reconstruct gradient conversions of trajectories and anomalies were detected by t-SNE. Li et al. [30] proposed perceptual generative adversarial networks, by which small objects (e.g., traffic signs) were super-resolved to narrow the representation differences between objects of various scales. Wang et al. [31] utilized a GAN to generate hard positive examples for training a classifier that could recognize hard anomalous examples more accurately.
GAN based methods are also utilized in hand-related tasks such as abnormal dynamic gesture recognition [38] and gesture translation [45], but these works are very different from the image-based hand detection task on which this work focuses. For example, [38] addresses dynamic hand gesture classification and abnormal hand gesture detection problems using a GAN, but the method is based on electromyography (EMG) signals acquired from the forearm muscles (i.e., the UC2018 DualMyo and UC2017 SG datasets). In [45], a GAN was adopted to translate one hand gesture into another, where only a single hand appeared per image. To the best of our knowledge, GAN based methods have not previously been used in hand detection applications.
5. Discussion
5.1. Motivation of Our Approach
Reliably detecting multiple hands in cluttered scenes is a challenging task because of the complex appearance diversity of dexterous human hands in color images, for example, different hand shapes, highly articulated hand poses, skin colors, illuminations, orientations, scales, etc. It is impossible to collect training samples that evenly cover all these types of hand appearance diversity, and false labels are inevitable. To better understand the motivation of our approach, the details are discussed from the following aspects.
Detection without reconstruction: in this case, the optimizer merely pursues minimizing the detection loss, and the network learns a model that best fits a specific training dataset. As human hand appearances are very complex in unconstrained scenarios, the unbalanced distribution of training samples, the bias, and the false labels of specific training datasets would lead to overfitting and performance degeneration. A detection-only scheme may ignore important hand appearance features that could help generalization.
Detection and reconstruction with VAE: in this case, the optimizer pursues not only maximizing the detection score but also reconstructing the hand appearances as faithfully as possible. The learned RoI features contain specific information for the hand detection task, as well as comprehensive information for the hand appearance reconstruction task. The detection/reconstruction scheme encourages the detector to classify labels and to regress locations based on learned RoI features with comprehensive hand appearance information, rather than on features useful for the detection task only. Furthermore, the VAE forces the learned RoI features (the "code") toward a normalized Gaussian distribution, and this characteristic helps to alleviate the bias and overfitting problem.
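As a concrete instance of this regularizer, the standard closed-form KL divergence between a diagonal Gaussian code and the standard normal prior can be written as follows (this is the generic VAE term, not code taken from the paper):

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for a diagonal Gaussian:
    the penalty a VAE adds to pull the latent code toward the prior."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
```

A code that already matches the prior costs nothing (`kl_to_standard_normal(np.zeros(4), np.zeros(4))` is `0.0`), while any drift of the mean or variance away from N(0, I) is penalized, which is exactly the normalization effect described above.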
Detection and reconstruction with VAE and GAN: an appealing property of the GAN is that its discriminator network implicitly learns a rich similarity metric for appearance reconstruction. By utilizing the GAN, the local details of the reconstructed hands tend to be finer and sharper. As better local details contribute positively to hand localization, our approach with the GAN demonstrates more accurate location prediction.
5.2. Asymmetric VAE
The asymmetry of our VAE has two aspects: (1) the input and output are asymmetric; (2) the structure of the encoder/decoder is asymmetric.
The input and output are asymmetric: in traditional VAE models, one input image corresponds to one output image of the same size, and the contents of the two images are the same. In our VAE model, by contrast, one input image corresponds to multiple output images. The input is the whole image of arbitrary size, and the outputs are image patches of uniform size (i.e., 32 × 32). The content of the input is an unconstrained scene (multiple hands and a complex background), and the content of each output is the hand appearance corresponding to one detected region. Our VAE network differs from applying a traditional symmetric VAE to the input image in a sliding-window fashion: sliding-window methods scan regions evenly with a constant step, whereas in our model the region proposals are generated unevenly and their aspect ratios vary. The proposed method also differs from applying a traditional symmetric VAE to image patches cropped from the detected regions: we do not extract features from cropped image patches; instead, based on the shared features, a bilinear interpolation algorithm is used to aggregate the features within each region proposal.
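The bilinear sampling step can be sketched on a single 2-D map as follows. This is a generic RoIAlign-style interpolation, shown for one channel and one sampling point; the actual aggregation in the paper operates over the multi-channel shared features of each proposal:

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a 2-D feature map at a fractional (y, x) location by bilinear
    interpolation, so RoI features can be aggregated on the feature grid
    without cropping and rescaling image pixels."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feat.shape[0] - 1)
    x1 = min(x0 + 1, feat.shape[1] - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bot = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bot

f = np.array([[0.0, 2.0],
              [4.0, 6.0]])
center = bilinear_sample(f, 0.5, 0.5)  # mean of the four neighbours
```

Sampling at an integer location returns the stored value exactly, while fractional locations blend the four surrounding cells, which is what makes the aggregation differentiable with respect to the proposal coordinates.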
The structure of the encoder/decoder is asymmetric: the structures of the encoder and the decoder are very different from each other. The encoder contains a shared convolutional module and a region proposal module, whereas the decoder only contains five deconvolutional layers and a sigmoid layer (refer to the decoder module in Figure 3). For details, refer to Section 3.1 and Section 3.2.
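The five deconvolutional layers determine how the decoder grows a latent code into a 32 × 32 patch. With a commonly used configuration (kernel 4, stride 2, padding 1, which is an assumption here, not the paper's stated hyperparameters), the spatial size doubles at every layer:

```python
def deconv_out(size, kernel=4, stride=2, pad=1):
    """Spatial output size of one transposed convolution (no output padding):
    out = (in - 1) * stride - 2 * pad + kernel."""
    return (size - 1) * stride - 2 * pad + kernel

sizes = [1]                 # start from a 1 x 1 latent code (assumed)
for _ in range(5):          # five deconvolutional layers
    sizes.append(deconv_out(sizes[-1]))
# sizes is now [1, 2, 4, 8, 16, 32]
```

Under these assumptions, exactly five such layers map a 1 × 1 code to the 32 × 32 reconstructed hand patches mentioned above; the sigmoid layer then squashes the output into the valid pixel range.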