
1 Introduction

Object recognition is a challenging problem in computer vision, deep learning and robotics [24, 28]. Automatic recognition of unseen objects in complex scenes is a highly desirable capability for intelligent systems [29, 31, 32]. Developing such vision capabilities involves an off-line training phase, in which training data along with labels is provided, and the object recognition system then predicts the classes of unseen examples at test time. To achieve high recognition accuracy, a few design considerations are required. For instance, a large number of labeled training examples is required to ensure good generalization of the deep neural network. In addition, feature descriptors must be descriptive and representative to mitigate the effect of high inter- and intra-class variation. At test time, the system must also be computationally efficient to enable real-time recognition on robots.

Traditional object recognition methods use hand-crafted features extracted from 2D images [13, 35]. Recent deep learning methods have been shown to achieve good recognition performance on 2D images [16, 17, 34]. The availability of low-cost depth scanners has enabled the extraction of 2.5D/3D information and more representative features from images; however, RGB-D data comes with new challenges [36]. For instance, in contrast to conventional RGB images, RGB-D data is noisy and incomplete (because of holes), posing additional challenges for recognition systems. In addition, labeled RGB-D training data is scarce compared to traditional RGB images, which further constrains the training of powerful deep neural networks on RGB-D data. Recent works have aimed at addressing these problems [4, 18, 22, 25, 26, 29, 33], with a particular emphasis on the scarcity of large-scale annotated training datasets.

In recent years, feature representation techniques have rapidly evolved from hand-crafted features to automatic feature learning [27, 37]. The most prevalent methods are based on deep neural networks, which have been shown to achieve state-of-the-art performance [7, 11, 12, 38]. Deep learning based recognition techniques rely on the features learned by the fully connected layers, which appear towards the end of the network. Although fully connected layers contain rich semantic information, they are spatially very coarse [14] and thus need to be complemented by computationally expensive pre-processing steps [7].

Fig. 1. Block diagram of our proposed network. The input image is given to a CNN, which consists of convolutional and average pooling layers. The CNN produces a 3D matrix, which is given as input to RNNs. The latter learn to generate the final feature vector.

In this paper, we address these issues by proposing a deep learning framework, called the spatial hierarchical analysis deep neural network (shown in Fig. 1), which consists of a convolutional neural network followed by recurrent neural networks applied in a hierarchical fashion to extract translationally invariant, descriptive features. In contrast to existing deep learning techniques, which rely on transfer learning or pre-trained networks for object recognition, our proposed network is trained from scratch on RGB-D data. The inputs to our network are RGB-D images captured using a Kinect scanner. The network first extracts features from each modality separately. Each image is given as input to a CNN, which extracts low-level features such as edges. The responses of the CNN are then given to RNNs, which have shown superior performance in language-related domains such as image captioning and text parsing. In this paper, we explore RNNs for learning high-level compositional features from images. Compared to existing RGB-D feature learning methods [4, 8], our approach is computationally efficient and does not need additional input channels such as surface normals.

The contribution of this paper can be summarised as follows:

  • We propose a novel spatial hierarchical analysis deep learning architecture, which extracts low and high level descriptive features and part interactions in hierarchical fashion.

  • The proposed technique is efficient and does not require any additional information channels such as surface normals to achieve good performance.

  • The proposed deep network achieves superior performance on two publicly available RGB-D datasets.

The rest of this paper is organised as follows. Related work is presented in the next section. The proposed technique and experimental results are provided in Sects. 3 and 4, respectively. The paper is concluded in Sect. 5.

2 Related Work

Prior works on object recognition relied on hand-crafted features such as SIFT [23], spin images [15] and kernel-based representations [4] for the colour, depth and 3D domains. Spin images [15] are popular local 3D shape features, which have been widely applied to 3D meshes and point clouds for object recognition. Several variants of spin images [1, 22] have also been proposed to improve on the original formulation. The fast point feature histogram [6] is a local feature which has been shown to outperform spin images in 3D object registration. Normal aligned radial features (NARF) [2] extract object boundary cues to perform recognition. These features, however, fail to capture important cues such as edges and size for object recognition. Kernel descriptors [4] are able to generate rich features by turning any pixel attribute into patch-level features [30].

Despite their simplicity, the aforementioned techniques rely on prior knowledge of the underlying data distribution, which is not readily available in most applications. Recently, automatic feature learning using machine learning approaches has received significant attention. For instance, deep belief nets [7] learn a hierarchy of features by greedily training each layer separately using a restricted Boltzmann machine. Lee et al. [20] proposed convolutional deep belief networks (CDBN) to learn features from full-sized images. The CDBN shares weights between the hidden and visible layers and uses a small receptive field. Convolutional neural networks [16] are feed-forward models that have been successfully applied to object/face recognition, face/object detection, character recognition and pose estimation.

Liu et al. [21] proposed guided cross-layer pooling to extract local features using a sub-array of convolutional layers. In [12], concatenated convolutional layers were used in local regions for feature representation. Schwarz et al. [25] used a simple colorization scheme for the depth images to perform transfer learning. The drawback of their method is that it ignores the significance of the earlier convolutional layers and uses the fully connected layers for feature representation. Gupta et al. [11] encoded the depth modality as HHA, a combination of horizontal disparity, height above ground and angle with gravity. However, their embedding is geocentric, and such information is not always available in recognition tasks, which are object-centric.

To overcome the limitations of existing methods, we propose the spatial hierarchical analysis deep neural network, ShaNet, which requires neither a pre-trained model nor transfer learning for the task of object recognition. In addition, the proposed method does not require additional information channels to achieve superior recognition performance.

Fig. 2. CNN filter visualization for RGB (left) and depth (right) images. Only a few of the filters learnt by our model are shown here.

3 Proposed Spatial Hierarchical Analysis Deep Neural Network

In this section, we describe our proposed Spatial Hierarchical Analysis Network (ShaNet), which learns translationally invariant and distinctive features. The lower hierarchy of the network consists of a convolutional neural network (CNN) to achieve translational invariance, and the upper hierarchy consists of recurrent neural networks (RNNs) to learn more distinctive features.

3.1 Network Initialization and Training

Our proposed deep neural network learns distinctive features in a hierarchical fashion; its appropriate initialization is therefore essential. We initialize the network in two stages. In the first stage, we initialize the CNN filters in an unsupervised way following [9]. Given a set of input images, we first extract random patches from these images and normalize them. The extracted patches are then clustered in an unsupervised way using the k-means algorithm.

We use the k-means algorithm because it is simple to implement, computationally efficient, and does not require tuning of any hyper-parameters. In the second stage, the weights of the RNNs are initialized using the technique proposed by Le et al. [19]. We observed that, compared to random weight initialization, this approach leads to better optimization.
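A minimal NumPy/scikit-learn sketch of this first (k-means) initialization stage is given below; the patch size, the number of patches sampled per image and the use of scikit-learn's KMeans are illustrative assumptions rather than the exact settings of our MATLAB implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_cnn_filters(images, num_filters=128, patch_size=9,
                     patches_per_image=50, seed=0):
    """Unsupervised CNN filter initialization: cluster normalized random
    patches with k-means and use the centroids as convolution filters.
    Patch size and patches_per_image are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    patches = []
    for img in images:                       # img: (H, W) single-channel array
        H, W = img.shape
        for _ in range(patches_per_image):
            r = rng.integers(0, H - patch_size + 1)
            c = rng.integers(0, W - patch_size + 1)
            patches.append(img[r:r + patch_size, c:c + patch_size].ravel())
    patches = np.asarray(patches, dtype=np.float64)
    # Normalize each patch (zero mean, unit variance) before clustering.
    patches -= patches.mean(axis=1, keepdims=True)
    patches /= patches.std(axis=1, keepdims=True) + 1e-8
    km = KMeans(n_clusters=num_filters, n_init=10, random_state=seed).fit(patches)
    return km.cluster_centers_.reshape(num_filters, patch_size, patch_size)
```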

Since our network learns to extract distinctive features during training, its appropriate initialization is critical. A random initialization can make the variance of a neuron's output grow with the number of its incoming connections. To alleviate this problem, we use Xavier initialization [10] and randomly initialize the weights with a variance that depends on the number of incoming and outgoing connections of a neuron (\(n_{f-in}\) and \(n_{f-out}\), respectively):

$$\begin{aligned} Var(w) = \frac{2}{n_{f-in} + n_{f-out}}, \end{aligned}$$
(1)

where w denotes the network weights. Note that the fan-out measure is included in the variance to balance the back-propagated signal as well. Xavier initialization works well in our case and leads to better convergence rates.
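For illustration, a minimal NumPy sketch of Eq. (1) is given below; drawing the weights from a zero-mean Gaussian with the prescribed variance is an assumption, since Xavier initialization can equivalently use a uniform distribution.

```python
import numpy as np

def xavier_init(n_in, n_out, rng=None):
    """Xavier (Glorot) initialization as in Eq. (1): weights drawn with
    variance 2 / (n_in + n_out) to balance forward and backward signal scale."""
    if rng is None:
        rng = np.random.default_rng(0)
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

# Example: a layer mapping a 128-dimensional feature to 51 class scores.
W = xavier_init(128, 51)
```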

To avoid over-fitting, we use batch-normalization as our regularization strategy. Given a set of activations \(\{\mathbf {x}^i : i \in [1,a]\}\) (where \(\mathbf {x}^i = \{x^i_j : j \in [1,b]\}\) has b dimensions) from a given layer corresponding to a specific input batch with a images, we compute the first and second order statistics (mean and variance respectively) of the batch for each dimension of activations as follows:

$$\begin{aligned} \mu _{x_j} = \frac{1}{a}\sum _{i=1}^{a} x_j^i, \qquad \sigma ^2_{x_j} = \frac{1}{a}\sum _{i=1}^{a} (x_j^i - \mu _{x_j})^2, \end{aligned}$$
(2)

\(\mu _{x_j}\) and \(\sigma ^2_{x_j}\) represent the mean and variance for the \(j^{th}\) activation dimension computed over a batch, respectively. The normalized activation operation is represented as:

$$\begin{aligned} \hat{x}_j^i = \frac{x_j^i - \mu _{x_j}}{\sqrt{\sigma ^2_{x_j} + \epsilon }}. \end{aligned}$$
(3)

We observe that normalization of the activations alone is not sufficient, because it can alter the activations and disrupt the useful patterns learned by the network. Therefore, we rescale and shift the normalized activations so that the network can still learn useful discriminative representations:

$$\begin{aligned} y_j^i = \gamma _j \hat{x}_j^i + \beta _j, \end{aligned}$$
(4)

where \(\gamma _j \) and \(\beta _j\) are the learnable parameters which are tuned during error back-propagation.
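A minimal NumPy sketch of the batch-normalization forward pass described by Eqs. (2)-(4) is given below; the layout of the activations as an \(a \times b\) matrix follows the notation above, and the value of \(\epsilon \) is an illustrative assumption.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization following Eqs. (2)-(4).
    x:     (a, b) activations for a batch of a images, b dimensions each
    gamma: (b,) learnable scale, beta: (b,) learnable shift."""
    mu = x.mean(axis=0)                     # per-dimension batch mean, Eq. (2)
    var = x.var(axis=0)                     # per-dimension batch variance, Eq. (2)
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized activations, Eq. (3)
    return gamma * x_hat + beta             # rescale and shift, Eq. (4)
```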

After the initialization of the proposed model, the CNN filters (shown in Fig. 2) are convolved over the input image to extract features in the lower hierarchy of our deep network. Each input image of size \(N \times N\) is convolved with L square filters of size \(m \times m\), resulting in L filter responses, each of size \((N - m + 1)\times (N - m + 1)\). A nonlinearity is then applied to these responses. The filter responses of size \((N - m + 1)\times (N - m + 1)\) are next average pooled over square regions of size \(l \times l\) with a stride of s, giving a pooled response whose width and height are equal to \(\frac{(N - m + 1) - l}{s} + 1\).
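The following NumPy/SciPy sketch illustrates this convolution and average-pooling stage and the resulting size arithmetic; the pooling size l, stride s and the sigmoid nonlinearity are illustrative assumptions rather than the exact settings of our implementation.

```python
import numpy as np
from scipy.signal import correlate2d

def conv_avg_pool(image, filters, l=4, s=4):
    """'Valid' filtering with L square filters, an element-wise nonlinearity
    (sigmoid assumed here), then average pooling with window l and stride s."""
    # L filter responses of size (N - m + 1) x (N - m + 1)
    responses = np.stack([correlate2d(image, f, mode='valid') for f in filters])
    responses = 1.0 / (1.0 + np.exp(-responses))       # assumed nonlinearity
    r = responses.shape[1]
    out = (r - l) // s + 1                             # ((N - m + 1) - l) / s + 1
    pooled = np.zeros((filters.shape[0], out, out))
    for i in range(out):
        for j in range(out):
            window = responses[:, i*s:i*s + l, j*s:j*s + l]
            pooled[:, i, j] = window.mean(axis=(1, 2))  # average pooling
    return pooled
```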

Fig. 3. Spatial hierarchical analysis network feature learning. The 3D matrix X (left) from the CNN is given to a hierarchy of RNNs, which merge 4 adjacent vectors at each step to obtain the final feature p (right).

The output of the CNN is a 3D matrix X of size \(L\times \alpha \times \alpha \) for each input image. For a given 3D matrix X, a block of size \(L\times \beta \times \beta \) consisting of adjacent vectors in X is defined, as shown in Fig. 3. Since 4 adjacent vectors are used in the horizontal and vertical directions, \(\beta \) is equal to 4 in this case. As a result, we obtain blocks of size \(L\times 4 \times 4\), where L = 128. The vectors in the 3D matrix are then merged step-wise into the parent vector p (as shown in Fig. 3) by mapping the input \(X \in \mathbb {R}^{128 \times 64 \times 64}\) to a representation \(p \in \mathbb {R}^{128}\), as follows:

$$\begin{aligned} p^{(1)} = f(W^{(1)}X + b^{(1)}) \end{aligned}$$
(5)
$$\begin{aligned} p^{(2)} = f(W^{(2)}p^{(1)} + b^{(2)}) \end{aligned}$$
(6)
$$\begin{aligned} p = f(W^{(3)}p^{(2)} + b^{(3)}) \end{aligned}$$
(7)

where \(W^{(i)}\), i = 1, 2, 3, are the parameter matrices, f(.) is a non-linear activation function (sigmoid in our case), \(b^{(i)}\) are the bias vectors, and \(p^{(1)}\), \(p^{(2)}\) and p have dimensions \(\mathbb {R}^{L\times \alpha /4\times \alpha /4}\), \( \mathbb {R}^{L\times \alpha /16\times \alpha /16}\) and \(\mathbb {R}^{L}\), respectively. In our implementation, the vector p is used as the input feature vector to a softmax classifier. The input and output sizes and the parameters of our proposed network are reported in Table 1.
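A minimal NumPy sketch of this hierarchical merging (Eqs. (5)-(7)) is given below, using the sizes reported above (\(L = 128\), \(\alpha = 64\), \(\beta = 4\)); the random weight values and the flattening of each \(L \times \beta \times \beta \) block before the linear map are illustrative assumptions.

```python
import numpy as np

def rnn_merge(X, weights, biases, beta=4,
              f=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """Hierarchical merging sketch for Eqs. (5)-(7): at each level, every
    beta x beta block of adjacent L-dimensional vectors is flattened and
    mapped back to L dimensions through a sigmoid layer."""
    L = X.shape[0]
    for W, b in zip(weights, biases):        # one (W, b) pair per hierarchy level
        _, a, _ = X.shape
        out = a // beta
        merged = np.zeros((L, out, out))
        for i in range(out):
            for j in range(out):
                block = X[:, i*beta:(i+1)*beta, j*beta:(j+1)*beta].reshape(-1)
                merged[:, i, j] = f(W @ block + b)   # Eqs. (5)-(7)
        X = merged
    return X.reshape(L)                      # final feature vector p of size L

# Example with the sizes used above: X of size 128 x 64 x 64 -> p of size 128.
rng = np.random.default_rng(0)
L, alpha, beta = 128, 64, 4
weights = [rng.normal(0, 0.01, (L, L * beta * beta)) for _ in range(3)]
biases = [np.zeros(L) for _ in range(3)]
X = rng.normal(size=(L, alpha, alpha))
p = rnn_merge(X, weights, biases)            # p.shape == (128,)
```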

Table 1. Input, Output (feature sizes) and parameters of the proposed Spatial Hierarchical Analysis Deep Neural Network.

4 Experimental Results

The proposed deep neural network is evaluated on the publicly available Washington RGB-D [18] and 2D3D [6] datasets, which are widely used for benchmarking RGB-D object recognition techniques. In the following, we will briefly describe the datasets and compare our method against several state-of-the-art algorithms.

4.1 Washington RGB-D Object Dataset

The Washington RGB-D dataset contains 300 household object instances organized into 51 categories. Each instance is captured using a Kinect scanner on a revolving turntable from three elevation angles (30\(^{\circ }\), 45\(^{\circ }\) and 60\(^{\circ }\)). We follow the experimental setup of Lai et al. [18] in our evaluation and use the same training/testing splits and cropped images. We then compute LBP features for each image and pass the image to the network for feature learning. Our object recognition results and a comparison with the state-of-the-art are reported in Table 2. Our proposed technique achieves an object recognition accuracy of 89.8% on RGB-D images; the second best performance is achieved by CNN-colourized. Note that our approach achieves superior performance on all modalities compared to existing RGB-D object recognition methods.

4.2 2D3D Object Dataset

The 2D3D object dataset contains 16 categories of highly textured common objects (e.g. drink cartons, computer monitors). We follow the experimental protocol of Browatzki et al. [6] for a fair comparison. Due to the small number of examples, we combine the spoon, knife and fork classes into a joint silverware class and exclude the phone and perforator classes. This results in a final dataset of 156 instances and 14 classes for category recognition. Our experimental results are reported in Table 3. The proposed approach achieves better performance than state-of-the-art methods.

Table 2. Performance comparison in terms of recognition accuracy (in %) of the proposed technique with state-of-the-art methods on Washington RGB-D object dataset. The reported accuracy is an average over 10 trials.
Table 3. Performance comparison in terms of recognition accuracy (in %) of the proposed technique with state-of-the-art methods on 2D3D Object Dataset.

The superior performance of the proposed network can be attributed to the hierarchical architecture of the deep neural network, which learns translationally invariant and distinctive features in the lower and higher levels of the architecture, respectively.

4.3 Computation/Implementation Details

The experiments were run on a high-performance computing system with an NVIDIA Titan V GPU and 128 GB of RAM. Our code was implemented in MATLAB.

5 Conclusion and Future Directions

In this paper, we proposed a spatial hierarchical analysis deep neural network for RGB-D object recognition. The proposed network consists of a CNN and RNNs that learn distinctive features in a hierarchical fashion. The translationally invariant features of the CNN are analyzed and merged systematically using RNNs to obtain the most representative and descriptive feature for a given input image. The proposed technique has been tested on two publicly available RGB-D datasets for the task of object recognition, and our deep neural network achieves state-of-the-art performance on these datasets.

In our implementation, the CNN generates a 3D matrix of size \(128 \times 64 \times 64\), which is merged to obtain a final feature vector of size \(128 \times 1\). As future work, we intend to test 3D matrices of higher dimensions and, instead of merging 4 adjacent vectors (as done in this work), to choose a larger neighbourhood for combining these vectors. This will require more RNNs in the architecture and more computational resources. In our technique, we have used the sigmoid activation function; however, we believe that recognition performance can be further increased by using the ReLU activation function.