Abstract
Deep learning based object recognition methods have achieved unprecedented success in recent years. However, this level of success is yet to be achieved on multimodal RGB-D images, which can play an important role in several computer vision and robotics applications. In this paper, we present a spatial hierarchical analysis deep neural network, called ShaNet, for RGB-D object recognition. Our network combines a convolutional neural network (CNN) with recurrent neural networks (RNNs) to analyse and learn distinctive and translationally invariant features in a hierarchical fashion. Unlike existing methods, which employ pre-trained models or rely on transfer learning, the proposed network is trained from scratch on RGB-D data. The proposed model has been tested on two publicly available RGB-D datasets, the Washington RGB-D and 2D3D object datasets. Our experimental results show that the proposed deep neural network achieves superior performance compared to existing RGB-D object recognition methods.
This research is supported by Murdoch University, Australia.
1 Introduction
Object recognition is a challenging problem in computer vision, deep learning and robotics [24, 28]. Automatic recognition of unseen objects in complex scenes is a highly desirable characteristic for intelligent systems [29, 31, 32]. Developing such vision capabilities involves off-line training, where training data along with labels is provided, and the intelligent object recognition system then predicts the classes of unseen examples at test time. To achieve high recognition accuracy, a few design considerations are required. For instance, a large number of labeled training examples is required to ensure good generalization of the deep neural network. In addition, feature descriptors must be descriptive and representative to mitigate the effect of high inter- and intra-class variation. At test time, the intelligent system is also required to be computationally efficient to ensure real-time recognition for robots.
Traditional object recognition methods use hand-crafted features extracted from 2D images [13, 35]. Recent advances in deep learning have been shown to achieve good recognition performance for 2D images [16, 17, 34]. The availability of low-cost depth scanners has enabled the extraction of 2.5D/3D information and more representative features from images; however, RGB-D data comes with new challenges [36]. For instance, in contrast to conventional RGB images, RGB-D data is noisy and incomplete (because of holes), posing additional challenges for recognition systems. In addition, compared to traditional RGB images, labeled RGB-D training data is scarce, which further constrains the deployment of powerful deep learning techniques for training deep neural networks on RGB-D images. Recent research works have aimed at addressing these problems [4, 18, 22, 25, 26, 29, 33], with a particular emphasis on the scarcity of large-scale annotated training datasets.
In recent years, feature representation techniques have rapidly evolved from hand-crafted features to automatic feature learning [27, 37]. The most prevalent methods are based on deep neural networks, which have been shown to achieve state-of-the-art performance [7, 11, 12, 38]. Deep learning based recognition techniques rely on the features learned by the fully-connected layers, which appear towards the end of the network. Although fully-connected layers contain rich semantic information, they are spatially very coarse [14] and thus need to be complemented by computationally expensive pre-processing steps [7].
In this paper, we address these issues by proposing a deep learning framework, called the spatial hierarchical analysis deep neural network (shown in Fig. 1), which consists of a convolutional neural network followed by recurrent neural networks applied in a hierarchical fashion to extract translationally invariant, descriptive features. In contrast to existing deep learning techniques, which rely on transfer learning or pre-trained networks for object recognition, our proposed network is trained from scratch on RGB-D data. The inputs to our network are RGB-D images captured using a Kinect scanner. Initially, the network extracts features from each modality separately. Each image is given as input to a CNN, which extracts low-level features such as edges. The CNN responses are then given to RNNs, which have shown superior performance in text analysis tasks such as image captioning and text parsing. In this paper, we explore RNNs for learning high-level compositional features from images. Compared to existing RGB-D feature learning methods [4, 8], our approach is computationally efficient and does not need additional input channels such as surface normals.
The contribution of this paper can be summarised as follows:
- We propose a novel spatial hierarchical analysis deep learning architecture, which extracts low- and high-level descriptive features and part interactions in a hierarchical fashion.
- The proposed technique is efficient and does not require any additional information channels, such as surface normals, to achieve good performance.
- The proposed deep network achieves superior performance on two publicly available RGB-D datasets.
The rest of this paper is organised as follows. Related work is presented in the next section. The proposed technique and experimental results are provided in Sects. 3 and 4, respectively. The paper is concluded in Sect. 5.
2 Related Work
Prior works on object recognition relied on hand-crafted features such as SIFT [23], spin images [15] and kernel-based representations [4] for the colour, depth and 3D domains. Spin images [15] are popular local 3D shape features, which have been widely applied to 3D meshes and point clouds for object recognition. Several variants of spin images [1, 22] have also been proposed to improve on the original formulation. The fast point feature histogram [6] is a local feature which has been shown to outperform spin images in 3D object registration. Normal aligned radial features (NARF) [2] extract object boundary cues to perform recognition. These features, however, fail to capture important cues such as edges and size for object recognition. Kernel descriptors [4] are able to generate rich features by turning any pixel attribute into patch-level features [30].
Despite their simplicity, the aforementioned techniques rely on prior knowledge of the underlying data distribution, which is not readily available in most applications. Recently, automatic feature learning using machine learning approaches has received significant attention. For instance, deep belief nets [7] learn a hierarchy of features by greedily training each layer separately using a restricted Boltzmann machine. Lee et al. [20] proposed convolutional deep belief networks (CDBN) to learn features from full-sized images. The CDBN shares weights between the hidden and visible layers and uses a small receptive field. Convolutional neural networks [16] are feed-forward models that have been successfully applied to object/face recognition, face/object detection, character recognition and pose estimation.
Liu et al. [21] proposed guided cross-layer pooling to extract local features using sub-arrays of convolutional layers. In [12], concatenated convolutional layers were used in local regions for feature representation. Schwarz et al. [25] used a simple colorization scheme for the depth images to perform transfer learning. The drawback of their method is that it ignores the significance of the earlier convolutional layers and uses only the fully connected layers for feature representation. Gupta et al. [11] encoded the depth modality as HHA, a combination of horizontal disparity, height above ground and angle with gravity. However, the limitation of their method is that the proposed embedding is geocentric, and such information is not always available in recognition tasks, which are object-centric.
To overcome the limitations of existing methods, we propose the spatial hierarchical analysis deep neural network, ShaNet, which requires neither a pre-trained model nor transfer learning for the task of object recognition. In addition, the proposed method does not require additional information channels to achieve superior recognition performance.
3 Proposed Spatial Hierarchical Analysis Deep Neural Network
In this section, we describe our proposed Spatial Hierarchical Analysis Network (ShaNet), which learns translationally invariant and distinctive features. The lower hierarchy of the network consists of a convolutional neural network (CNN) to achieve translational invariance, and the upper hierarchy consists of recurrent neural networks (RNNs) to learn more distinctive features.
3.1 Network Initialization and Training
Our proposed deep neural network learns distinctive features in a hierarchical fashion; its appropriate initialization is therefore essential. We initialize the network in two stages. In the first stage, we initialize the CNN filters in an unsupervised way following [9]. Given a set of input images, we first extract random patches from these images and normalize them. The extracted patches are then clustered in an unsupervised way using the k-means algorithm.
We use the k-means algorithm because it is simple to implement, computationally efficient and does not require the tuning of hyper-parameters. In the second stage, the weights of the RNNs are initialized using the technique proposed by Le et al. [19]. We observed that, compared to a random initialization of the weights, this approach achieves better optimization.
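As an illustration, the first-stage filter initialization can be sketched as follows. This is a minimal NumPy/scikit-learn example in the spirit of [9] (our implementation is in MATLAB); the patch size, the number of patches per image and the use of scikit-learn's KMeans are assumptions for illustration, with the number of filters set to L = 128 as used later in the network.

```python
# Unsupervised CNN filter initialization via k-means clustering of random
# image patches. Illustrative sketch; patch size and sampling are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def init_filters_kmeans(images, num_filters=128, patch_size=9,
                        patches_per_image=50, seed=0):
    """images: array of shape (num_images, N, N) of single-channel inputs."""
    rng = np.random.default_rng(seed)
    patches = []
    for img in images:
        n = img.shape[0]
        for _ in range(patches_per_image):
            r = rng.integers(0, n - patch_size + 1)
            c = rng.integers(0, n - patch_size + 1)
            patches.append(img[r:r + patch_size, c:c + patch_size].ravel())
    patches = np.asarray(patches, dtype=np.float64)

    # Normalize each patch (zero mean, unit variance) before clustering.
    patches -= patches.mean(axis=1, keepdims=True)
    patches /= patches.std(axis=1, keepdims=True) + 1e-8

    # Cluster the normalized patches; the centroids act as the initial filters.
    kmeans = KMeans(n_clusters=num_filters, n_init=10, random_state=seed).fit(patches)
    return kmeans.cluster_centers_.reshape(num_filters, patch_size, patch_size)
```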
Since our network learns to extract distinctive features during training, its appropriate initialization is critical. A random initialization of the network can make the variance of its output directly proportional to the number of its incoming connections. To alleviate this problem, we use Xavier initialization [10] and randomly initialize the weights with a variance that depends on the number of incoming and outgoing connections (\(k_{f-in}\) and \(k_{f-out}\), respectively) of a neuron:

\(\mathrm{Var}(w) = \frac{2}{k_{f-in} + k_{f-out}},\)

where w are the network weights. Note that the fan-out measure is included in the variance above to balance the back-propagated signal as well. Xavier initialization works well in our case and leads to better convergence rates.
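A minimal sketch of this initialization, assuming a Gaussian draw with the variance above (the layer shape shown is a hypothetical example, not a setting reported in the paper):

```python
# Xavier/Glorot initialization [10]: weights drawn with variance
# 2 / (fan_in + fan_out). The 16*128 -> 128 layer shape is illustrative only.
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(loc=0.0, scale=std, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W = xavier_init(fan_in=16 * 128, fan_out=128, rng=rng)
```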
To avoid over-fitting, we use batch-normalization as our regularization strategy. Given a set of activations \(\{\mathbf {x}^i : i \in [1,a]\}\) (where \(\mathbf {x}^i = \{x^i_j : j \in [1,b]\}\) has b dimensions) from a given layer corresponding to a specific input batch with a images, we compute the first and second order statistics (mean and variance, respectively) of the batch for each dimension of the activations as follows:

\(\mu _{x_j} = \frac{1}{a}\sum _{i=1}^{a} x^i_j, \qquad \sigma ^2_{x_j} = \frac{1}{a}\sum _{i=1}^{a} \big (x^i_j - \mu _{x_j}\big )^2,\)

where \(\mu _{x_j}\) and \(\sigma ^2_{x_j}\) represent the mean and variance for the \(j^{th}\) activation dimension computed over a batch, respectively. The normalized activation operation is represented as:

\(\hat{x}^i_j = \frac{x^i_j - \mu _{x_j}}{\sqrt{\sigma ^2_{x_j} + \epsilon }},\)

where \(\epsilon \) is a small constant for numerical stability. We observe that normalizing the activations alone is not sufficient, because it can alter the activations and disrupt the useful patterns learned by the network. Therefore, we rescale and shift the normalized activations to allow the network to learn useful discriminative representations:

\(y^i_j = \gamma _j \hat{x}^i_j + \beta _j,\)

where \(\gamma _j \) and \(\beta _j\) are learnable parameters which are tuned during error back-propagation.
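As an illustration, the forward pass of this batch-normalization step can be sketched as follows (a minimal NumPy example covering training mode only; the running statistics used at test time are omitted):

```python
# Batch normalization: per-dimension batch mean and variance, normalization,
# then a learnable scale (gamma) and shift (beta).
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: activations of shape (a, b) -- a images in the batch, b dimensions."""
    mu = x.mean(axis=0)                      # per-dimension batch mean
    var = x.var(axis=0)                      # per-dimension batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalized activations
    return gamma * x_hat + beta              # rescaled and shifted output
```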
After initialization of the proposed model, the CNN filters (shown in Fig. 2) are convolved over the input image to extract features in the lower hierarchy of our deep network. Each input image of size N \(\times \) N is convolved with L square filters of size \(m \times m\), resulting in L filter responses, each of size \((N - m + 1)\times (N - m + 1)\), to which the CNN then applies its nonlinearity. The filter responses of size \((N - m + 1)\times (N - m + 1)\) are next average-pooled over square regions of size \(l \times l\) with a stride of s, to obtain a pooled response with width and height equal to \(\frac{(N - m + 1) - l}{s}+1\).
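A small worked example of this size bookkeeping is given below; the numeric values (N = 144, m = 9, l = 10, s = 2) are assumptions chosen only so that the pooled map matches the \(64 \times 64\) size used in the next paragraph, and are not necessarily the settings reported in Table 1.

```python
# Size bookkeeping for the CNN stage: an N x N input convolved with an
# m x m filter gives an (N - m + 1) x (N - m + 1) response, which is then
# average-pooled over l x l regions with stride s.
def conv_out_size(n, m):
    return n - m + 1

def pool_out_size(n, l, s):
    return (n - l) // s + 1

conv_size = conv_out_size(n=144, m=9)              # 136
pooled_size = pool_out_size(conv_size, l=10, s=2)  # 64, i.e. a 64 x 64 pooled map
```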
The output of the CNN is a 3D matrix X of size \(L\times \alpha \times \alpha \) for each input image. For a given 3D matrix X, a block of size \(L\times \beta \times \beta \) consisting of adjacent vectors in the matrix X is defined, as shown in Fig. 3. Note that 4 adjacent vectors are used in the horizontal and vertical directions; \(\beta \) is therefore equal to 4 in this case. As a result, we get a block of size \(L\times 4 \times 4\), where L = 128. The vectors in the 3D matrix are then merged step-wise into a parent vector p (as shown in Fig. 3) by mapping the input \(X \in \mathbb {R}^{128 \times 64 \times 64}\) to a representation \(p \in \mathbb {R}^{128}\), as follows:

\(p = f\big (W^{(i)} \, [x_1; x_2; \ldots ; x_{\beta ^2}] + b\big ),\)

where \(x_1, x_2, \ldots , x_{\beta ^2}\) are the adjacent child vectors of a block, \(W^{(i)}\), i = 1, 2, 3, ... is the parameter matrix, f(.) is a non-linear activation function (sigmoid in this case), b is the bias vector, and \(p^{(1)}\), \(p^{(2)}\) and \(p^{(3)}\) are the intermediate parent representations of dimension \(\mathbb {R}^{L\times \alpha /4\times \alpha /4}\), \( \mathbb {R}^{L\times \alpha /16\times \alpha /16}\) and \(\mathbb {R}^{L}\), respectively. In our implementation, the vector p is used as the feature vector for a softmax classifier. The input and output sizes, and the parameters of our proposed network, are reported in Table 1.
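As an illustration of this step-wise merging, the following is a minimal NumPy sketch (with randomly initialized, untrained parameters) that collapses a \(128 \times 64 \times 64\) CNN output to a single 128-dimensional feature by repeatedly merging non-overlapping \(4 \times 4\) blocks of adjacent vectors; the per-level bias vectors and the scale of the random weights are assumptions for illustration.

```python
# Hierarchical (RNN-style) merging: at each level, every non-overlapping
# 4 x 4 block of adjacent L-dimensional vectors is concatenated and mapped
# through a sigmoid layer to one parent vector, so the spatial grid shrinks
# 64 -> 16 -> 4 -> 1 and the final feature p has dimension L = 128.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def merge_level(x, w, b, beta=4):
    """x: (L, A, A) grid of vectors -> (L, A // beta, A // beta) parent grid."""
    L, A, _ = x.shape
    out = np.empty((L, A // beta, A // beta))
    for i in range(A // beta):
        for j in range(A // beta):
            block = x[:, i * beta:(i + 1) * beta, j * beta:(j + 1) * beta]
            # Concatenate the beta^2 adjacent child vectors of the block.
            children = np.concatenate(
                [block[:, r, c] for r in range(beta) for c in range(beta)])
            out[:, i, j] = sigmoid(w @ children + b)
    return out

rng = np.random.default_rng(0)
L = 128
X = rng.standard_normal((L, 64, 64))                   # CNN output (illustrative)
Ws = [0.01 * rng.standard_normal((L, 16 * L)) for _ in range(3)]
bs = [np.zeros(L) for _ in range(3)]

p = X
for w, b in zip(Ws, bs):
    p = merge_level(p, w, b)                           # 64 -> 16 -> 4 -> 1
p = p.reshape(L)                                       # final feature for the softmax classifier
```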
4 Experimental Results
The proposed deep neural network is evaluated on the publicly available Washington RGB-D [18] and 2D3D [6] datasets, which are widely used for benchmarking RGB-D object recognition techniques. In the following, we will briefly describe the datasets and compare our method against several state-of-the-art algorithms.
4.1 Washington RGB-D Object Dataset
The Washington RGB-D dataset contains 300 household object instances organized into 51 categories. Each instance is captured using a Kinect scanner on a revolving turntable from three elevation angles (30\(^{\circ }\), 45\(^{\circ }\) and 60\(^{\circ }\)). We follow the experimental setup of Lai et al. [18] in our evaluation and use the same training/testing splits and cropped images as suggested in [18]. We then compute LBP features for each image and pass them to the network for feature learning. Our object recognition results and a comparison with the state-of-the-art are reported in Table 2. Our proposed technique achieves an object recognition accuracy of 89.8% on RGB-D images; the second best performance is achieved by CNN-colourized. Note that our approach achieves superior performance for all modalities compared to existing RGB-D object recognition methods.
4.2 2D3D Object Dataset
The 2D3D object dataset contains 16 different categories of highly textured common objects (e.g. drink cartons, computer monitors). We follow the experimental protocol of Browatzki et al. [6] for a fair comparison. Due to the small number of examples, we combine the spoon, knife and fork classes into a joint silverware class and exclude the phone and perforator classes. This results in a final dataset of 156 instances and 14 classes for category recognition. Our experimental results are reported in Table 3. The proposed approach achieves better performance compared to state-of-the-art methods.
The superior performance of the proposed network can be attributed to the hierarchical architecture of the deep neural network, which learns translationally invariant and distinctive features in the lower and higher levels of the architecture, respectively.
4.3 Computation/Implementation Details
The experiments were run on a high-performance computing system with an NVIDIA Titan V GPU and 128 GB of RAM. Our code was implemented in MATLAB.
5 Conclusion and Future Directions
In this paper, we proposed a spatial hierarchical analysis deep neural network for RGB-D object recognition. The proposed network consists of a CNN and RNNs to learn distinctive features in a hierarchical fashion. The translationally invariant features of the CNN are analyzed and merged systematically using RNNs to obtain the most representative and descriptive feature for a given input image. The proposed technique has been tested on two publicly available RGB-D datasets for the task of object recognition, and achieves state-of-the-art performance on these datasets.
In our implementation, the CNN generates a 3D matrix of size \(128 \times 64 \times 64\), which is merged to obtain a final feature vector of size \(128 \times 1\). As future work, we intend to test 3D matrices of higher dimensions and, instead of merging 4 adjacent vectors (as done in this work), to choose a bigger neighbourhood for combining these vectors. This will require more RNNs in the architecture and more computational resources. In our technique, we have used the sigmoid activation function; however, we believe that recognition performance can be further improved by using the ReLU activation function.
References
Asif, U., Bennamoun, M., Sohel, F.: Efficient RGB-D object categorization using cascaded ensembles of randomized decision trees. In: 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 1295–1302. IEEE (2015)
Bai, J., Wu, Y., Zhang, J., Chen, F.: Subset based deep learning for RGB-D object recognition. Neurocomputing 165, 280–292 (2015)
Blum, M., Springenberg, J.T., Wülfing, J., Riedmiller, M.: A learned feature descriptor for object recognition in RGB-D data. In: 2012 IEEE International Conference on Robotics and Automation, pp. 1298–1303. IEEE (2012)
Bo, L., Ren, X., Fox, D.: Depth kernel descriptors for object recognition. In: 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 821–826. IEEE (2011)
Bo, L., Ren, X., Fox, D.: Unsupervised feature learning for RGB-D based object recognition. In: Desai, J., Dudek, G., Khatib, O., Kumar, V. (eds.) Experimental Robotics. Springer Tracts in Advanced Robotics, vol. 88, pp. 387–402. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-319-00065-7_27
Browatzki, B., Fischer, J., Graf, B., Bülthoff, H.H., Wallraven, C.: Going into depth: evaluating 2D and 3D cues for object classification on a new, large-scale object dataset. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 1189–1195. IEEE (2011)
Chatfield, K., Simonyan, K., Vedaldi, A., Zisserman, A.: Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531 (2014)
Cheng, Y., Zhao, X., Huang, K., Tan, T.: Semi-supervised learning for RGB-D object recognition. In: 2014 22nd International Conference on Pattern Recognition, pp. 2377–2382. IEEE (2014)
Coates, A., Ng, A., Lee, H.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215–223 (2011)
Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256 (2010)
Gupta, S., Girshick, R., Arbeláez, P., Malik, J.: Learning rich features from RGB-D images for object detection and segmentation. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 345–360. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10584-0_23
Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 447–456 (2015)
Hu, H., Shah, S.A.A., Bennamoun, M., Molton, M.: 2D and 3D face recognition using convolutional neural network. In: TENCON 2017–2017 IEEE Region 10 Conference, pp. 133–132. IEEE (2017)
Jhuo, I.-H., Gao, S., Zhuang, L., Lee, D.T., Ma, Y.: Unsupervised feature learning for RGB-D image classification. In: Cremers, D., Reid, I., Saito, H., Yang, M.-H. (eds.) ACCV 2014. LNCS, vol. 9003, pp. 276–289. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16865-4_18
Johnson, A.E., Hebert, M.: Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Trans. Pattern Anal. Mach. Intell. 21(5), 433–449 (1999)
Khan, S., Rahmani, H., Shah, S.A.A., Bennamoun, M.: A guide to convolutional neural networks for computer vision. Synth. Lect. Comput. Vis. 8(1), 1–207 (2018)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
Lai, K., Bo, L., Ren, X., Fox, D.: A large-scale hierarchical multi-view RGB-D object dataset. In: 2011 IEEE International Conference on Robotics and Automation, pp. 1817–1824. IEEE (2011)
Le, Q.V., Jaitly, N., Hinton, G.E.: A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941 (2015)
Lee, H., Pham, P., Largman, Y., Ng, A.Y.: Unsupervised feature learning for audio classification using convolutional deep belief networks. In: Advances in Neural Information Processing Systems, pp. 1096–1104 (2009)
Liu, L., Shen, C., van den Hengel, A.: The treasure beneath convolutional layers: cross-convolutional-layer pooling for image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4749–4757 (2015)
Liu, W., Ji, R., Li, S.: Towards 3D object detection with bimodal deep Boltzmann machines over RGBD imagery. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3013–3021 (2015)
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
Nadeem, U., Shah, S.A.A., Bennamoun, M., Togneri, R., Sohel, F.: Image set classification for low resolution surveillance. arXiv preprint arXiv:1803.09470 (2018)
Schwarz, M., Schulz, H., Behnke, S.: RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features. In: 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 1329–1335. IEEE (2015)
Shah, S., Bennamoun, M., Boussaid, F., El-Sallam, A.: A novel local surface description for automatic 3D object recognition in low resolution cluttered scenes. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 638–643 (2013)
Shah, S.A., Nadeem, U., Bennamoun, M., Sohel, F., Togneri, R.: Efficient image set classification using linear regression based image reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 99–108 (2017)
Shah, S.A.A., Bennamoun, M., Boussaid, F.: Performance evaluation of 3D local surface descriptors for low and high resolution range image registration. In: 2014 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pp. 1–7. IEEE (2014)
Shah, S.A.A., Bennamoun, M., Boussaid, F.: A novel 3D vorticity based approach for automatic registration of low resolution range images. Pattern Recogn. 48(9), 2859–2871 (2015)
Shah, S.A.A., Bennamoun, M., Boussaid, F.: Iterative deep learning for image set based face and object recognition. Neurocomputing 174, 866–874 (2016)
Shah, S.A.A., Bennamoun, M., Boussaid, F.: A novel feature representation for automatic 3D object recognition in cluttered scenes. Neurocomputing 205, 1–15 (2016)
Shah, S.A.A., Bennamoun, M., Boussaid, F.: Keypoints-based surface representation for 3D modeling and 3D object recognition. Pattern Recogn. 64, 29–38 (2017)
Shah, S.A.A., Bennamoun, M., Boussaid, F., El-Sallam, A.A.: 3D-div: a novel local surface descriptor for feature matching and pairwise range image registration. In: 2013 IEEE International Conference on Image Processing, pp. 2934–2938. IEEE (2013)
Shah, S.A.A., Bennamoun, M., Boussaid, F., El-Sallam, A.A.: Automatic object detection using objectness measure. In: 2013 1st International Conference on Communications, Signal Processing, and Their Applications (ICCSPA), pp. 1–6. IEEE (2013)
Shah, S.A.A., Bennamoun, M., Boussaid, F., While, L.: Evolutionary feature learning for 3-D object recognition. IEEE Access 6, 2434–2444 (2017)
Shah, S.A.A., Bennamoun, M., Molton, M.: A fully automatic framework for prediction of 3D facial rejuvenation. In: 2018 International Conference on Image and Vision Computing New Zealand (IVCNZ), pp. 1–6. IEEE (2018)
Shah, S.A.A., Bennamoun, M., Molton, M.K.: Machine learning approaches for prediction of facial rejuvenation using real and synthetic data. IEEE Access 7, 23779–23787 (2019)
Sharif Razavian, A., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806–813 (2014)
Zaki, H.F., Shafait, F., Mian, A.: Localized deep extreme learning machines for efficient RGB-D object recognition. In: 2015 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pp. 1–8. IEEE (2015)
Acknowledgment
The author would like to thank NVIDIA for their Titan-V GPU donation.