
1 Introduction

Object recognition is a challenging problem in computer vision, deep learning and robotics [24, 28]. Automatic recognition of unseen objects in complex scenes is a highly desirable capability for intelligent systems [29, 31, 32]. Developing such vision capabilities involves an off-line training phase, in which training data along with labels is provided, and the object recognition system then predicts the classes of unseen examples at test time. To achieve high recognition accuracy, a few design considerations are required. For instance, a large number of labeled training examples is required to ensure good generalization of the deep neural network. In addition, feature descriptors must be descriptive and representative to mitigate the effect of high inter- and intra-class variation. At test time, the system must also be computationally efficient to enable real-time recognition on robots.

Traditional object recognition methods use hand-crafted features extracted from 2D images [13, 35]. Recent deep learning methods have been shown to achieve good recognition performance on 2D images [16, 17, 34]. The availability of low-cost depth scanners has enabled the extraction of 2.5D/3D information and more representative features from images; however, RGB-D data comes with new challenges [36]. For instance, in contrast to conventional RGB images, RGB-D data is noisy and incomplete (because of holes), posing additional challenges for recognition systems. In addition, labeled RGB-D training data is scarce compared to traditional RGB images, which further constrains the training of powerful deep neural networks on RGB-D data. Recent works have aimed at addressing these problems [4, 18, 22, 25, 26, 29, 33], with a particular emphasis on the scarcity of large-scale annotated training datasets.

In recent years, feature representation techniques have rapidly evolved from hand-crafted features to automatic feature learning [27, 37]. The most prevalent methods are based on deep neural networks, which have been shown to achieve state-of-the-art performance [7, 11, 12, 38]. Deep learning based recognition techniques rely on the features learned by the fully connected layers, which appear towards the end of the network. Although fully connected layers contain rich semantic information, they are spatially very coarse [14] and thus need to be complemented by computationally expensive pre-processing steps [7].

Fig. 1. Block diagram of our proposed network. The input image is given to a CNN, which consists of convolutional and average pooling layers. The CNN produces a 3D matrix, which is given as input to RNNs. The latter learn to generate the final feature vector.

In this paper, we address these issues by proposing a deep learning framework, called the spatial hierarchical analysis deep neural network (shown in Fig. 1), which consists of a convolutional neural network followed by recurrent neural networks applied in a hierarchical fashion to extract translationally invariant, descriptive features. In contrast to existing deep learning techniques, which rely on transfer learning or pre-trained networks for object recognition, our proposed network is trained from scratch on RGB-D data. The inputs to our network are RGB-D images captured using a Kinect scanner. The network first extracts features from each modality separately. Each image is given as input to a CNN, which extracts low-level features such as edges. The responses of the CNN are then given to RNNs, which have shown superior performance in language-related domains such as image captioning and text parsing. In this paper, we explore RNNs for learning high-level compositional features from images. Compared to existing RGB-D feature learning methods [4, 8], our approach is computationally efficient and does not need additional input channels such as surface normals.

The contribution of this paper can be summarised as follows:

  • We propose a novel spatial hierarchical analysis deep learning architecture, which extracts low and high level descriptive features and part interactions in hierarchical fashion.

  • The proposed technique is efficient and does not require any additional information channels such as surface normals to achieve good performance.

  • The proposed deep network achieves superior performance on two publicly available RGB-D datasets.

The rest of this paper is organised as follows. Related work is presented in the next section. The proposed technique and experimental results are provided in Sects. 3 and 4, respectively. The paper is concluded in Sect. 5.

2 Related Work

Prior works on object recognition relied on hand-crafted features such as SIFT [23], spin images [15] and kernel-based representations [4] for the colour, depth and 3D domains. Spin images [15] are popular local 3D shape features, which have been widely applied to 3D meshes and point clouds for object recognition. Several variants of spin images [1, 22] have also been proposed to improve on the original formulation. The fast point feature histogram [6] is a local feature which has been shown to outperform spin images in 3D object registration. Normal aligned radial features (NARF) [2] extract object boundary cues to perform recognition. These features, however, fail to capture important cues such as edges and size for object recognition. Kernel descriptors [4] are able to generate rich features by turning any pixel attribute into patch-level features [30].

Despite their simplicity, the aforementioned techniques rely on prior knowledge of the underlying data distribution, which is not readily available in most applications. Recently, automatic feature learning using machine learning approaches has received significant attention. For instance, deep belief nets [7] learn a hierarchy of features by greedily training each layer separately using a restricted Boltzmann machine. Lee et al. [20] proposed convolutional deep belief networks (CDBN) to learn features from full-sized images. The CDBN shares weights between the hidden and visible layers and uses a small receptive field. Convolutional neural networks [16] are feed-forward models that have been successfully applied to object/face recognition, face/object detection, character recognition and pose estimation.

Liu et al. [21] proposed guided cross-layer pooling to extract local features using a sub-array of convolutional layers. In [12], concatenated convolutional layers were used in local regions for feature representation. Schwarz et al. [25] used a simple colorization scheme for the depth images to perform transfer learning. The drawback of their method is that it ignores the significance of the earlier convolutional layers and uses the fully connected layers for feature representation. Gupta et al. [11] encoded the depth modality as HHA, a combination of horizontal disparity, height above ground and angle with gravity. However, their embedding is geocentric, and such information is not always available in recognition tasks, which are object-centric.

To overcome the limitations of existing methods, we propose the spatial hierarchical analysis deep neural network, ShaNet, which requires neither a pre-trained model nor transfer learning for the task of object recognition. In addition, the proposed method does not require additional information channels to achieve superior recognition performance.

Fig. 2. CNN filter visualization for RGB (left) and depth (right) images. Only a few of the filters learnt by our model are shown here.

3 Proposed Spatial Hierarchical Analysis Deep Neural Network

In this section, we describe our proposed Spatial Hierarchical Analysis Network (ShaNet), which learns translationally invariant and distinctive features. The lower hierarchy of the network consists of a convolutional neural network (CNN) to achieve translational invariance, and the upper hierarchy consists of recurrent neural networks (RNNs) to learn more distinctive features.

3.1 Network Initialization and Training

Our proposed deep neural network learns distinctive features in a hierarchical fashion; its appropriate initialization is therefore essential. We initialize the network in two stages. In the first stage, we initialize the CNN filters in an unsupervised way following [9]. Given a set of input images, we first extract random patches from these images and normalize them. The extracted patches are then clustered in an unsupervised way using the k-means algorithm.

We use the k-means algorithm because it is simple to implement, computationally efficient, and does not require tuning of any hyper-parameters. In the second stage, the weights of the RNNs are initialized using the technique proposed by Le et al. [19]. We observed that, compared to random weight initialization, this approach leads to better optimization.
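A minimal NumPy/scikit-learn sketch of this first (k-means) initialization stage is given below; the patch size, the number of patches sampled per image and the use of scikit-learn's KMeans are illustrative assumptions rather than the exact settings of our MATLAB implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_cnn_filters(images, num_filters=128, patch_size=9,
                     patches_per_image=50, seed=0):
    """Unsupervised CNN filter initialization: cluster normalized random
    patches with k-means and use the centroids as convolution filters.
    Patch size and patches_per_image are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    patches = []
    for img in images:                       # img: (H, W) single-channel array
        H, W = img.shape
        for _ in range(patches_per_image):
            r = rng.integers(0, H - patch_size + 1)
            c = rng.integers(0, W - patch_size + 1)
            patches.append(img[r:r + patch_size, c:c + patch_size].ravel())
    patches = np.asarray(patches, dtype=np.float64)
    # Normalize each patch (zero mean, unit variance) before clustering.
    patches -= patches.mean(axis=1, keepdims=True)
    patches /= patches.std(axis=1, keepdims=True) + 1e-8
    km = KMeans(n_clusters=num_filters, n_init=10, random_state=seed).fit(patches)
    return km.cluster_centers_.reshape(num_filters, patch_size, patch_size)
```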

Since our network learns to extract distinctive features during training, its appropriate initialization is critical. A random initialization can make the variance of a neuron's output grow with the number of its incoming connections. To alleviate this problem, we use Xavier initialization [10] and randomly initialize the weights with a variance that depends on the number of incoming and outgoing connections of a neuron (\(n_{f-in}\) and \(n_{f-out}\), respectively):

$$\begin{aligned} Var(w) = \frac{2}{n_{f-in} + n_{f-out}}, \end{aligned}$$
(1)

where w denotes the network weights. Note that the fan-out measure is included in the variance to balance the back-propagated signal as well. Xavier initialization works well in our case and leads to better convergence rates.
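For illustration, a minimal NumPy sketch of Eq. (1) is given below; drawing the weights from a zero-mean Gaussian with the prescribed variance is an assumption, since Xavier initialization can equivalently use a uniform distribution.

```python
import numpy as np

def xavier_init(n_in, n_out, rng=None):
    """Xavier (Glorot) initialization as in Eq. (1): weights drawn with
    variance 2 / (n_in + n_out) to balance forward and backward signal scale."""
    if rng is None:
        rng = np.random.default_rng(0)
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

# Example: a layer mapping a 128-dimensional feature to 51 class scores.
W = xavier_init(128, 51)
```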

To avoid over-fitting, we use batch-normalization as our regularization strategy. Given a set of activations \(\{\mathbf {x}^i : i \in [1,a]\}\) (where \(\mathbf {x}^i = \{x^i_j : j \in [1,b]\}\) has b dimensions) from a given layer corresponding to a specific input batch with a images, we compute the first and second order statistics (mean and variance respectively) of the batch for each dimension of activations as follows:

$$\begin{aligned} \mu _{x_j} = \frac{1}{a}\sum _{i=1}^{a} x_j^i, \qquad \sigma ^2_{x_j} = \frac{1}{a}\sum _{i=1}^{a} (x_j^i - \mu _{x_j})^2, \end{aligned}$$
(2)

\(\mu _{x_j}\) and \(\sigma ^2_{x_j}\) represent the mean and variance for the \(j^{th}\) activation dimension computed over a batch, respectively. The normalized activation operation is represented as:

$$\begin{aligned} \hat{x}_j^i = \frac{x_j^i - \mu _{x_j}}{\sqrt{\sigma ^2_{x_j} + \epsilon }}. \end{aligned}$$
(3)

We observe that normalization of the activations alone is not sufficient, because it can alter the activations and disrupt the useful patterns learned by the network. Therefore, we rescale and shift the normalized activations so that the network can still learn useful discriminative representations:

$$\begin{aligned} y_j^i = \gamma _j \hat{x}_j^i + \beta _j, \end{aligned}$$
(4)

where \(\gamma _j \) and \(\beta _j\) are the learnable parameters which are tuned during error back-propagation.
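A minimal NumPy sketch of the batch-normalization forward pass described by Eqs. (2)-(4) is given below; the layout of the activations as an \(a \times b\) matrix follows the notation above, and the value of \(\epsilon \) is an illustrative assumption.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization following Eqs. (2)-(4).
    x:     (a, b) activations for a batch of a images, b dimensions each
    gamma: (b,) learnable scale, beta: (b,) learnable shift."""
    mu = x.mean(axis=0)                     # per-dimension batch mean, Eq. (2)
    var = x.var(axis=0)                     # per-dimension batch variance, Eq. (2)
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized activations, Eq. (3)
    return gamma * x_hat + beta             # rescale and shift, Eq. (4)
```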

After the initialization of the proposed model, the CNN filters (shown in Fig. 2) are convolved over the input image to extract features in the lower hierarchy of our deep network. Each input image of size \(N \times N\) is convolved with L square filters of size \(m \times m\), resulting in L filter responses, each of size \((N - m + 1)\times (N - m + 1)\). A nonlinearity is then applied to these responses. The filter responses of size \((N - m + 1)\times (N - m + 1)\) are next average pooled over square regions of size \(l \times l\) with a stride of s, giving a pooled response whose width and height are equal to \(\frac{(N - m + 1) - l}{s} + 1\).
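The following NumPy/SciPy sketch illustrates this convolution and average-pooling stage and the resulting size arithmetic; the pooling size l, stride s and the sigmoid nonlinearity are illustrative assumptions rather than the exact settings of our implementation.

```python
import numpy as np
from scipy.signal import correlate2d

def conv_avg_pool(image, filters, l=4, s=4):
    """'Valid' filtering with L square filters, an element-wise nonlinearity
    (sigmoid assumed here), then average pooling with window l and stride s."""
    # L filter responses of size (N - m + 1) x (N - m + 1)
    responses = np.stack([correlate2d(image, f, mode='valid') for f in filters])
    responses = 1.0 / (1.0 + np.exp(-responses))       # assumed nonlinearity
    r = responses.shape[1]
    out = (r - l) // s + 1                             # ((N - m + 1) - l) / s + 1
    pooled = np.zeros((filters.shape[0], out, out))
    for i in range(out):
        for j in range(out):
            window = responses[:, i*s:i*s + l, j*s:j*s + l]
            pooled[:, i, j] = window.mean(axis=(1, 2))  # average pooling
    return pooled
```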

Fig. 3. Spatial hierarchical analysis network feature learning. The 3D matrix X (left) from the CNN is given to a hierarchy of RNNs, which merge 4 adjacent vectors at each step to obtain the final feature p (right).

The output of the CNN is a 3D matrix X of size \(L\times \alpha \times \alpha \) for each input image. For a given 3D matrix X, a block of size \(L\times \beta \times \beta \) consisting of adjacent vectors in X is defined, as shown in Fig. 3. Since 4 adjacent vectors are used in the horizontal and vertical directions, \(\beta \) is equal to 4 in this case. As a result, we obtain blocks of size \(L\times 4 \times 4\), where L = 128. The vectors in the 3D matrix are then merged step-wise into the parent vector p (as shown in Fig. 3) by mapping the input \(X \in \mathbb {R}^{128 \times 64 \times 64}\) to a representation \(p \in \mathbb {R}^{128}\), as follows:

$$\begin{aligned} p^{(1)} = f(W^{(1)}X + b^{(1)}) \end{aligned}$$
(5)
$$\begin{aligned} p^{(2)} = f(W^{(2)}p^{(1)} + b^{(2)}) \end{aligned}$$
(6)
$$\begin{aligned} p = f(W^{(3)}p^{(2)} + b^{(3)}) \end{aligned}$$
(7)

where \(W^{(i)}\), i = 1, 2, 3, are the parameter matrices, f(.) is a non-linear activation function (sigmoid in our case), \(b^{(i)}\) are the bias vectors, and \(p^{(1)}\), \(p^{(2)}\) and p have dimensions \(\mathbb {R}^{L\times \alpha /4\times \alpha /4}\), \( \mathbb {R}^{L\times \alpha /16\times \alpha /16}\) and \(\mathbb {R}^{L}\), respectively. In our implementation, the vector p is used as the input feature vector to a softmax classifier. The input and output sizes and the parameters of our proposed network are reported in Table 1.
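A minimal NumPy sketch of this hierarchical merging (Eqs. (5)-(7)) is given below, using the sizes reported above (\(L = 128\), \(\alpha = 64\), \(\beta = 4\)); the random weight values and the flattening of each \(L \times \beta \times \beta \) block before the linear map are illustrative assumptions.

```python
import numpy as np

def rnn_merge(X, weights, biases, beta=4,
              f=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """Hierarchical merging sketch for Eqs. (5)-(7): at each level, every
    beta x beta block of adjacent L-dimensional vectors is flattened and
    mapped back to L dimensions through a sigmoid layer."""
    L = X.shape[0]
    for W, b in zip(weights, biases):        # one (W, b) pair per hierarchy level
        _, a, _ = X.shape
        out = a // beta
        merged = np.zeros((L, out, out))
        for i in range(out):
            for j in range(out):
                block = X[:, i*beta:(i+1)*beta, j*beta:(j+1)*beta].reshape(-1)
                merged[:, i, j] = f(W @ block + b)   # Eqs. (5)-(7)
        X = merged
    return X.reshape(L)                      # final feature vector p of size L

# Example with the sizes used above: X of size 128 x 64 x 64 -> p of size 128.
rng = np.random.default_rng(0)
L, alpha, beta = 128, 64, 4
weights = [rng.normal(0, 0.01, (L, L * beta * beta)) for _ in range(3)]
biases = [np.zeros(L) for _ in range(3)]
X = rng.normal(size=(L, alpha, alpha))
p = rnn_merge(X, weights, biases)            # p.shape == (128,)
```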

Table 1. Input, Output (feature sizes) and parameters of the proposed Spatial Hierarchical Analysis Deep Neural Network.

4 Experimental Results

The proposed deep neural network is evaluated on the publicly available Washington RGB-D [18] and 2D3D [6] datasets, which are widely used for benchmarking RGB-D object recognition techniques. In the following, we will briefly describe the datasets and compare our method against several state-of-the-art algorithms.

4.1 Washington RGB-D Object Dataset

The Washington RGB-D dataset contains 300 household object instances organized into 51 categories. Each instance is captured using a Kinect scanner on a revolving turntable from three elevation angles (30\(^{\circ }\), 45\(^{\circ }\) and 60\(^{\circ }\)). We follow the experimental setup of Lai et al. [18] in our evaluation and use the same training/testing splits and cropped images. We then compute LBP features for each image and pass the image to the network for feature learning. Our object recognition results and a comparison with the state-of-the-art are reported in Table 2. Our proposed technique achieves an object recognition accuracy of 89.8% on RGB-D images; the second best performance is achieved by CNN-colourized. Note that our approach achieves superior performance on all modalities compared to existing RGB-D object recognition methods.

4.2 2D3D Object Dataset

The 2D3D object dataset contains 16 categories of highly textured common objects (e.g. drink cartons, computer monitors). We follow the experimental protocol of Browatzki et al. [6] for a fair comparison. Due to the small number of examples, we combine the spoon, knife and fork classes into a joint silverware class and exclude the phone and perforator classes. This results in a final dataset of 156 instances and 14 classes for category recognition. Our experimental results are reported in Table 3. The proposed approach achieves better performance than state-of-the-art methods.

Table 2. Performance comparison in terms of recognition accuracy (in %) of the proposed technique with state-of-the-art methods on Washington RGB-D object dataset. The reported accuracy is an average over 10 trials.
Table 3. Performance comparison in terms of recognition accuracy (in %) of the proposed technique with state-of-the-art methods on 2D3D Object Dataset.

The superior performance of the proposed network can be attributed to the hierarchical architecture of the deep neural network, which learns translationally invariant and distinctive features in the lower and higher levels of the architecture, respectively.

4.3 Computation/Implementation Details

The experiments were run on a high-performance computing system with an NVIDIA Titan V GPU and 128 GB of RAM. Our code was implemented in MATLAB.

5 Conclusion and Future Directions

In this paper, we proposed a spatial hierarchical analysis deep neural network for RGB-D object recognition. The proposed network consists of a CNN and RNNs that learn distinctive features in a hierarchical fashion. The translationally invariant features of the CNN are analyzed and merged systematically using RNNs to obtain the most representative and descriptive feature for a given input image. The proposed technique has been tested on two publicly available RGB-D datasets for the task of object recognition, and our deep neural network achieves state-of-the-art performance on these datasets.

In our implementation, the CNN generates a 3D matrix of size \(128 \times 64 \times 64\), which is merged to obtain a final feature vector of size \(128 \times 1\). As future work, we intend to test 3D matrices of higher dimensions and, instead of merging 4 adjacent vectors (as done in this work), to choose a larger neighbourhood for combining these vectors. This will require more RNNs in the architecture and more computational resources. In our technique, we have used the sigmoid activation function; however, we believe that recognition performance can be further increased by using the ReLU activation function.