Bird Image Retrieval and Recognition Using A Deep Learning Platform
June 4, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2918274
ABSTRACT Birdwatching is a common hobby, but identifying bird species typically requires the assistance of bird books. To provide birdwatchers with a handy tool for admiring the beauty of birds, we developed a deep learning platform that assists users in recognizing 27 species of birds endemic to Taiwan through a mobile app named the Internet of Birds (IoB). Bird images were learned by a convolutional neural network (CNN) to localize prominent features in the images. First, we established and generated a bounded region of interest to refine the shapes and colors of the object granularities and subsequently balanced the distribution of bird species. Then, a skip connection method was used to linearly combine the outputs of the previous and current layers to improve feature extraction. Finally, we applied the softmax function to obtain a probability distribution of bird features. The learned parameters of bird features were used to identify pictures uploaded by mobile users. The proposed CNN model with skip connections achieved a higher accuracy of 99.00% compared with 93.98% from a CNN and 89.00% from an SVM for the training images. For the test dataset, the average sensitivity, specificity, and accuracy were 93.79%, 96.11%, and 95.37%, respectively.
INDEX TERMS Bird image recognition, convolutional neural network, deep learning, mobile app.
Ultimately, information obtained from a bird image uploaded by an end-user, captured using a mobile camera, can be navigated through the client–server architecture to retrieve information and predict bird species from the trained model stored on the server. This process facilitates efficient correlation of fine-grained object parts and autonomous bird identification from captured images and can contribute considerable, valuable information regarding bird species.

The remainder of this paper is organized as follows. Section II briefly reviews related approaches for fine-grained visual categorization. Section III describes the various types of datasets used for feature extraction. Section IV focuses on the deep learning model and its features used in object part models, and describes the correlation between part localization and fine-grained feature extraction. Section IV also describes various correlation requirements, such as data augmentation, for excellent performance, localization, segmentation, and identification of subcategories, as well as the requirement of a classifier for effective object prediction. The experimental results and analysis of the datasets are presented in Section V. Section VI summarizes the discussion and limitations of the study. Conclusions and directions for future study are provided in Section VII.

Digital cameras in smartphones are the most pervasive tools for recognizing the salient features of physical objects, enabling users to detect and identify objects and share related knowledge. Birds present in a flock are often deeply colorful; therefore, identification at a glance is challenging for both birdwatchers and onlookers because of birds' ambiguous semantic features [19]. To address this problem, an information retrieval model for birdwatching has been proposed that uses deep neural networks to localize and clearly describe bird features with the aid of an Android smartphone [20], [21].

III. DATA ACQUISITION
Feature extraction is vital to the classification of relevant information and the differentiation of bird species. We combined bird data from the Internet of Birds (IoB) and an Internet bird dataset to learn the bird species.
D. FEATURE EXTRACTION
Extracting features from raw input images is the primary task when extracting relevant and descriptive information for fine-grained object recognition [36]–[38]. However, because of semantic and intraclass variance, feature extraction remains challenging. We separately extracted the features in relevant positions for each part of an image and subsequently learned the parts of the model features that were mapped directly to the corresponding parts. The features were calculated using ReLU 5 and ReLU 6. Localization was used to find object parts defined by bounding box coordinates and their dimensions (width and height) in the image [39]. For the localization task, an intersection over union (IoU) score >0.5 was set for our model.
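As a rough illustration of this criterion, the following minimal sketch computes the IoU of two boxes given in the (x, y, width, height) form used above; the helper name and the example boxes are illustrative assumptions, not values taken from the paper.

def iou(box_a, box_b):
    # Boxes are (x, y, width, height), matching the bounding-box definition above.
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap rectangle between the two boxes.
    ix = max(ax, bx)
    iy = max(ay, by)
    iw = max(0.0, min(ax + aw, bx + bw) - ix)
    ih = max(0.0, min(ay + ah, by + bh) - iy)
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# A predicted part box is accepted when its IoU with the ground truth exceeds 0.5.
print(iou((10, 10, 50, 40), (20, 15, 50, 40)) > 0.5)  # True (IoU is roughly 0.54)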
An FC layer with a ReLU was used to predict the location of the bounding box Bx. Subsequent steps of the learning algorithm were to learn the map of the feature vectors of the input image, decide whether the region fit an object class of interest, and then classify the expected output with the correct labels in the image. For a given image, feature vectors represent the probability of target object centrality in

E. SYSTEM IMPLEMENTATION
In this subsection, we explain how a high-resolution smartphone camera is used to identify and classify bird information [40] based on deep learning. To complete the semantic bird search task, we established a client–server architecture to bridge the communication gap between the cloud and the mobile device over a network. The entire setup was executed in the following manner:
• Raw bird images were distilled to remove irrelevant parts and learned by the CNN to yield parameters on the GPU platform. Subsequently, a TF inference model [41] was developed on the workstation for deployment in the smartphone.
• The output was detected using an Android app platform or through the web.
On the workstation/server side, the following segments were considered. The TF backend session model for object detection was prepared to save the TF computation graphs of input, output, weight, and bias as graph_def text files (tfdroid.pbtxt), which comprised the entire architecture of the model. The CNN architecture was trained to load the raw input data of bird images using Keras [42] callbacks with the predefined parameters into TF format to fit the model for inference. After training the model, the parameters of all
FIGURE 10. Performance comparison of the three models for the training dataset.
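To make the workstation-side export step of the system implementation concrete, the following is a minimal, hypothetical sketch assuming TensorFlow 1.x with its bundled Keras [41], [42]; the layer sizes, input resolution, checkpoint file name, and the checkpoint callback itself are illustrative assumptions rather than details taken from the paper, while the graph file name tfdroid.pbtxt follows the text above.

# Hypothetical sketch (TensorFlow 1.x era, tf.keras): train a small CNN with
# Keras callbacks, then write the session's graph_def as a text protobuf
# (tfdroid.pbtxt) that the Android TF inference build can load.
import tensorflow as tf
from tensorflow import keras

NUM_SPECIES = 27  # endemic species covered by the IoB app

model = keras.Sequential([
    keras.layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(NUM_SPECIES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# Keep the best weights seen during training (checkpointing is an assumption).
callbacks = [keras.callbacks.ModelCheckpoint("best_weights.h5",
                                             save_best_only=True)]
# model.fit(train_images, train_labels, epochs=10, callbacks=callbacks)

# Export the computation graph for the on-device inference step.
sess = keras.backend.get_session()  # TF 1.x session behind Keras
tf.train.write_graph(sess.graph_def, ".", "tfdroid.pbtxt", as_text=True)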
The proposed architecture encountered some limitations and has room for improvement. Sometimes the model confused the prediction of endemic birds when the uploaded bird images shared similar colors and sizes. If most bird species within a district need to be retrieved from the system, the database must be updated and the model retrained with new features of the birds. Extending the proposed system to specific districts for birdwatching may also encounter an imbalanced distribution of the dataset among the bird species if only a small dataset is available.

In the future, we intend to develop a method for predicting different generations of specific bird species within the intraclass and interclass variations of birds and to expand the bird species in our database so that more people can admire the beauty of watching birds.

VII. CONCLUSIONS
This study developed a mobile app platform that uses cloud-based deep learning for image processing to identify bird species from digital images uploaded by an end-user on a smartphone. The study dealt predominantly with recognition of 27 Taiwan endemic bird species. The proposed system could detect and differentiate uploaded images as birds with an overall accuracy of 98.70% for the training dataset. This study ultimately aimed to design an automatic system for differentiating fine-grained objects among bird images with shared fundamental characteristics but minor variations in appearance.

In the future, we intend to develop a method for predicting different generations of specific bird species within the intraclass and interclass variations of birds and to add more bird species to our database.
REFERENCES
[1] D. T. C. Cox and K. J. Gaston, "Likeability of garden birds: Importance of species knowledge & richness in connecting people to nature," PLoS ONE, vol. 10, no. 11, Nov. 2015, Art. no. e0141505.
[2] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[3] H. Yao, S. Zhang, Y. Zhang, J. Li, and Q. Tian, "Coarse-to-fine description for fine-grained visual categorization," IEEE Trans. Image Process., vol. 25, no. 10, pp. 4858–4872, Oct. 2016.
[4] F. Garcia, J. Cervantes, A. Lopez, and M. Alvarado, "Fruit classification by extracting color chromaticity, shape and texture features: Towards an application for supermarkets," IEEE Latin Amer. Trans., vol. 14, no. 7, pp. 3434–3443, Jul. 2016.
[5] L. Zhu, J. Shen, H. Jin, L. Xie, and R. Zheng, "Landmark classification with hierarchical multi-modal exemplar feature," IEEE Trans. Multimedia, vol. 17, no. 7, pp. 981–993, Jul. 2015.
[6] X. Liang, L. Lin, W. Yang, P. Luo, J. Huang, and S. Yan, "Clothes co-parsing via joint image segmentation and labeling with application to clothing retrieval," IEEE Trans. Multimedia, vol. 18, no. 6, pp. 1175–1186, Jun. 2016.
[7] Y.-P. Huang, L. Sithole, and T.-T. Lee, "Structure from motion technique for scene detection using autonomous drone navigation," IEEE Trans. Syst., Man, Cybern., Syst., to be published.
[8] C. McCool, I. Sa, F. Dayoub, C. Lehnert, T. Perez, and B. Upcroft, "Visual detection of occluded crop: For automated harvesting," in Proc. Int. Conf. Robot. Autom. (ICRA), Stockholm, Sweden, May 2016, pp. 2506–2512.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. 25th Int. Conf. Adv. Neural Inf. Process. Syst., Lake Tahoe, NV, USA, Dec. 2012, pp. 1097–1105.
[10] B. Zhao, J. Feng, X. Wu, and S. Yan, "A survey on deep learning-based fine-grained object classification and semantic segmentation," Int. J. Automat. Comput., vol. 14, no. 2, pp. 119–135, Apr. 2017.
[11] H. Yang, J. T. Zhou, Y. Zhang, B.-B. Gao, J. Wu, and J. Cai, "Exploit bounding box annotations for multi-label object recognition," in Proc. Int. Conf. Comput. Vis. Pattern Recognit., Las Vegas, NV, USA, Jun. 2016, pp. 280–288.
[12] L. Liu, W. Ouyang, X. Wang, P. Fieguth, X. Liu, and M. Pietikäinen, "Deep learning for generic object detection: A survey," Sep. 2018, arXiv:1809.02165. [Online]. Available: https://arxiv.org/abs/1809.02165
[13] K. Dhindsa, K. D. Gauder, K. A. Marszalek, B. Terpou, and S. Becker, "Progressive thresholding: Shaping and specificity in automated neurofeedback training," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 26, no. 12, pp. 2297–2305, Dec. 2018.
[14] C.-Y. Lee, A. Bhardwaj, W. Di, V. Jagadeesh, and R. Piramuthu, "Region-based discriminative feature pooling for scene text recognition," in Proc. Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 4050–4057.
[15] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, "Simultaneous detection and segmentation," in Proc. Eur. Conf. Comput. Vis., Jul. 2014, pp. 297–312.
[16] S. Branson, G. V. Horn, S. Belongie, and P. Perona, "Bird species categorization using pose normalized deep convolutional nets," in Proc. Brit. Mach. Vis. Conf., Nottingham, U.K., Jun. 2014, pp. 1–14.
[17] B. Yao, A. Khosla, and L. Fei-Fei, "Combining randomization and discrimination for fine-grained image categorization," in Proc. CVPR, Colorado Springs, CO, USA, Jun. 2011, pp. 1577–1584.
[18] Y.-B. Lin, Y.-W. Lin, C.-M. Huang, C.-Y. Chih, and P. Lin, "IoTtalk: A management platform for reconfigurable sensor devices," IEEE Internet Things J., vol. 4, no. 5, pp. 1552–1562, Oct. 2017.
[19] X. Zhang, H. Xiong, W. Zhou, and Q. Tian, "Fused one-vs-all features with semantic alignments for fine-grained visual categorization," IEEE Trans. Image Process., vol. 25, no. 2, pp. 878–892, Feb. 2016.
[20] Google Android Developer. RenderScript API Guides. Accessed: Nov. 20, 2018. [Online]. Available: https://developer.android.com/guide/topics/renderscript/compute.html
[21] M. Z. Andrew and G. Howard. (Jun. 14, 2017). MobileNets: Open-Source Models for Efficient On-Device Vision. Accessed: Feb. 20, 2018. [Online]. Available: https://research.googleblog.com/2017/06/mobilenets-open-source-models-for.html
[22] C. Koylu, C. Zhao, and W. Shao, "Deep neural networks and kernel density estimation for detecting human activity patterns from geo-tagged images: A case study of birdwatching on Flickr," Int. J. Geo-Inf., vol. 8, no. 1, p. 45, Jan. 2019.
[23] R. Richa, R. Linhares, E. Comunello, A. von Wangenheim, J.-Y. Schnitzler, B. Wassmer, C. Guillemot, G. Thuret, P. Gain, G. Hager, and R. Taylor, "Fundus image mosaicking for information augmentation in computer-assisted slit-lamp imaging," IEEE Trans. Med. Imag., vol. 33, no. 6, pp. 1304–1312, Jun. 2014.
[24] F. Tu, S. Yin, P. Ouyang, S. Tang, L. Liu, and S. Wei, "Deep convolutional neural network architecture with reconfigurable computation patterns," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 8, pp. 2220–2233, Aug. 2017.
[25] P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang, and P. O. Ogunbona, "Action recognition from depth maps using deep convolutional neural networks," IEEE Trans. Human–Mach. Syst., vol. 46, no. 4, pp. 498–509, Aug. 2016.
[26] B. Du, W. Xiong, J. Wu, L. Zhang, L. Zhang, and D. Tao, "Stacked convolutional denoising auto-encoders for feature representation," IEEE Trans. Cybern., vol. 47, no. 4, pp. 1017–1027, Apr. 2017.
[27] L. Yang, A. M. MacEachren, P. Mitra, and T. Onorati, "Visually-enabled active deep learning for (geo) text and image classification: A review," Int. J. Geo-Inf., vol. 7, no. 2, p. 65, Feb. 2018.
[28] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng, "Large scale distributed deep networks," in Proc. 25th Int. Conf. Adv. Neural Inf. Process. Syst., Lake Tahoe, NV, USA, Dec. 2012, pp. 1223–1231.
[29] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks," in Proc. Int. Conf. Adv. Neural Inf. Process. Syst., Dec. 2014, pp. 3320–3328.
[30] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 770–778.
[31] Nvidia. GeForce GTX 1080 Ti. Accessed: Jan. 25, 2018. [Online]. Available: https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/
[32] Intel. Intel Xeon Phi Product Family. Accessed: Jan. 2, 2018. [Online]. Available: https://www.intel.com/content/www/us/en/products/processors/xeon-phi.html?cid=sem43700027892951748&intel_term=intel+xeon&gclid=CjwKCAiAxuTQBRBmEiwAAkFF1ohAPPZlb3pEhujFDN_w9cgzqp4lPeGrui6WbsXSyW3rApIspzkhKhoCbu4QAvD_BwE&gclsrc=aw.ds
[33] Nvidia. GPUs Are Driving Energy Efficiency Across the Computing Industry, From Phones to Supercomputers. Accessed: Nov. 25, 2018. [Online]. Available: http://www.nvidia.com/object/gcr-energy-efficiency.html
[34] H. Yao, D. Zhang, J. Li, J. Zhou, S. Zhang, and Y. Zhang, "DSP: Discriminative spatial part modeling for fine-grained visual categorization," Image Vis. Comput., vol. 63, pp. 24–37, Jul. 2017.
[35] Stanford. CS231n.2017: Convolutional Neural Networks for Visual Recognition. Accessed: Oct. 25, 2017. [Online]. Available: http://cs231n.github.io/convolutional-networks/
[36] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, "Part-based R-CNNs for fine-grained category detection," in Proc. Eur. Conf. Comput. Vis., Cham, Switzerland, Jul. 2014, pp. 834–849.
[37] L. Xie, J. Wang, B. Zhang, and Q. Tian, "Fine-grained image search," IEEE Trans. Multimedia, vol. 17, no. 5, pp. 636–647, May 2015.
[38] C. Huang, Z. He, G. Cao, and W. Cao, "Task-driven progressive part localization for fine-grained object recognition," IEEE Trans. Multimedia, vol. 18, no. 12, pp. 2372–2383, Dec. 2016.
[39] D. Lin, X. Shen, C. Lu, and J. Jia, "Deep LAC: Deep localization, alignment and classification for fine-grained recognition," in Proc. Int. Conf. Comput. Vis. Pattern Recognit., Boston, MA, USA, Jun. 2015, pp. 1666–1674.
[40] Y.-P. Huang and T. Tsai, "A fuzzy semantic approach to retrieving bird information using handheld devices," IEEE Intell. Syst., vol. 20, no. 1, pp. 16–23, Jan./Feb. 2005.
[41] TensorFlow. Building TensorFlow on Android. Accessed: Sep. 20, 2017. [Online]. Available: https://www.tensorflow.org/mobile/android_build
[42] Keras. Keras: The Python Deep Learning Library. Accessed: Sep. 25, 2017. [Online]. Available: https://keras.io/
[43] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," Int. J. Comput. Vis., vol. 104, no. 2, pp. 154–171, Apr. 2013.
[44] H. Zheng, Y. Huang, H. Ling, Q. Zou, and H. Yang, "Accurate segmentation for infrared flying bird tracking," Chin. J. Electron., vol. 25, no. 4, pp. 625–631, Jul. 2016.
[45] Y. Wei, W. Xia, M. Lin, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan, "HCP: A flexible CNN framework for multi-label image classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 9, pp. 1901–1907, Sep. 2016.
[46] K.-J. Hsu, Y.-Y. Lin, and Y.-Y. Chuang, "Augmented multiple instance regression for inferring object contours in bounding boxes," IEEE Trans. Image Process., vol. 23, no. 4, pp. 1722–1736, Apr. 2014.

YO-PING HUANG (S'88–M'92–SM'04) received the Ph.D. degree in electrical engineering from Texas Tech University, Lubbock, TX, USA. He was a Professor and the Dean of Research and Development (2005–2007), the Dean of the College of Electrical Engineering and Computer Science (2002–2005), and the Department Chair (2000–2002) of Tatung University, Taipei. He is currently a Professor with the Department of Electrical Engineering, National Taipei University of Technology, Taipei, Taiwan, where he has served as the Secretary-General (2008–2011). His current research interests include fuzzy systems design and modeling, deep learning modeling, intelligent control, medical data mining, and rehabilitation systems design. He is an IET Fellow (2008) and an International Association of Grey System and Uncertain Analysis Fellow (2016). He serves as the President of the Taiwan Association of Systems Science and Engineering, a member of the IEEE SMCS BoG, the Chair of the IEEE SMCS Technical Committee on Intelligent Transportation Systems, and the Chair of the Taiwan SIGSPATIAL ACM Chapter. He was the Chair of the IEEE SMCS Taipei Chapter, the Chair of the IEEE CIS Taipei Chapter, and the CEO of the Joint Commission of Technological and Vocational College Admission Committee, Taiwan (2011–2015).

HAOBIJAM BASANTA received the M.C.A. degree from the University of Jamia Millia Islamia, New Delhi, India. He is currently pursuing the Ph.D. degree in electrical engineering and computer science with the National Taipei University of Technology, Taipei, Taiwan. His current research interests include the Internet of Things (IoT) for elderly healthcare systems, big data analytics, deep learning, and image processing.