

Received May 2, 2019, accepted May 19, 2019, date of publication May 22, 2019, date of current version June 4, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2918274

Bird Image Retrieval and Recognition Using a Deep Learning Platform

YO-PING HUANG1,2, (Senior Member, IEEE), AND HAOBIJAM BASANTA1
1Department of Electrical Engineering, National Taipei University of Technology, Taipei 10608, Taiwan
2Department of Computer Science and Information Engineering, National Taipei University, New Taipei City 23741, Taiwan

Corresponding author: Yo-Ping Huang (yphuang@ntut.edu.tw)


This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST108-2321-B-027-001 and
Grant MOST107-2221-E-027-113.

ABSTRACT Birdwatching is a common hobby, but identifying bird species requires the assistance of bird books. To provide birdwatchers with a handy tool to admire the beauty of birds, we developed a deep learning platform to assist users in recognizing 27 species of birds endemic to Taiwan using a mobile app named the Internet of Birds (IoB). Bird images were learned by a convolutional neural network (CNN) to localize prominent features in the images. First, we established and generated a bounded region of interest to refine the shapes and colors of the object granularities and subsequently balanced the distribution of bird species. Then, a skip connection method was used to linearly combine the outputs of the previous and current layers to improve feature extraction. Finally, we applied the softmax function to obtain a probability distribution of bird features. The learned parameters of bird features were used to identify pictures uploaded by mobile users. The proposed CNN model with skip connections achieved a higher accuracy of 99.00%, compared with 93.98% from a CNN without skip connections and 89.00% from an SVM, on the training images. On the test dataset, the average sensitivity, specificity, and accuracy were 93.79%, 96.11%, and 95.37%, respectively.

INDEX TERMS Bird image recognition, convolutional neural network, deep learning, mobile app.

I. INTRODUCTION
The everyday pace of life tends to be fast and frantic and involves extramural activities. Birdwatching is a recreational activity that can provide relaxation in daily life and promote resilience to face daily challenges. It can also offer health benefits and happiness derived from enjoying nature [1]. Numerous people visit bird sanctuaries to glance at the various bird species or to praise their elegant and beautiful feathers while barely recognizing the differences between bird species and their features. Understanding such differences between species can enhance our knowledge of exotic birds as well as their ecosystems and biodiversity. However, because of observer constraints such as location, distance, and equipment, identifying birds with the naked eye is based on basic characteristic features, and appropriate classification based on distinct features is often seen as tedious. In the past, computer vision [2], [3] and its subcategory of recognition, which use techniques such as machine learning, have been extensively researched to delineate the specific features of objects, including vegetables and fruits [4], landmarks [5], clothing [6], cars [7], plants [8], and birds [9], within a particular cluster of scenes. However, considerable room for improvement remains in the accuracy and feasibility of bird feature extraction techniques. Detection of object parts is challenging because of complex variations or similar subordinate categories and fringes of objects. Intraclass and interclass variation in the silhouettes and appearances of birds is difficult to identify correctly because certain features are shared among species.

To classify the aesthetics of birds in their natural habitats, this study developed a method using a convolutional neural network (CNN) to extract information from bird images captured previously or in real time by identifying local features. First, raw input data of myriad semantic parts of a bird were gathered and localized. Second, the feature vectors of each generic part were detected and filtered based on shape, size, and color. Third, a CNN model was trained with the bird pictures on a graphics processing unit (GPU) for feature vector extraction with consideration of the aforementioned characteristics, and subsequently the classified, trained data were stored on a server to identify a target object.

The associate editor coordinating the review of this manuscript and approving it for publication was Biju Issac.

Ultimately, information obtained from a bird image uploaded by an end-user, captured using a mobile camera, can be navigated through the client–server architecture to retrieve information and predict bird species from the trained model stored on the server. This process facilitates efficient correlation of fine-grained object parts and autonomous bird identification from captured images and can contribute considerable, valuable information regarding bird species.

The remainder of this paper is organized as follows. Section II briefly reviews related approaches for fine-grained visual categorization. Section III describes the various types of dataset used for feature extraction. Section IV focuses on the deep learning model and its features used in object part models, and describes the correlation between part localization and fine-grained feature extraction. Section IV also describes various related requirements, such as data augmentation, for excellent performance, localization, segmentation, and identification of subcategories, as well as the requirement of a classifier for effective object prediction. The experimental results and analysis of the datasets are presented in Section V. Section VI summarizes the discussion and limitations of the study. Conclusions and directions for future study are provided in Section VII.

II. RELATED WORK
Recently, some fine-grained visual categorization methods have been proposed for species identification, and they have become a promising approach within computer vision research, with applications in numerous domains [10]. Numerous fine-grained recognition datasets, such as ImageNet, ILSVRC, Caltech-256, and CUB 200, have trained models with a wide variety of data to extract global features such as colors, textures, and shapes from multilabel objects [11]. Many approaches have been applied for generic object recognition [12]. Some methods apply local part learning that uses deformable part models and region-CNN for object detection [13], generation of a bounding box, and selection of distinctive parts for image recognition. Some studies have focused on discriminative features based on the local traits of birds [14], [16]. Simultaneous detection and segmentation are used to localize score detections effectively [15]. Pose normalization and model ensembles [16] are also used to improve the performance of fine-grained detection by generating millions of keypoint pairs through fully convolutional search. Discriminative image patches and randomization techniques are integrated to distinguish classes of images and prevent overfitting [17]. The present work also approached the learning of discriminative image features using a CNN architecture for fine-grained recognition. However, a complementary approach using domain knowledge of general bird features was integrated to provide detailed information about the predicted bird.

The advancement of consumer products, such as smartphones, digital cameras, and wearable gadgets [18], has transformed multidisciplinary approaches toward technology by connecting the physical and digital worlds. High-resolution digital cameras in smartphones are the most pervasive tools used for recognizing the salient features of physical objects, enabling users to detect and identify objects and share related knowledge. Birds present in a flock are often deeply colorful; therefore, identification at a glance is challenging for both birdwatchers and onlookers because of birds' ambiguous semantic features [19]. To address this problem, an information retrieval model for birdwatching has been proposed that uses deep neural networks to localize and clearly describe bird features with the aid of an Android smartphone [20], [21].

III. DATA ACQUISITION
Feature extraction is vital to the classification of relevant information and the differentiation of bird species. We combined bird data from the Internet of Birds (IoB) and an Internet bird dataset to learn the bird species.

FIGURE 1. IoB interface.

A. IOB
The IoB is a crowdsourced metasearch-engine database specifically for birds, where any individual can store bird images and instantly retrieve information about the birds therein. Uploaded bird images are identified from extracted features. This platform encourages individuals to become involved in birdwatching and to enrich their knowledge of various bird species. The IoB is available online for free (with keyword: Who Cares? Keep Walking). Fig. 1 shows the app interface. Because a fall detection module is embedded in the system, the app also serves as a wellness platform to assist individuals in staying safe while birdwatching. In addition, the system can track the distance individuals cover from their daily physical strides using a pedometer to promote fitness and motivate users to walk while birdwatching [22].

B. INTERNET BIRD IMAGES
A pool of images is required for deep learning of subcategorization. Bird images containing 27 bird species endemic to Taiwan on various backgrounds were compiled from the IoB and several other online resources. The use of public-domain images has benefits and drawbacks.

Although Internet image sources add diversity to the dataset, the images may be contaminated with noise, harshness, spurious pixels, and blurred parts, all of which degrade image quality. Therefore, to limit the intensity of deformity in an assortment of images, high-pixel images with clear boundaries were used. Finally, to obtain a standardized balance in the dataset, the bird species images were transformed and augmented as follows [23] (a code sketch follows the list):
• Random flipping: Images were horizontally and vertically flipped.
• Rotation: Images were randomly rotated (maximum angle of 25°) for training.
• Translation: Images were randomly shifted by −10 to 10 pixels.
• Zero-phase component analysis whitening: Dimensionality and redundancy in the matrix of pixel images were decreased.
• Gaussian filtering: Images were blurred for effective smoothing of noise.
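As a concrete illustration, the following is a minimal sketch of such an augmentation pipeline, assuming the Keras ImageDataGenerator API from the paper's own toolchain; the blur sigma is an assumed value, since the paper does not specify it.

```python
# Sketch of the augmentation list above using Keras' ImageDataGenerator.
# Gaussian blurring is not built in, so it is supplied through
# preprocessing_function; sigma=(1, 1, 0) is an assumed setting that
# blurs spatially but not across color channels.
from scipy.ndimage import gaussian_filter
from keras.preprocessing.image import ImageDataGenerator

def gaussian_blur(image):
    # Smooth noise in each color channel of one image (H, W, C).
    return gaussian_filter(image, sigma=(1, 1, 0))

datagen = ImageDataGenerator(
    horizontal_flip=True,          # random horizontal flipping
    vertical_flip=True,            # random vertical flipping
    rotation_range=25,             # rotate by up to 25 degrees
    width_shift_range=10,          # shift by up to +/-10 pixels
    height_shift_range=10,
    zca_whitening=True,            # decorrelate pixels, reduce redundancy
    preprocessing_function=gaussian_blur,
)

# ZCA whitening needs dataset statistics before generating batches:
# datagen.fit(x_train)   # x_train: array of shape (N, 112, 112, 3)
# batches = datagen.flow(x_train, y_train, batch_size=32)
```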
In deep learning algorithms, feature extraction is a generalization step to differentiate the learning categories of input data patterns.

Object recognition with a high-level feature extraction architecture comprises the following steps: (1) data content analysis, in which all generic raw data are preprocessed to extract nonlinear transformations and to fit the parameters into a machine learning model for feature extraction; (2) optimal probabilities of relevant structural information from each tuned parameter are combined into a new array of classifiers; and (3) a prediction is made based on the trained and learned parameters. To extract multiple feature levels from raw data and evaluate the performance of the CNN on the dataset [24]–[26], the dataset was split into the three modules discussed as follows. (1) The training dataset comprised raw data samples that were incorporated into the training model to determine specific feature parameters, perform correlational tasks, and create a related classification model. (2) The validation dataset was used to tune the hyperparameters of the trained model to minimize overfitting and validate performance. The model applies early stopping as regularization to prevent overfitting and to enhance learning when the precision on the training dataset increases while the error on the validation dataset remains the same or decreases. (3) The test dataset was used to test the classifier parameters and assess the performance of the actual prediction of the network model. Once the features had been extracted from the raw data, the trained prediction model was deployed to classify new input images. Fig. 2 shows the module for extracting unique features of birds with the CNN and predicting the most likely labels for the input images. Table 1 provides a list of terms and related abbreviations commonly used in this study.

FIGURE 2. Feature extraction paradigm for bird images.

TABLE 1. List of terms and abbreviations.

IV. PROPOSED DEEP LEARNING MODEL
The emergence of deep learning [27] algorithms has made highly complex cognitive tasks feasible for computer vision and image recognition. Recently, deep learning models have become the most popular tool for big data analysis and artificial intelligence [28], outperforming traditional image classification algorithms, and they are currently being downscaled for feasible mobile implementation. The proposed deep learning model for bird image classification using the CNN framework is described as follows.

A. CNN ARCHITECTURE
The CNN configuration for bird identification utilized a stack of convolutional layers comprising an input layer, two FC layers, and one final output softmax layer. Each convolutional layer comprised (a) 5 × 5 convolution, (b) BN, (c) ReLU activation, and (d) pooling layers. This section explains how to construct an optimized CNN model and why the parameters and hyperparameters (the total number of convolutional layers, the size of the kernels for all convolutional layers, and the likelihood of retaining a node during dropout regularization) must be tuned to the dataset before training. A sketch of such a stack is given below.

B. SKIP CONNECTIONS
When images are learned, deep neural network models train a base network from scratch to identify associations of features and patterns in the target dataset. Features are transformed from general to specific by the last layer of the network to predict the outputs of newly imposed inputs. If first-layer features are general and last-layer features are specific, then a transition from general to specific must have occurred somewhere in the network [29]. To address and quantify the degree to which a particular layer is general or specific, we proposed adding skip connections [30] among corresponding convolutional layers, as shown in Fig. 3. The skip layer connections should improve feature extraction through a weighted summation of corresponding layers as follows:

G(X) = (1 − α)F(X) + αX    (1)

where X is the input, F(X) is a function of input X, G(X) is a linear combination of F(X) and X, and α is a weight in the unit interval [0, 1]. To check specific layers, we used different weights. For example, if α > 0.5, then the result from the previous layer contributes less to overall performance than the layers preceding it. By contrast, if α < 0.5, then the result from the previous layer contributes more to the overall performance. Using these skip connections can facilitate network training by reducing memory usage and increasing performance by concatenating the feature maps of each convolution layer.

FIGURE 3. Framework of skip connections.
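A minimal sketch of the weighted skip connection of Eq. (1) in the Keras functional API; F(X) is taken here to be a single conv–BN–ReLU block, and the block assumes the input and output dimensions already match (the zero-padding shortcut for dimension changes is described in Section IV-C).

```python
# G(X) = (1 - alpha) * F(X) + alpha * X, per Eq. (1).
from keras.layers import Conv2D, BatchNormalization, Activation, Lambda, Add

def skip_block(x, filters, alpha=0.5):
    # F(X): one conv-BN-ReLU transformation of the input.
    f = Conv2D(filters, (3, 3), padding='same')(x)
    f = BatchNormalization()(f)
    f = Activation('relu')(f)
    # Weighted sum of the transformed path and the identity path.
    # Requires `filters` to equal the channel count of x so shapes agree.
    f_scaled = Lambda(lambda t: (1.0 - alpha) * t)(f)
    x_scaled = Lambda(lambda t: alpha * t)(x)
    return Add()([f_scaled, x_scaled])
```

Setting alpha = 0 reduces the block to a plain convolutional layer, while alpha = 1 passes the input through unchanged, matching the interpretation of α given above; alpha = 0.5 matches the value chosen in Section V.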
entropy loss.
C. TRAINING OF THE BIRD DATASET
Learning of bird species by the CNN was implemented on a GPU workstation with a 12-core Intel Xeon CPU, 32 GB of memory, and an Nvidia GeForce GTX 1080 Ti graphics card with 11 GB of memory on a TF platform [31]–[33]. During training, input color images with a fixed size of 112 × 112 pixels were fed into the CNN for feature extraction and bird image recognition. This study uses a dataset comprising 3563 images of 27 bird species. The dataset was split into 2280 images for training, 570 for validation, and 713 for testing. The input images passed through a hierarchical stack of convolutional layers to extract distinct features, such as color, shape, and edges, with varying orientations of the head, body, legs, and tail shown in the images. The first convolutional layer transformed the input image into pixels, propelled it to the next layer, and followed the feature extraction procedure until the input image had been precisely classified with a probability distribution. To capture the features of the input image, every convolutional filter had a kernel size of 3 × 3 pixels and an activation map that slid across the entire input volume. The stride was fixed at one by shifting the kernel one unit at a time to control the filter convolving around the input of the next pixel, so that the output volume would not shrink and the yield would be an integer rather than a fraction; that is, the output size is (i − k + 2q)/s + 1, where i is the input height or length, k is the filter size, q is the padding, and s is the stride. The padding was set to one around the input image to preserve the spatial resolution of the output feature map after convolution; that is, q = (k − 1)/2. Spatial pooling [34] was implemented to localize and separate the chunks of images with a 2 × 2 pixel window size, max pooling, and a stride of two, where the maximum pixel value in each chunk of an image was taken. The stack of convolutional layers was followed by an element-wise activation function, the ReLU, to maintain a constant volume throughout the network.

To implement the skip connection in the network, downsampling is performed by conv3 and conv4 with a stride of 2. We directly use the skip connection when the input and output have the same dimensions. When the dimensions of the output are increased, the shortcut performs identity mapping with extra zero-padding entries for the increased dimensions. Two FC layers were implemented with the same 4096-dimension configuration to learn by gradient descent, compute the target class scores in the training set for each image, and localize objects positioned anywhere in the input image. A schematic of the ConvNet architecture is presented in Fig. 4, and the parameter configuration for ConvNet is provided in Table 2.

After the FC layers were added, the n-way softmax activation function [35] was added; here, n is the number of bird categories. The softmax layer yields a probabilistic interpretation of multiple classes. Each label corresponds to the likelihood that the input images are correctly classified using vector-to-vector transformation, thereby minimizing the cross-entropy loss:

σ(x)_i = e^{x_i} / Σ_{j=1}^{K} e^{x_j}, for i = 1, . . . , K    (2)

where x_i is the ith element of the input vector x, Σ_i σ(x)_i = 1, and σ(x)_i > 0; that is, the outputs form a probability distribution over a set of outcomes.
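A short numerical sketch of Eq. (2) follows; subtracting the maximum logit before exponentiation is a standard stabilization that leaves the result unchanged, because the shift cancels in the ratio.

```python
# Numerically stable softmax, per Eq. (2); outputs are positive
# and sum to 1, forming a probability distribution over classes.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # guard against overflow for large logits
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # example class scores (logits)
probs = softmax(scores)
print(probs, probs.sum())            # approx. [0.659 0.242 0.099] 1.0
```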


FIGURE 4. CNN architecture for detecting bird images.

TABLE 2. ConvNet parameter configuration for the bird image detection system.

D. FEATURE EXTRACTION
Extracting features from raw input images is the primary task when extracting relevant and descriptive information for fine-grained object recognition [36]–[38]. However, because of semantic and intraclass variance, feature extraction remains challenging. We separately extracted the features in relevant positions for each part of an image and subsequently learned the parts of the model features that were mapped directly to the corresponding parts. The features were calculated using ReLU 5 and ReLU 6. Localization was used to find object parts defined by bounding box coordinates and their dimensions (width and height) in the image [39]. For the localization task, an intersection-over-union score of >0.5 was set for our model. An FC layer with a ReLU was used to predict the location of bounding box Bx. Subsequent steps of the learning algorithm were learning the map of the feature vectors of the input image, deciding whether the region fit an object class of interest, and then classifying the expected output with the correct labels in the image. For a given image, feature vectors represent the probability of target object centrality in the database, and the softmax classifier produces the probability scores for each label. Fig. 5 presents a raw input image, illustrating part selection and crucial feature identification. Multiclassification predicts a category label with the highest probability for the image.

FIGURE 5. Input raw data and feature illustration for a classifier.
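The intersection-over-union criterion used for localization can be sketched as follows; boxes are given as (x, y, width, height), the >0.5 threshold matches the setting above, and the example boxes are hypothetical.

```python
# Intersection over union (IoU) for two axis-aligned bounding boxes.
def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Intersection rectangle (clamped to zero if boxes do not overlap).
    ix = max(ax, bx)
    iy = max(ay, by)
    iw = max(0, min(ax + aw, bx + bw) - ix)
    ih = max(0, min(ay + ah, by + bh) - iy)
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# A predicted box is accepted when IoU with the ground truth exceeds 0.5.
accepted = iou((10, 10, 80, 60), (20, 15, 80, 60)) > 0.5   # True (IoU ~ 0.67)
```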

E. SYSTEM IMPLEMENTATION
In this subsection, we explain the use of a high-resolution smartphone camera to identify and classify bird information [40] based on deep learning. To complete the semantic bird search task, we established a client–server architecture to bridge the communication gap between the cloud and the mobile device over a network. The entire setup was executed in the following manner:
• Raw bird images were distilled to remove irrelevant parts and learned by the CNN to yield parameters on the GPU platform. Subsequently, a TF inference model [41] was developed on the workstation for deployment on the smartphone.
• The output was detected using an Android app platform or through the web.
On the workstation/server side, the following segments were considered. The TF backend session model for object detection was prepared to save the TF computation graphs of input, output, weight, and bias as graph_def text files (tfdroid.pbtxt), which comprised the entire architecture of the model. The CNN architecture was trained to load the raw input data of bird images using Keras [42] callbacks with the predefined parameters into TF format to fit the model for inference.


After training the model, the parameters of all saved session events of model progress in each epoch were saved as a TF checkpoint (.ckpt) file. To deploy the trained model on a smartphone, the graphs were frozen in TF format using Python. Before the trained model was frozen, a saver object was created for the session, and the checkpoints, model name, model path, and input–output parameter layers of the model were defined. All other explicit metadata assignments that were not necessary for the client–server inference, such as GPU directories on the graph nodes or graph paths, were removed. In this bird detection model, the output layer provides: (a) the parts of the input image containing a bird, (b) the type of bird species, and (c) the parts of the input image not containing a bird. Finally, the trained model was frozen by converting all variable parameters in the checkpoint file into constants (stops). Subsequently, both files were serialized into a single file as a ProtoBuf graph_def. The graph frozen as a ProtoBuf graph_def can be optimized, if required, for inference feasibility. The saved ProtoBuf graph_def was reloaded and resaved to a serialized string value. The following actions were considered when optimizing for inference (a sketch of the freezing step follows the list):
• Removal of redundant variables
• Stripping out of unused nodes
• Compression of multiple nodes and expressions into a distinct node
• Removal of debug operations, such as CheckNumerics, that are not necessary for inference
• Grouping of batch norm operations into precalculated weights
• Fusing of common operations into unified versions
• Reduction of model size through quantization and weight rounding
• Fixing of hardware assignment problems
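The checkpoint-to-ProtoBuf freezing step can be sketched with the TensorFlow 1.x graph utilities implied by the paper's toolchain; the file names and the output node name below are assumptions for illustration, not taken from the released code.

```python
# Freeze a trained checkpoint into a single ProtoBuf graph_def file.
import tensorflow as tf
from tensorflow.python.framework import graph_util

OUTPUT_NODE = 'softmax_output'   # assumed name of the final softmax layer

with tf.Session() as sess:
    # Recreate the graph and reload the trained variables.
    saver = tf.train.import_meta_graph('tfdroid.ckpt.meta')
    saver.restore(sess, 'tfdroid.ckpt')
    # Convert every variable reachable from the output node into a constant.
    frozen = graph_util.convert_variables_to_constants(
        sess, sess.graph_def, [OUTPUT_NODE])

# Serialize the frozen graph for deployment in the Android app.
with tf.gfile.GFile('frozen_tfdroid.pb', 'wb') as f:
    f.write(frozen.SerializeToString())
```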
Once the model was trained and saved for mobile inference on the workstation, we created an Android app to copy and configure the TF inference files. On the client/mobile side, the SDK written in Java and the NDK written in C++ were downloaded to create mobile interface activities and to communicate with the pretrained CNN TF ProtoBuf files that contained the model definition parameters and weights. The JNI was used to bridge the TF and Android platforms. The JNI executes the loadModel function and obtains predictions of an object from the TF ProtoBuf files using the Android NDK. After classifying the object in the pretrained model, the classified label output is sent back to the mobile phone using the Android NDK.

Using the aforementioned client–server computing setup, we provided a mechanism to encapsulate the cloud and mobile session. Bird recognition can be executed through cloud- and device-based inference. In the approach of deep learning inference on a mobile device, the trained model parameters are loaded into the mobile app, and the computations are completed locally on the device to predict the image output. The mobile phone, however, is constrained by memory size and inflexibility when updating the trained model. In the cloud-based deep learning model, by contrast, the trained model is stored on a remote server, and the server connects to the mobile device via the Internet using a web API to predict the uploaded images. Therefore, the learned architecture deployed with the cloud-based model can be easily ported to various platforms or mobile phones, and the model can be upscaled with new features without much difficulty. Because of the aforementioned benefits, cloud-based inference was used to execute bird image recognition. Fig. 6 shows the proposed system for bird information retrieval from the trained model stored on the workstation. The server with the TF platform takes prediction requests for bird images from client mobile phones and feeds the images sent from the API into the trained deep learning model. After an image has been predicted, the TF platform classifies and generates the probability distribution of the image and transmits the query image result back to the user's mobile phone with the classified label.

FIGURE 6. Client–server architecture for bird detection.

To analyze the uploaded images, we used a mobile phone as a client to perform the following functions: the end-user interface captures the bird image and instantly or directly uploads the image from the gallery of the mobile phone to extract image features. The mobile app sends an HTTPS request to the web server (central computer system) to retrieve the pretrained database entry for the uploaded bird image (a client-side sketch follows). The server performs data aggregation and an exhaustive search using the uploaded image [43] to determine the matching parameters and retrieve information related to the image. To optimize binary segmentation of the weighted graph of the image, GrabCut semantic foreground segmentation [44] is applied for bird species categorization. The head of a bird is the main prior-fitted region of interest [45]; the other parts of the bird are lower-priority regions of interest. A color model is projected to filter the original image with the bounding box [46]. Subsequently, the information is classified and mapped, and the correctness of the matched image is transmitted back to the user's mobile phone. The transmitted file contains metadata related to the bird's information, with the classified label indicating a bird species. Fig. 7 shows the interface steps of bird detection.
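For illustration, the client-side HTTPS request can be sketched in Python as follows (the production client is the Android app); the endpoint URL and the fields of the JSON reply are assumptions.

```python
# Upload a bird image to the inference server and read back the result.
import requests

def classify_bird(image_path, server='https://example.org/iob/predict'):
    with open(image_path, 'rb') as f:
        # POST the captured or gallery image over HTTPS.
        reply = requests.post(server, files={'image': f}, timeout=30)
    reply.raise_for_status()
    result = reply.json()
    # Assumed payload: classified label plus species metadata.
    return result['label'], result.get('metadata')

label, info = classify_bird('formosan_magpie.jpg')
print('Predicted species:', label)
```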
V. EXPERIMENTAL RESULTS AND ANALYSIS
Fig. 8 presents the 27 bird species endemic to Taiwan. The proposed system can predict and differentiate bird and nonbird images.

FIGURE 8. Bird species endemic to Taiwan.


FIGURE 7. App interface for determining bird species.

When nonbird images are uploaded from a user's smartphone, the system sends a notification to upload only bird images, as shown in Fig. 9. The hardware and software specifications for inference engine execution are summarized in Table 3.

FIGURE 9. App interface with negative images.

TABLE 3. Hardware/software specifications used to execute the object detection model.

To filter nonbird images uploaded to the system automatically and to validate the effectiveness of the proposed system, 100 bird images were uploaded from a mobile phone for preliminary testing. The model achieved 100% accuracy in classifying the images as true bird pictures. Table 4 shows the bird detection results.

TABLE 4. Prediction results of images uploaded from smartphones.

To acquire the output of images with or without birds, the multiscale sliding window strategy was applied so that the extracted subwindow could define the target object categories. The base learning rate was 0.01 and was subsequently shifted to 0.0001. The network was trained until the cross-entropy stabilized. Skip connections were implemented when the input and output layers had equal weights. For instance, when the dimensions of the output were increased, the weights were concatenated in a deeper layer to capture and reconstruct features more effectively in the next layer. We compared different α parameters to check their influence on the final model. Table 5 compares the performance of different α values in identifying whether a bird appears in an image. In these experiments, the 3563 images were split into sets of 80% for training and 20% for testing. The comparisons reveal that a high α increases the redundancy in the model; therefore, we set the α value to 0.5, which resulted in average accuracies of 100% and 99.7% for the training and test datasets, respectively.

TABLE 5. Performance comparisons of different α values. T1 = training and T2 = test.

In this study, we also compared the performance of three methods, namely a CNN with skip connections, a CNN without skip connections, and an SVM, on the endemic bird dataset. The performance comparison was run with a learning rate of 0.00001 and 100 epochs.


FIGURE 10. Performance comparison of the three models for the training dataset.

For the SVM, we used a linear kernel because of the high dimensionality of the feature space (with parameters gamma = 0.125 and cost C = 1). The models were implemented in Python 3.6.6 using the scikit-learn version 0.18.0 package. A comparison among the three models for the training dataset is shown in Fig. 10. The proposed CNN with skip connections achieved a higher accuracy of 99.00%, compared with 93.98% from the CNN without skip connections and 89.00% from the SVM. This validated the effectiveness of the presented model with skip connections.
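A sketch of this SVM baseline with the reported parameters follows; note that in scikit-learn the gamma parameter is inert for kernel='linear', so it is kept here only to mirror the reported settings, and synthetic features stand in for the real extracted features.

```python
# Linear-kernel SVM baseline with the reported parameters.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(3563, 512)               # stand-in feature vectors
y = rng.randint(0, 27, size=3563)     # stand-in labels for 27 species

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)   # 80/20 split, as in Section V

svm = SVC(kernel='linear', gamma=0.125, C=1)
svm.fit(X_train, y_train)
print('Test accuracy:', svm.score(X_test, y_test))
```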
Before measuring the efficiency on the test dataset, we randomly selected numbers of images, increasing from 10 to 100 images (with an increment of 10), from the test dataset to predict their highest and five highest accuracies. If the model's top guess matched the target image, then the image was listed as Top-1. Similarly, if the model predicted the target bird at least among its top five guesses, then the image was listed as Top-5. Results from 10-fold cross-validation showed that Top-1 accuracies were 91.20%–95.20% and Top-5 accuracies reached 93.00%–98.50% for test bird images.

To further test the performance of the proposed system in identifying individual bird species, the following metrics were considered:

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (3)
Sensitivity = TP / (TP + FN)    (4)
Specificity = TN / (TN + FP)    (5)

where TP, FP, TN, and FN represent the true positive, false positive, true negative, and false negative rates, respectively. Accuracy denotes the ratio of correctly detected bird images in the entire dataset, sensitivity indicates the ratio of correctly detected birds among the bird images, and specificity is the true negative rate of the images.
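Eqs. (3)–(5) can be computed directly from the confusion-matrix counts; the counts below are illustrative, not the paper's values.

```python
# Compute accuracy, sensitivity, and specificity from raw counts.
def evaluate(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # Eq. (3)
    sensitivity = tp / (tp + fn)                 # Eq. (4)
    specificity = tn / (tn + fp)                 # Eq. (5)
    return accuracy, sensitivity, specificity

acc, sens, spec = evaluate(tp=450, fp=20, tn=210, fn=33)   # example counts
print(f'accuracy={acc:.4f} sensitivity={sens:.4f} specificity={spec:.4f}')
```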
The performance evaluation of the bird images is shown in Table 6. The average sensitivity, specificity, and accuracy were 93.79%, 96.11%, and 95.37%, respectively.

TABLE 6. Performance evaluation of the classification of bird images.

VI. DISCUSSION
In this study, we developed an automatic model to classify the 27 endemic bird species of Taiwan using a CNN with skip connections. We performed an empirical study of the skip architecture. The intuition behind using skip connections is to provide uninterrupted gradient flow from the early layers to the later layers, which helps resolve the vanishing gradient problem. We compared the performance of various models, namely a CNN with skip connections, a CNN without skip connections, and an SVM. The CNN with skip connections outperformed the other two algorithms.

In this study, however, we focused on predicting the 27 bird species endemic to Taiwan efficiently and effectively. The proposed model can predict an uploaded image of a bird as a bird with 100% accuracy. However, because of the subtle visual similarities between and among the bird species, the model sometimes lacks interspecific comparisons among the bird species, which eventually leads to misclassification. On average, the test dataset yielded 93.79% sensitivity and 96.11% specificity, and this model can be used for the prediction and classification of the endemic bird images.


The proposed architecture encountered some limitations and has room for improvement in the future. Sometimes the model confused the prediction of endemic birds when the uploaded bird images shared similar colors and sizes. If most bird species within a district need to be retrieved from the system, the database must be updated and retrained with new features of the birds. Extending the proposed system to specific districts for birdwatching may also encounter an imbalanced distribution of the dataset among the bird species if only a small dataset is available.

In the future, we intend to develop a method for predicting different generations of specific bird species within the intraclass and interclass variations of birds and to expand the bird species in our database so that more people can admire the beauty of watching birds.

VII. CONCLUSIONS
This study developed a mobile app platform that uses cloud-based deep learning for image processing to identify bird species from digital images uploaded by an end-user on a smartphone. This study dealt predominantly with the recognition of 27 Taiwan endemic bird species. The proposed system could detect and differentiate uploaded images as birds with an overall accuracy of 98.70% for the training dataset. This study ultimately aimed to design an automatic system for differentiating fine-grained objects among bird images with shared fundamental characteristics but minor variations in appearance.

In the future, we intend to develop a method for predicting different generations of specific bird species within the intraclass and interclass variations of birds and to add more bird species to our database.

REFERENCES
[1] D. T. C. Cox and K. J. Gaston, "Likeability of garden birds: Importance of species knowledge & richness in connecting people to nature," PLoS ONE, vol. 10, no. 11, Nov. 2015, Art. no. e0141505.
[2] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[3] H. Yao, S. Zhang, Y. Zhang, J. Li, and Q. Tian, "Coarse-to-fine description for fine-grained visual categorization," IEEE Trans. Image Process., vol. 25, no. 10, pp. 4858–4872, Oct. 2016.
[4] F. Garcia, J. Cervantes, A. Lopez, and M. Alvarado, "Fruit classification by extracting color chromaticity, shape and texture features: Towards an application for supermarkets," IEEE Latin Amer. Trans., vol. 14, no. 7, pp. 3434–3443, Jul. 2016.
[5] L. Zhu, J. Shen, H. Jin, L. Xie, and R. Zheng, "Landmark classification with hierarchical multi-modal exemplar feature," IEEE Trans. Multimedia, vol. 17, no. 7, pp. 981–993, Jul. 2015.
[6] X. Liang, L. Lin, W. Yang, P. Luo, J. Huang, and S. Yan, "Clothes co-parsing via joint image segmentation and labeling with application to clothing retrieval," IEEE Trans. Multimedia, vol. 18, no. 6, pp. 1175–1186, Jun. 2016.
[7] Y.-P. Huang, L. Sithole, and T.-T. Lee, "Structure from motion technique for scene detection using autonomous drone navigation," IEEE Trans. Syst., Man, Cybern., Syst., to be published.
[8] C. McCool, I. Sa, F. Dayoub, C. Lehnert, T. Perez, and B. Upcroft, "Visual detection of occluded crop: For automated harvesting," in Proc. Int. Conf. Robot. Autom. (ICRA), Stockholm, Sweden, May 2016, pp. 2506–2512.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. 25th Int. Conf. Adv. Neural Inf. Process. Syst., Lake Tahoe, NV, USA, Dec. 2012, pp. 1097–1105.
[10] B. Zhao, J. Feng, X. Wu, and S. Yan, "A survey on deep learning-based fine-grained object classification and semantic segmentation," Int. J. Automat. Comput., vol. 14, no. 2, pp. 119–135, Apr. 2017.
[11] H. Yang, J. T. Zhou, Y. Zhang, B.-B. Gao, J. Wu, and J. Cai, "Exploit bounding box annotations for multi-label object recognition," in Proc. Int. Conf. Comput. Vis. Pattern Recognit., Las Vegas, NV, USA, Jun. 2016, pp. 280–288.
[12] L. Liu, W. Ouyang, X. Wang, P. Fieguth, X. Liu, and M. Pietikäinen, "Deep learning for generic object detection: A survey," Sep. 2018, arXiv:1809.02165. [Online]. Available: https://arxiv.org/abs/1809.02165
[13] K. Dhindsa, K. D. Gauder, K. A. Marszalek, B. Terpou, and S. Becker, "Progressive thresholding: Shaping and specificity in automated neurofeedback training," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 26, no. 12, pp. 2297–2305, Dec. 2018.
[14] C.-Y. Lee, A. Bhardwaj, W. Di, V. Jagadeesh, and R. Piramuthu, "Region-based discriminative feature pooling for scene text recognition," in Proc. Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 4050–4057.
[15] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, "Simultaneous detection and segmentation," in Proc. Eur. Conf. Comput. Vis., Jul. 2014, pp. 297–312.
[16] S. Branson, G. V. Horn, S. Belongie, and P. Perona, "Bird species categorization using pose normalized deep convolutional nets," in Proc. Brit. Mach. Vis. Conf., Nottingham, U.K., Jun. 2014, pp. 1–14.
[17] B. Yao, A. Khosla, and L. Fei-Fei, "Combining randomization and discrimination for fine-grained image categorization," in Proc. CVPR, Colorado Springs, CO, USA, Jun. 2011, pp. 1577–1584.
[18] Y.-B. Lin, Y.-W. Lin, C.-M. Huang, C.-Y. Chih, and P. Lin, "IoTtalk: A management platform for reconfigurable sensor devices," IEEE Internet Things J., vol. 4, no. 5, pp. 1552–1562, Oct. 2017.
[19] X. Zhang, H. Xiong, W. Zhou, and Q. Tian, "Fused one-vs-all features with semantic alignments for fine-grained visual categorization," IEEE Trans. Image Process., vol. 25, no. 2, pp. 878–892, Feb. 2016.
[20] Google Android Developer. RenderScript API Guides. Accessed: Nov. 20, 2018. [Online]. Available: https://developer.android.com/guide/topics/renderscript/compute.html
[21] M. Z. Andrew and G. Howard. (Jun. 14, 2017). MobileNets: Open-Source Models for Efficient On-Device Vision. Accessed: Feb. 20, 2018. [Online]. Available: https://research.googleblog.com/2017/06/mobilenets-open-source-models-for.html
[22] C. Koylu, C. Zhao, and W. Shao, "Deep neural networks and kernel density estimation for detecting human activity patterns from geo-tagged images: A case study of birdwatching on Flickr," Int. J. Geo-Inf., vol. 8, no. 1, p. 45, Jan. 2019.
[23] R. Richa, R. Linhares, E. Comunello, A. von Wangenheim, J.-Y. Schnitzler, B. Wassmer, C. Guillemot, G. Thuret, P. Gain, G. Hager, and R. Taylor, "Fundus image mosaicking for information augmentation in computer-assisted slit-lamp imaging," IEEE Trans. Med. Imag., vol. 33, no. 6, pp. 1304–1312, Jun. 2014.
[24] F. Tu, S. Yin, P. Ouyang, S. Tang, L. Liu, and S. Wei, "Deep convolutional neural network architecture with reconfigurable computation patterns," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 8, pp. 2220–2233, Aug. 2017.
[25] P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang, and P. O. Ogunbona, "Action recognition from depth maps using deep convolutional neural networks," IEEE Trans. Human–Mach. Syst., vol. 46, no. 4, pp. 498–509, Aug. 2016.
[26] B. Du, W. Xiong, J. Wu, L. Zhang, L. Zhang, and D. Tao, "Stacked convolutional denoising auto-encoders for feature representation," IEEE Trans. Cybern., vol. 47, no. 4, pp. 1017–1027, Apr. 2017.
[27] L. Yang, A. M. MacEachren, P. Mitra, and T. Onorati, "Visually-enabled active deep learning for (geo) text and image classification: A review," Int. J. Geo-Inf., vol. 7, no. 2, p. 65, Feb. 2018.
[28] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng, "Large scale distributed deep networks," in Proc. 25th Int. Conf. Adv. Neural Inf. Process. Syst., Lake Tahoe, NV, USA, Dec. 2012, pp. 1223–1231.
[29] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Proc. Int. Conf. Adv. Neural Inf. Process. Syst., Dec. 2014, pp. 3320–3328.


[30] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. Int. Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 770–778.
[31] Nvidia. GeForce GTX 1080 Ti. Accessed: Jan. 25, 2018. [Online]. Available: https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/
[32] Intel. Intel Xeon Phi Product Family. Accessed: Jan. 2, 2018. [Online]. Available: https://www.intel.com/content/www/us/en/products/processors/xeon-phi.html?cid=sem43700027892951748&intel_term=intel+xeon&gclid=CjwKCAiAxuTQBRBmEiwAAkFF1ohAPPZlb3pEhujFDN_w9cgzqp4lPeGrui6WbsXSyW3rApIspzkhKhoCbu4QAvD_BwE&gclsrc=aw.ds
[33] Nvidia. GPUs Are Driving Energy Efficiency Across the Computing Industry, From Phones to Supercomputers. Accessed: Nov. 25, 2018. [Online]. Available: http://www.nvidia.com/object/gcr-energy-efficiency.html
[34] H. Yao, D. Zhang, J. Li, J. Zhou, S. Zhang, and Y. Zhang, "DSP: Discriminative spatial part modeling for fine-grained visual categorization," Image Vis. Comput., vol. 63, pp. 24–37, Jul. 2017.
[35] Stanford. CS231n 2017: Convolutional Neural Networks for Visual Recognition. Accessed: Oct. 25, 2017. [Online]. Available: http://cs231n.github.io/convolutional-networks/
[36] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, "Part-based R-CNNs for fine-grained category detection," in Proc. Eur. Conf. Comput. Vis., Cham, Switzerland, Jul. 2014, pp. 834–849.
[37] L. Xie, J. Wang, B. Zhang, and Q. Tian, "Fine-grained image search," IEEE Trans. Multimedia, vol. 17, no. 5, pp. 636–647, May 2015.
[38] C. Huang, Z. He, G. Cao, and W. Cao, "Task-driven progressive part localization for fine-grained object recognition," IEEE Trans. Multimedia, vol. 18, no. 12, pp. 2372–2383, Dec. 2016.
[39] D. Lin, X. Shen, C. Lu, and J. Jia, "Deep LAC: Deep localization, alignment and classification for fine-grained recognition," in Proc. Int. Conf. Comput. Vis. Pattern Recognit., Boston, MA, USA, Jun. 2015, pp. 1666–1674.
[40] Y.-P. Huang and T. Tsai, "A fuzzy semantic approach to retrieving bird information using handheld devices," IEEE Intell. Syst., vol. 20, no. 1, pp. 16–23, Jan./Feb. 2005.
[41] TensorFlow. Building TensorFlow on Android. Accessed: Sep. 20, 2017. [Online]. Available: https://www.tensorflow.org/mobile/android_build
[42] Keras: The Python Deep Learning Library. Accessed: Sep. 25, 2017. [Online]. Available: https://keras.io/
[43] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, "Selective search for object recognition," Int. J. Comput. Vis., vol. 104, no. 2, pp. 154–171, Apr. 2013.
[44] H. Zheng, Y. Huang, H. Ling, Q. Zou, and H. Yang, "Accurate segmentation for infrared flying bird tracking," Chin. J. Electron., vol. 25, no. 4, pp. 625–631, Jul. 2016.
[45] Y. Wei, W. Xia, M. Lin, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan, "HCP: A flexible CNN framework for multi-label image classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 9, pp. 1901–1907, Sep. 2016.
[46] K.-J. Hsu, Y.-Y. Lin, and Y.-Y. Chuang, "Augmented multiple instance regression for inferring object contours in bounding boxes," IEEE Trans. Image Process., vol. 23, no. 4, pp. 1722–1736, Apr. 2014.

YO-PING HUANG (S'88–M'92–SM'04) received the Ph.D. degree in electrical engineering from Texas Tech University, Lubbock, TX, USA. He was a Professor and the Dean of Research and Development (2005–2007), the Dean of the College of Electrical Engineering and Computer Science (2002–2005), and the Department Chair (2000–2002) at Tatung University, Taipei. He is currently a Professor with the Department of Electrical Engineering, National Taipei University of Technology, Taipei, Taiwan, where he served as the Secretary-General (2008–2011). His current research interests include fuzzy systems design and modeling, deep learning modeling, intelligent control, medical data mining, and rehabilitation systems design. He is an IET Fellow (2008) and an International Association of Grey System and Uncertain Analysis Fellow (2016). He serves as the President of the Taiwan Association of Systems Science and Engineering, a member of the IEEE SMCS Board of Governors, the Chair of the IEEE SMCS Technical Committee on Intelligent Transportation Systems, and the Chair of the Taiwan SIGSPATIAL ACM Chapter. He was the Chair of the IEEE SMCS Taipei Chapter, the Chair of the IEEE CIS Taipei Chapter, and the CEO of the Joint Commission of Technological and Vocational College Admission Committee, Taiwan (2011–2015).

HAOBIJAM BASANTA received the M.C.A. degree from the University of Jamia Millia Islamia, New Delhi, India. He is currently pursuing the Ph.D. degree in electrical engineering and computer science with the National Taipei University of Technology, Taipei, Taiwan. His current research interests include the Internet of Things (IoT) for elderly healthcare systems, big data analytics, deep learning, and image processing.
