Zhang 2020

An End-to-end System for Pests and Diseases Identification

Ning Zhang, Zuochang Ye, Yan Wang
Institute of Microelectronics, Tsinghua University
zn17@mails.tsinghua.edu.cn, zuochang@mail.tsinghua.edu.cn, wangyan@mail.tsinghua.edu.cn

ABSTRACT

Traditional methods for identifying pests and diseases do not work well on massive high-resolution remote sensing image data. An efficient way is therefore needed to learn representations automatically from massive image data and to find the relationships among the data. This paper proposes an end-to-end system, based on deep learning, for pests and diseases identification in massive high-resolution remote sensing data. To achieve good performance, this hierarchical model jointly learns the parameters of a neural network and the cluster assignments of the features. Our network, named ClusterNet, iteratively groups the features with a standard clustering algorithm, k-means, and uses the subsequent assignments as supervision to update the weights of the network. Qualitatively, we only need to provide the remote sensing image of the target area, and the system automatically identifies pests and diseases. This is more accurate and convenient than the traditional method of manual detection. Quantitatively, the resulting model outperforms traditional convolutional neural networks on our pests and diseases remote sensing dataset.

CCS Concepts
• Information systems➝ Information systems applications
• Computing methodologies➝ Artificial intelligence

Keywords
Remote Sensing; Pests and Diseases Identification; Deep learning; Cluster

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
IVSP 2020, March 20–22, 2020, Singapore, Singapore
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7695-2/20/03…$15.00
DOI: https://doi.org/10.1145/3388818.3389155

1. INTRODUCTION
The continuous development of the economy and society has brought many global climate and environmental problems. The occurrence of diseases and the mutation of fungi and bacteria have affected people's lives. The incidence of pests and crop diseases keeps rising. It is therefore particularly important to study the identification and prevention of pests and diseases.

On the one hand, with the rapid development of high-resolution satellites [26], the volume of high-resolution remote-sensing image data has increased dramatically, which makes it possible to develop more intelligent remote-sensing image target interpretation systems. Research on big-data algorithms for detecting and identifying targets in remote sensing images has therefore become a trend. Traditional target detection and recognition methods are difficult to adapt to massive data, and the feature expressions they rely on, such as SIFT [37] and SURF [38], are designed manually. These methods are very time-consuming and depend strongly on the characteristics of the data. It is therefore necessary to find a method that can learn features automatically.

On the other hand, researchers have begun to apply deep learning to image recognition and have made great progress in recent years. Complex network structures and huge data samples are the defining features of deep learning, and the emergence of deep learning technology has provided strong technical support for handling huge data samples. Deep learning is therefore well suited to pests and diseases identification on remote sensing images.

In most computer vision applications of recent years [1, 2, 3, 4, 32], pre-trained convolutional neural networks have become the building blocks. They extract excellent general-purpose features that can be used to improve the generalization of models learned on a limited amount of data [5, 6]. The existence of ImageNet, a large fully-supervised dataset, has driven progress in pre-training convnets. However, [7] have recently provided empirical evidence that the performance of state-of-the-art classifiers on ImageNet is largely underestimated, and little error is left unresolved. This explains in part why performance has been saturating even though many novel architectures have been proposed in recent years [2, 8, 9]. In fact, ImageNet is relatively small by today's standards; it contains "only" one million images covering a specific domain of object classification. The natural way forward is to build a larger and more diverse dataset, which may contain billions of images. This, in turn, still requires a large amount of manual annotation, even though the community has accumulated crowdsourcing expertise over the years. Replacing labels with raw metadata can bias the visual representations, with unpredictable consequences [10]. This calls for methods that can jointly learn the parameters of a neural network and the cluster assignments of the features.

In this work, we propose an end-to-end system for pests and diseases identification in massive high-resolution remote sensing data. The architecture of our system is shown in Figure 1 and is introduced in detail in Section 3. The core of our system is our network called ClusterNet. We adopt a novel clustering approach for the large-scale end-to-end training of convnets. We show that it is possible to obtain useful general-purpose visual features with a clustering framework. ClusterNet alternates between clustering the image descriptors and updating the weights of the convnet by predicting the cluster assignments. For simplicity, we use k-means, but other clustering approaches can also be used, such as Power Iteration Clustering (PIC) [11]. Unlike other self-supervised methods [12, 13, 14], clustering has the advantage of requiring little domain knowledge and no specific signal from the inputs [15, 16]. Despite its simplicity, our approach achieves significantly higher performance than traditional convolutional neural networks on our pests and diseases remote sensing dataset. Finally, we probe the robustness of our framework with experiments.

2. RELATED WORK

2.1 Pest and Disease Identification
The traditional method of manual detection of pests and diseases depends entirely on the observation experience of farmers or on asking experts to visit the scene. These methods are slow, inefficient, costly, highly subjective and inaccurate. With the continuous development of the Internet, information technology has provided new methods and ideas for the identification of crop diseases and insect pests. Efficient image recognition technology can improve the efficiency of identification, reduce costs, and improve accuracy. To this end, experts and scholars have carried out a great deal of research, and deep learning has become the focus of that research. High-resolution remote-sensing image data has also made it possible to develop more intelligent pests and diseases identification systems.

2.2 Deep Convolutional Neural Networks
Starting from Yann LeCun's LeNet-5 [27], convolutional neural networks (CNNs) have typically consisted of stacked convolutional layers, usually followed by a normalization layer and a pooling layer, and then fully-connected layers. This design is common in image classification, object detection and image segmentation, and has made great progress on MNIST, CIFAR and large datasets such as ImageNet. To improve accuracy, the usual practice is to increase the number of layers (deeper) [31] and the layer size (wider) [29], and to use dropout [36] to address overfitting.

2.3 Image Classification
Image classification distinguishes different kinds of images according to their semantic information. It is an important basic problem in computer vision and the basis of other high-level visual tasks such as image detection, image segmentation, object tracking and behavior analysis. Convolutional neural networks (CNNs) have made remarkable achievements in the image field in recent years. Since the milestone work of AlexNet [28], ImageNet classification accuracy has been significantly improved by novel structures, including VGG [29], GoogLeNet [30], ResNet [31], DenseNet [2], ResNeXt [33], SE-Net [34], and automatic neural architecture search [35]. Each advance in network design corresponds to stronger feature extraction and better classification accuracy.

2.4 Self-Supervised Learning
"Self-supervised learning" [17] is an important form of unsupervised learning. It replaces human-annotated labels with "pseudo-labels" that are computed directly from the raw input data, thereby achieving "self-monitoring". The methods differ in how they select the "pseudo-labels". Some popular self-supervised methods are as follows. In [18], missing pixels are estimated from their surroundings; this method mainly uses the spatial cues of images. [19] train a network to spatially rearrange patches, while [20] use the prediction of the relative position of patches in an image as a pretext task. Besides spatial cues, many other signals have been explored, such as cross-channel prediction [21], image colorization [13, 22], sound [23] and instance counting [14]. More recently, combining multiple cues has become popular and has formed a new branch [24, 25]. However, all these methods are domain dependent and require expert knowledge to carefully design a pretext task that may lead to transferable features.

3. SYSTEM
The end-to-end system for pests and diseases identification can be divided into three modules: the RS-DP module, ClusterNet and the visualization module. The relationship between them is shown in Figure 1.

Figure 1. The System Process

The first step is to input the remote sensing image and a list of latitude and longitude coordinates of pests and diseases into the RS-DP module, which produces the dataset that is fed to the convolutional neural network. The second module, ClusterNet, is the core of our system and produces the predictions. The last module visualizes the predictions. The following sections (3.1 to 3.4) describe each module in detail.
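As a rough illustration of what the RS-DP step has to do, the sketch below converts a geographic coordinate to a pixel position and applies the crop size (100) and the 80*80 central-area rule given in Section 3.4. The north-up affine mapping, the top-left origin, the georeferencing values and all function and parameter names are assumptions made for illustration; they are not taken from the authors' implementation, and the check for white space in a crop is omitted.

```python
def geo_to_pixel(lon, lat, lon0, lat0, dlon, dlat):
    """Convert (lon, lat) to integer (col, row).

    Assumption: the image is north-up, pixel (0, 0) lies at (lon0, lat0),
    and each pixel spans dlon degrees eastwards and dlat degrees southwards.
    """
    col = int(round((lon - lon0) / dlon))
    row = int(round((lat0 - lat) / dlat))
    return col, row


def crop_box(col, row, size=100):
    """Bounding box (left, top, right, bottom) of a size x size centre crop."""
    half = size // 2
    return col - half, row - half, col + half, row + half


def is_normal_crop(center, pd_pixels, inner=80):
    """True if no pd point falls inside the inner x inner central area."""
    cx, cy = center
    half = inner // 2
    return not any(abs(px - cx) < half and abs(py - cy) < half
                   for px, py in pd_pixels)


# Hypothetical georeferencing values, chosen only to make the example concrete.
col, row = geo_to_pixel(111.5, 30.5, lon0=111.0, lat0=31.0, dlon=0.001, dlat=0.001)
box = crop_box(col, row)
```

A pd sample would be the crop returned by `crop_box` around a converted pd point, while a candidate normal sample around a random point is kept only when `is_normal_crop` returns True.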
3.1 Remote Sensing Data Preprocessing
The first module is the RS-DP module, which stands for remote sensing data preprocessing. The original input of our system is a remote sensing image and a list of latitude and longitude coordinates of pests and diseases. With the rapid development of high-resolution satellites, the volume of high-resolution remote-sensing image data has increased dramatically. Our input is a remote sensing image of Dangyang, a city in Hubei province, China. Its visualization is shown in Figure 2.

Figure 2. Remote sensing image

In this module, we convert the remote sensing image and the list of latitude and longitude coordinates of pests and diseases into the dataset that is fed to the convolutional neural network. The size of the example remote sensing image shown in Figure 2 (left) is 7945*7568, and the list of latitude and longitude coordinates of pests and diseases has 4571 entries. In the pest and disease areas of the high-resolution remote-sensing image there are only a few red pixels among the green pixels. The second input is therefore a list of isolated latitude and longitude coordinates, and we cannot cast the problem as object detection or semantic segmentation. In the original remote sensing image, a pest and disease area consists of only a few pixels surrounded by normal areas. If we modeled the task as a detection or segmentation problem, the whole predicted area would be labeled as pests and diseases, which contradicts the facts. We can therefore only model it as a classification problem. Implementation details are given in Section 3.4.

3.2 Preliminaries
After the RS-DP module, we obtain a dataset consisting of pd data and normal data. Modern approaches to image classification, or to any other computer vision task based on statistical learning, usually require good image featurization. In this context, a CNN is a popular choice for mapping raw images to a vector space of fixed dimensionality. In these problems, we denote by $f_w$ the CNN mapping, where $w$ is the set of parameters of the network. We apply this mapping to an image and call the resulting vector its feature. Given a training dataset of N images denoted as $X = \{x_1, \dots, x_N\}$, we need to find a parameter $w^*$ such that the mapping $f_{w^*}$ produces good general-purpose features.

Traditionally, the parameters are learned with supervision, i.e. each image $x_n$ is associated with a ground-truth label $y_n$, which is a one-hot vector. This label indicates which of the classes the image belongs to, and its dimension is the number of classes. The network with parameters $w$ acts as a classifier that predicts the correct labels on top of the features $f_w(x_n)$. We can therefore pose an optimization problem with a loss function:

[Equation (1) is missing from the source text.]

This problem can then be solved using gradient descent and backpropagation.

3.3 Image Classification by clustering
Our ClusterNet (see Figure 3) consists of three parts: input data, convnet and cluster module. This section introduces the principle of the cluster module. The input data and the convnet are described in Section 3.4.

Figure 3. ClusterNet Architecture

3.3.1 Cluster Module
When $w$ is initialized from a Gaussian distribution, without any learning, the mapping does not produce good features. However, according to [6], a multilayer perceptron classifier on top of the last convolutional layer of a random AlexNet [28] achieves 12% accuracy on ImageNet, while chance is at 0.1%. The good performance of random convnets is tied to their convolutional structure, which gives a strong prior on the input signal. This weak signal can be used to improve the discriminative ability of the network. Following this idea, we iteratively group the features with a standard clustering algorithm, k-means, and use the subsequent assignments as supervision to update the weights of the network. We use k-means because it is simple; moreover, preliminary results with other clustering algorithms indicate that this choice is not crucial. K-means takes a set of vectors as input, in our case the features produced by the last convolutional layer, and clusters them into k distinct groups based on the feature map. Solving this problem provides a set of optimal assignments $(y_n^*)_{n \le N}$. These assignments are then used as pseudo-labels. The specific loss function is given in Section 3.3.2.

3.3.2 Cluster Loss
According to Section 3.3.1, in addition to the final output prediction, we also obtain clustering results from the features produced by the last convolutional layer. We can therefore update Eq. (1) as follows:

[Equation (2) is missing from the source text.]

where K is the k-means algorithm, n is the number of channels of the last convolutional layer, and the remaining symbol (lost in the source text) denotes the network without its last fully-connected layer. Overall, ClusterNet alternates between clustering the features to produce pseudo-labels using Eq. (2) and updating the parameters of the convnet by predicting these pseudo-labels using Eq. (1) and Eq. (2) together. This type of alternating procedure is prone to trivial solutions; we describe how to avoid them in Section 3.3.3.
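The pseudo-labeling half of this alternation can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the features are synthetic, the deterministic farthest-point initialization is a choice made here so that the example is reproducible, the loss is the standard softmax cross-entropy rather than the paper's Eq. (2), and the function names are hypothetical.

```python
import numpy as np


def kmeans_pseudo_labels(features, k, n_iter=20):
    """Cluster feature vectors with Lloyd's k-means; return (labels, centroids)."""
    features = np.asarray(features, dtype=float)
    # Assumed initialisation: start from the first vector, then repeatedly
    # add the vector farthest from all centroids chosen so far.
    centroids = [features[0]]
    for _ in range(1, k):
        d = np.min([((features - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(features[d.argmax()])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every feature to every centroid.
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    return labels, centroids


def softmax_cross_entropy(logits, labels):
    """Mean negative log-softmax of the logit at each (pseudo-)label."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


# Synthetic stand-in for last-layer features: two well-separated groups.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0.0, 0.1, size=(50, 8)),
                      rng.normal(10.0, 0.1, size=(50, 8))])
pseudo_labels, _ = kmeans_pseudo_labels(features, k=2)
# Loss of an untrained classifier that outputs identical logits for every class.
loss = softmax_cross_entropy(np.zeros((100, 2)), pseudo_labels)
```

In a full training loop, the pseudo-labels would be recomputed from the current features at the start of each epoch, and the loss would be backpropagated through the convnet that produces the logits.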
3.3.3 Avoid Trivial Solutions
Trivial solutions are common to any method that jointly learns a discriminative classifier and the labels. Existing solutions usually differ in how they handle the minimal number of points per cluster [13, 24]. Here we briefly introduce the simple and scalable methods we use, which address the following two issues.

Trivial parametrization. If the majority of the training dataset is assigned to a few clusters, the parameters w will be learned to discriminate only between them. In the most dramatic scenario, where all but one cluster are singletons, minimizing Eq. (1) and Eq. (2) leads to a trivial parametrization in which ClusterNet predicts the same output regardless of the input. This issue also arises when the number of images per class is highly unbalanced, which is why we need to keep the dataset balanced. A strategy to circumvent this issue is to sample images based on a uniform distribution over the classes, or pseudo-labels. This is equivalent to weighting the contribution of an input to the loss function in Eq. (1) and Eq. (2) by the inverse of the size of its class or assigned cluster.

Empty clusters. This issue is solved by reassigning empty clusters during the k-means optimization. More precisely, when a cluster becomes empty, we randomly select a non-empty cluster and use its centroid as the new centroid for the empty cluster. We then reassign the points belonging to that non-empty cluster to the two resulting clusters.

3.4 Implementation Details
We call an area that contains pests and diseases a pd area, and any other area a normal area. We need to generate a dataset consisting of pd data and normal data according to the list of latitude and longitude coordinates of pests and diseases.

First of all, we need to convert geographic coordinates to pixel coordinates. The conversion uses the geographic coordinate of a reference point on the remote sensing image and the variation of latitude and longitude per pixel:

[Equation (3) is missing from the source text.]

It maps a point's geographic coordinate to the corresponding pixel coordinate. The list of latitude and longitude coordinates of pests and diseases after conversion is visualized in Figure 2 (right).

Figure 4. Generate pd dataset

Then, to generate the pd dataset, we take center crops around the pd points. The process is shown in Figure 4. The size of the small image is a hyperparameter, which is finally set to 100.

In contrast, to generate the normal dataset, we take center crops around random points on the remote sensing image. The size of the small image is the same as for the pd dataset. As shown in the four pictures on the right of Figure 5, if there is white space in the small image or a pd point in its 80*80 central area, the small image is discarded. The size 80 is also the result of hyperparameter optimization. Only if every pd point lies outside the 80*80 central area can the small image be counted as normal data.

Figure 5. Generate normal dataset

The dataset information for the example remote sensing image is shown in Table 1 below.

Table 1. Dataset information for the example image in Figure 1
Dataset            Number of train   Number of val
Pest and Disease   4000              463
Normal             4000              487

4. EXPERIMENT
Three parts of experiments are presented in this section. In the first, we conduct image classification experiments to verify the effectiveness of our end-to-end system and our ClusterNet. In the second, we demonstrate the effect of our cluster loss (or cluster module) and of the architecture of the convnet. Finally, we show the visualizations of the system. The base convnet of our ClusterNet is ResNet18. The datasets we use are the remote sensing images of Dangyang, MNIST, CIFAR-10 and ImageNet. All experiments are carried out using the PyTorch framework on NVIDIA TitanX GPUs.

4.1 Image Classification
Remote Sensing Dataset. The system's dataset has been introduced in Sections 3.1 and 3.4. We first compare ResNet18 and ClusterNet (ResNet18 + cluster module). The networks are trained with a batch size of 128 for 500 epochs; the learning rate is set to 0.1 and then divided by 10 at the […] epoch and the […] epoch respectively. We use a weight decay of […] and the SGD optimization algorithm. The network weights are initialized using the method introduced in [9]. The classification error rates are shown in Table 2. The first row gives the results obtained with the loss in Eq. (1). For the cluster loss in Eq. (2), we train ClusterNet ourselves; the result is in the second row. ClusterNet with the cluster loss outperforms the networks trained with the traditional loss, and its accuracy far exceeds the project requirements.

Table 2. Error rates (%) on r-s dataset
Loss function   ResNet18   ClusterNet
Softmax Loss    14.46      12.77
Cluster Loss               10.41

In addition, we have used other datasets to verify the effectiveness of the system: the MNIST handwritten digit dataset, and CIFAR-10, which consists of 32 × 32 pixel colored images, with 50,000 training images and 10,000 testing images. We adopt the standard data augmentation scheme including mirroring and 32 × 32 random
cropping after 4-pixel zero padding on each side. We also investigate the performance on large-scale image classification using the ImageNet dataset. Because this system is intended for pests and diseases identification on remote sensing images, only ten classes were randomly sampled from the ImageNet dataset to better match actual use. The experimental setup is the same as for the Remote Sensing Dataset. The classification results are shown in Table 3.

Table 3. Error rates (%) on MNIST, CIFAR-10, ImageNet
Dataset         ResNet18 + Softmax Loss   ClusterNet + Cluster Loss
MNIST           1.82                      1.37
CIFAR-10        8.75                      7.46
ImageNet (10)   9.19                      6.92

4.2 Ablation Studies
In this section, we compare the effect of the cluster module's dimension on the Remote Sensing dataset. The dimension of the cluster centroids is (x, y), where x is the number of classes and y depends on the cluster module's architecture. In other words, y is the number of channels of the last convolutional layer; we set y to 128, 256 and 512 here. The results are in Table 4.

Table 4. Error rates (%) with different cluster module
Loss function       Softmax Loss   Cluster Loss
ResNet18            14.46
ClusterNet(y=128)   16.93          14.67
ClusterNet(y=256)   12.77          10.41
ClusterNet(y=512)   13.58          13.29

We therefore choose the ClusterNet with the 2*256 cluster module in our system.

4.3 Visualizations
Using the Remote Sensing dataset, we visualized the predictions on the remote sensing image. The random sampling on the remote sensing image is visualized in Figure 6. All the red points are in the test dataset.

Figure 6. Random sampling on r-s image visualization

The final prediction result is shown in Figure 7. The red points are our ground truth, i.e. the visualization of the list of latitude and longitude coordinates of pests and diseases. The blue points are those predicted as pests and diseases areas when the points in Figure 6 are input to the end-to-end system. According to Sections 4.1 and 4.2, the accuracy of our system reaches 89.59% (an error rate of 10.41%), which is much higher than that of traditional methods. The system takes only seven minutes from data input to visualization output, which is much faster and more convenient than human experts. The system's recall rate verification result is shown in Figure 8. For pests and diseases identification, the recall rate is more important, and our system's recall rate is 95.47%. The yellow points in Figure 8 are the predicted results.

Figure 7. The end-to-end system's prediction result

Figure 8. Recall rate verification result

5. CONCLUSION
In this paper, we propose an end-to-end system for pests and diseases identification in massive high-resolution remote sensing data based on deep learning. To achieve good performance, the hierarchical model ClusterNet jointly learns the parameters of a neural network and the cluster assignments of the features. Extensive experiments demonstrate the effectiveness of the system. The system is much more accurate and convenient than the traditional method of manual detection. Our ClusterNet and cluster loss can also be used for learning deep representations and for image classification.
6. ACKNOWLEDGMENTS
This work is supported by the National Natural Science Foundation of China.

7. REFERENCES
[1] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision (pp. 1026-1034).
[2] Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4700-4708).
[3] Masci, J., Meier, U., Cireşan, D., & Schmidhuber, J. (2011, June). Stacked convolutional auto-encoders for hierarchical feature extraction. In International Conference on Artificial Neural Networks (pp. 52-59). Springer, Berlin, Heidelberg.
[4] Erhan, D., Bengio, Y., Courville, A., & Vincent, P. (2009). Visualizing higher-layer features of a deep network. University of Montreal, 1341(3), 1.
[5] Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., & Courville, A. (2016). Adversarially learned inference. arXiv preprint arXiv:1606.00704.
[6] Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning (Vol. 1, No. 10). New York: Springer series in statistics.
[7] Xu, L., Neufeld, J., Larson, B., & Schuurmans, D. (2005). Maximum margin clustering. In Advances in neural information processing systems (pp. 1537-1544).
[8] Yang, J., Parikh, D., & Batra, D. (2016). Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5147-5156).
[9] Lin, F., & Cohen, W. W. (2010). Power iteration clustering.
[10] Agrawal, P., Carreira, J., & Malik, J. (2015). Learning to see by moving. In Proceedings of the IEEE International Conference on Computer Vision (pp. 37-45).
[11] Malisiewicz, T., Gupta, A., & Efros, A. (2011). Ensemble of exemplar-svms for object detection and beyond.
[12] Turk, M. A., & Pentland, A. P. (1991, June). Face recognition using eigenfaces. In Proceedings. 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 586-591). IEEE.
[13] Larsson, G., Maire, M., & Shakhnarovich, G. (2016, October). Learning representations for automatic colorization. In European Conference on Computer Vision (pp. 577-593). Springer, Cham.
[14] Noroozi, M., Pirsiavash, H., & Favaro, P. (2017). Representation learning by learning to count. In Proceedings of the IEEE International Conference on Computer Vision (pp. 5898-5906).
[15] Van De Sande, K., Gevers, T., & Snoek, C. (2009). Evaluating color descriptors for object and scene recognition. IEEE transactions on pattern analysis and machine intelligence, 32(9), 1582-1596.
[16] Philbin, J., Chum, O., Isard, M., Sivic, J., & Zisserman, A. (2007, June). Object retrieval with large vocabularies and fast spatial matching. In 2007 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-8). IEEE.
[17] de Sa, V. R. (1994). Learning classification with unlabeled data. In Advances in neural information processing systems (pp. 112-119).
[18] Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., & Efros, A. A. (2016). Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2536-2544).
[19] Noroozi, M., & Favaro, P. (2016, October). Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision (pp. 69-84). Springer, Cham.
[20] Doersch, C., Gupta, A., & Efros, A. A. (2015). Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1422-1430).
[21] Zhang, R., Isola, P., & Efros, A. A. (2017). Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1058-1067).
[22] Zhang, R., Isola, P., & Efros, A. A. (2016, October). Colorful image colorization. In European conference on computer vision (pp. 649-666). Springer, Cham.
[23] Owens, A., Wu, J., McDermott, J. H., Freeman, W. T., & Torralba, A. (2016, October). Ambient sound provides supervision for visual learning. In European conference on computer vision (pp. 801-816). Springer, Cham.
[24] Wang, X., He, K., & Gupta, A. (2017). Transitive invariance for self-supervised visual representation learning. In Proceedings of the IEEE international conference on computer vision (pp. 1329-1338).
[25] Doersch, C., & Zisserman, A. (2017). Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2051-2060).
[26] Tao, C., Tan, Y., Cai, H. J., Du, B., & Tian, J. W. (2010). Object-oriented method of hierarchical urban building extraction from high-resolution remote-sensing imagery. Acta Geodaetica et Cartographica Sinica, 39(1), 39-45.
[27] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
[28] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).
[29] Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[30] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9).
[31] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
[32] Zhang, X., Zhou, X., Lin, M., & Sun, J. (2018). Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6848-6856).
[33] Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1492-1500).
[34] Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7132-7141).
[35] Zoph, B., & Le, Q. V. (2016). Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
[36] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1), 1929-1958.
[37] Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2), 91-110.
[38] Bay, H., Tuytelaars, T., & Van Gool, L. (2006, May). Surf: Speeded up robust features. In European conference on computer vision (pp. 404-417). Springer, Berlin, Heidelberg.
[39] Wang, X., He, K., & Gupta, A. (2017). Transitive invariance for self-supervised visual representation learning. In Proceedings of the IEEE international conference on computer vision (pp. 1329-1338).
