SSRN Id4107251
is based on model precision, time complexity, and computational complexity. The key contributions of this study are given below.
1) A detailed comparison of well-known object detection techniques for vehicle detection is presented. Among the pre-trained networks considered, Resnet architectures achieved the highest precision.
2) A performance comparison of SSD and YOLOv2 at different train-test ratios shows that YOLOv2 is robust with limited training samples, as is the case in the dataset used in this paper. Moreover, the precision of SSD degrades drastically as compared to that of YOLOv2 when training samples are reduced.
The rest of this paper is organized as follows. Section 2 presents the literature review on object detection and vehicle detection. It also describes a comparison of object detectors and pre-trained CNNs. Section 3 presents the proposed methodology. Experiments and results are presented in Section 4, and lastly conclusions are drawn.

II. LITERATURE REVIEW

In this section, we review the conventional and state-of-the-art object detection techniques and their applications in remote sensing.
2.1 Traditional Techniques
Traditional object detection methods can be broadly categorized into template-based, knowledge-based, object-based image analysis (OBIA) and machine learning-based methods [1]. Template matching has been deployed for object detection in the past [2]. Knowledge-based methods encode prior information such as ground rules and the object's shape and geometry; geometric information of the target objects has been widely used for object detection [3]. These methods do not always have a well-defined procedure for encoding prior knowledge, and loose criteria can lead to an unacceptable number of false positives in complex scenarios [2]. OBIA methods group similar pixels into segments prior to classification and exploit scale, shape and neighborhood information [4]. Machine learning-based methods make use of handcrafted features such as Haar [5] and Local Binary Patterns [6] together with classifiers such as SVM [7] and Adaboost [8]. Despite the good performance attained by these conventional practices in numerous computer vision applications, the performance of traditional approaches is limited for complex scenes in satellite image analysis [9].
2.2 Deep Learning-based Techniques
Despite the availability of several computer vision techniques in the literature, accurate and fast detection of vehicles and potential targets remains a challenging problem in remote sensing applications. In object detection problems, manual extraction of features from objects has many limitations. Contemporary deep learning approaches, which detect objects and are characterized by automatic feature extraction, perform well. LeNet, designed by LeCun et al. [10], is among the pioneering CNN architectures. AlexNet, designed by Krizhevsky et al. [11], drastically reduced the error in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 [12]. State-of-the-art CNNs have gained significant attention in many computer vision problems such as facial expression recognition [13], image classification [14], autonomous driving [15], medical analysis [16], computational forensics [17] and visual tracking [18]. Deep learning has also been employed in remote sensing and satellite image analysis for target detection [19]–[21] and aircraft detection [20], [22]. Deep neural networks provide a fast and accurate means to predict the location and size of objects in an image.
In recent years, region-based object detection techniques including RCNN, Fast RCNN, and Faster RCNN have gained much popularity in computer vision. These approaches divide the object detection framework into two stages. The first stage deals with the generation of region proposals which may include an object. The second stage performs classification of the objects proposed in the first stage and fine-tunes the bounding box coordinates. Faster RCNN is a popular region-based object detection technique in many applications. Single-stage methods simplify the object detection task by modelling detection as a regression problem. In comparison to region-based methods, regression-based methods are simpler and more efficient. YOLO is a common regression-based object detection method, which uses a single backbone CNN to directly predict bounding boxes and class probabilities from the image in one evaluation. It divides the image into grids and, for each grid cell, predicts bounding boxes with their class probabilities. Although YOLO has been shown to yield real-time object detection in many applications, fast object detectors with improved accuracy have remained a topic of interest to the research community [23].
In this research article, we analyze, by means of deep learning-based techniques, how to save power, which is one of the major problems all over the world. There is no existing system in which street lights glow with full intensity when vehicles approach, glow with less intensity once vehicles have passed through the area, and then turn off, as shown in Figures 1, 2 and 3 in sequence. Taking guidance from this study, we construct an effective street lighting system which saves power.
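The grid scheme that YOLO-style detectors use, as described in Section 2.2, can be sketched as follows. The function names and the 7x7 grid size are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of YOLO-style grid assignment: the image is divided
# into an S x S grid, and the cell containing a box's centre is
# responsible for predicting that object.

def responsible_cell(box, img_w, img_h, s=7):
    """Return the (row, col) grid cell that owns a box.

    box is (x_min, y_min, x_max, y_max) in pixels.
    """
    cx = (box[0] + box[2]) / 2.0  # box centre, x
    cy = (box[1] + box[3]) / 2.0  # box centre, y
    col = min(int(cx / img_w * s), s - 1)  # clamp to the last cell
    row = min(int(cy / img_h * s), s - 1)
    return row, col

# Example: a car centred at (320, 240) in a 640x480 image
# falls in the middle cell of a 7x7 grid.
print(responsible_cell((300, 220, 340, 260), 640, 480))  # (3, 3)
```

Each cell then predicts a fixed number of bounding boxes with class probabilities, which is what makes the whole detection a single regression pass.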
This Work is licensed under Creative Commons Attribution 4.0 International License.
Figure 1: An overview of the proposed methodology for smart light system using deep learning based object detection
Figure 2: Presence of objects (cars) makes the street light glow with full intensity
Figure 3: Presence of cars causes the lights to glow with full intensity; where there is no car, the light glows with less intensity and then turns off
detection and classification are unified and a single neural network is applied to the image. YOLO looks at an image and predicts the presence and position of the objects of interest. Although YOLO has achieved detection performance comparable to the region-based methods with reduced detection time, it still struggles with the detection of small-sized objects that appear in groups and with varying aspect ratios of objects in the image [26]. To overcome this problem, improved YOLO detectors have been proposed recently. In this paper, YOLOv2 has been used.
2) SSD: The Single Shot Multibox Detector (SSD) [27] is a single deep neural architecture used for object detection. SSD divides the image space into a grid of boxes at varying scales and for multiple aspect ratios. The network computes confidence scores for each object label for the bounding boxes. These bounding boxes are further refined to fit the shape of the object. In addition, SSD is known for unifying feature maps produced at varying sizes.
3.2 Backbone Networks
As discussed earlier, deep learning-based object detectors require a backbone CNN for feature extraction. We use and compare the following well-known pre-trained CNN backbone networks.
1) Alexnet: Alexnet [11], an eight-layer deep CNN, initiated the revolution in computer vision by winning the ImageNet ILSVRC 2012 competition by a large margin.
2) VGG: In 2014, Oxford's Visual Geometry Group (VGG) proposed the VGG model [28], a 16-19 layer deep CNN. It uses smaller convolutional filters compared to Alexnet and achieved better performance on the ImageNet database.
3) Resnet: He et al. [29] proposed the deep residual networks (Resnet), a new kind of convolutional neural network architecture that is considerably deeper (up to 152 layers) than those used before. ResNet eases the training of the network by using residual or skip connections. In 2015, ResNet won numerous computer vision competitions, including ImageNet detection and COCO detection.

IV. EXPERIMENTS AND RESULTS

4.1 Platform Specifications
The research is implemented on a personal computer running the Windows 10 operating system with an Intel i3 CPU at 3.3 GHz and an NVIDIA GTX 1050TI GPU. MATLAB 2019A is the main software tool. The object detectors and evaluators are implemented using pre-existing MATLAB tools.
4.2 Dataset Specifications
As shown in Figure 4, the proposed vehicle detection system is evaluated on a dataset that includes vehicle imagery with various orientations and placements. Along with the appearance of the vehicles in the images, the resolution of the images varies.
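The multi-scale, multi-aspect-ratio default-box tiling that SSD uses (Section 3.1) can be sketched as below. The scale value, aspect ratios, and feature-map size are illustrative assumptions, not the configuration used in this paper:

```python
from itertools import product

def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Generate SSD-style default boxes (cx, cy, w, h), normalised to
    [0, 1], for one square feature map. fmap_size is the map's
    width/height in cells; scale is the box size relative to the image."""
    boxes = []
    for i, j in product(range(fmap_size), repeat=2):
        cx = (j + 0.5) / fmap_size  # cell centre, x
        cy = (i + 0.5) / fmap_size  # cell centre, y
        for ar in aspect_ratios:
            # Stretch the square box of area scale^2 to aspect ratio ar.
            w = scale * ar ** 0.5
            h = scale / ar ** 0.5
            boxes.append((cx, cy, w, h))
    return boxes

# Coarser feature maps are paired with larger scales, so one network
# covers both small and large objects.
boxes = default_boxes(fmap_size=4, scale=0.3)
print(len(boxes))  # 4 * 4 * 3 = 48 boxes
```

At training time each ground-truth object is matched to the default boxes it overlaps best, and the network regresses offsets from those boxes rather than raw coordinates.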
4.3 Evaluation Metrics
To estimate and compare the accuracy of different object detectors, mean average precision (mAP) is used in this paper. In particular, mAP as defined in the Pascal VOC 2011 Challenge [30] is used. Therefore, in the first step, the Intersection over Union (IoU) between the ground truth and each bounding box detected with a confidence score greater than a threshold, α, is computed. Assume A represents the detected bounding box and B denotes the ground truth bounding box. The IoU is the ratio of the area of overlap between the ground truth and predicted bounding boxes to the area of their union, as shown in equation (1). If the estimated IoU is greater than a threshold, the detected object is considered a true positive (TP) detection; otherwise, it is a false positive (FP). In this paper, a threshold value of 0.5 was used to compute the results.
Denoting the actual number of vehicle instances in the image as N, precision and recall are defined in equations (2) and (3), respectively. As might be expected, selecting a smaller threshold α over the confidence score would improve recall but may result in reduced precision. This behavior can be observed in the precision-recall curves obtained by varying the threshold α, as shown in Figure 5. In general, mAP represents the area under the precision-recall curve obtained by varying the threshold α.
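The metrics described in Section 4.3 can be sketched directly from their textual definitions (IoU as overlap area over union area; precision as TP/(TP+FP); recall as TP/N). Equations (1)–(3) themselves are not reproduced here, so this is a sketch of the standard definitions the text describes, not a copy of the paper's formulas:

```python
def iou(a, b):
    """Intersection over Union of two boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def precision_recall(tp, fp, n):
    """Precision = TP / (TP + FP); recall = TP / N,
    with N the number of ground-truth vehicle instances."""
    return tp / (tp + fp), tp / n

# A detection counts as a true positive only when IoU exceeds the 0.5
# threshold; this half-overlapping pair falls below it.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50/150, about 0.33
```

Sweeping the confidence threshold α and recomputing these two quantities at each setting traces out the precision-recall curve shown in Figure 5.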
Figure 5: Performance of SSD and YOLOv2 architectures at varying train-test ratios
typically utilize the features at the end of the backbone architecture to detect objects. Going very deep can result in feature maps with very low resolution, due to which the object detection performance can deteriorate. Faster RCNN achieved the highest mAP of 0.80 using Efficientnet-B0 as the backbone feature extraction network and outperformed the rest of the detectors.
While evaluating different detectors and feature extractors, the computation time for object detection is also an important factor to be considered. Figure 6 shows the average detection time of each deep learning-based object detector and each backbone feature extractor for vehicle detection, computed on a single GPU for the given dataset.
Figure 6: Average detection time for SSD and YOLO architectures using various backbones
Results show that both YOLOv2 and YOLOv3 take similar time to detect vehicles, on average. Furthermore, these object detectors outperform region proposal-based object detectors by at least a factor of five. On the other end, RCNN turns out to be the slowest, with an average detection time of 47 seconds, due to its complex region proposal algorithm. Table 2 further shows a comparison of YOLOv3 and SSD in terms of accuracy and computational complexity for vehicle detection using Efficient Net-B0 as backbone. Although SSD provides faster detection, it does not perform as well in mAP.
Table 2: mAP values for the SSD and YOLOv2 object detectors using various feature extraction networks
Sr. No. Backbone/Architecture SSD YOLO
1 Alexnet 73.2 66.4
2 VGG-16 74.1 67.8
3 Resnet 18 76.4 70.1
4 Resnet-50 77.3 69.8
5 Inception v2 80.2 72.3
6 Efficient Net-B0 80.4 74.2
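As noted in Section 4.3, the mAP values reported in Table 2 correspond to the area under the precision-recall curve obtained by varying the threshold α. A minimal sketch of that integration is given below; the simple rectangle rule over recall increments is an illustrative simplification of the Pascal VOC interpolation, not the exact [30] procedure:

```python
def average_precision(precisions, recalls):
    """Approximate the area under a precision-recall curve using the
    rectangle rule over recall increments. Points must be sorted by
    increasing recall."""
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_r)  # precision weighted by recall gained
        prev_r = r
    return ap

# A perfect detector keeps precision 1.0 at every recall level, so the
# area under its precision-recall curve is 1.0.
print(average_precision([1.0, 1.0], [0.5, 1.0]))  # 1.0
```

Averaging this quantity over all object classes gives mAP; with a single vehicle class, as here, mAP equals the per-class average precision.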
Journal of Photogrammetry and Remote Sensing, 117, 11–28.
[2] Weber J & Lefèvre S. (2012). Spatial and spectral morphological template matching. Image and Vision Computing, 30(12), 934–945.
[3] Huertas A & Nevatia R. (1988). Detecting buildings in aerial images. Computer Vision, Graphics, and Image Processing, 41(2), 131–152.
[4] Blaschke T. (2010). Object based image analysis for remote sensing. ISPRS Journal of Photogrammetry and Remote Sensing, 65(1), 2–16.
[5] Lienhart R & Maydt J. (2002). An extended set of haar-like features for rapid object detection. In: Proceedings International Conference on Image Processing, 1, I–I.
[6] Cheng G, Han J, Guo L & Liu T. (2015). Learning coarse-to-fine sparselets for efficient object detection and scene classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1173–1181.
[7] Inglada J. (2007). Automatic recognition of man-made objects in high resolution optical remote sensing images by SVM classification of geometric image features. ISPRS Journal of Photogrammetry and Remote Sensing, 62(3), 236–248.
[8] Grabner H, Nguyen TT, Gruber B & Bischof H. (2008). On-line boosting-based car detection from aerial images. ISPRS Journal of Photogrammetry and Remote Sensing, 63(3), 382–396.
[9] Zou Z, Shi Z, Guo Y & Ye J. (2019). Object detection in 20 years: A survey. arXiv preprint arXiv:1905.05055.
[10] LeCun Y, Bottou L, Bengio Y & Haffner P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
[11] Krizhevsky A, Sutskever I & Hinton GE. (2012). ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105.
[12] Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC & Fei-Fei L. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3), 211–252.
[13] Jeong D, Kim BG & Dong SY. (2020). Deep joint spatiotemporal network (DJSTN) for efficient facial expression recognition. Sensors, 20(7), 1936.
[14] Morgan DA. (2015). Deep convolutional neural networks for ATR from SAR imagery. In: Algorithms for Synthetic Aperture Radar Imagery, 9475, pp. 94750F.
[15] Khan MJ, Khan HS, Yousaf A, Khurshid K & Abbas A. (2018). Modern trends in hyperspectral image analysis: A review. IEEE Access, 6, 14118–14129.
[16] Ahmad HM, Khan MJ, Yousaf A, Ghuffar S & Khurshid K. (2020). Deep learning: A breakthrough in medical imaging. Current Medical Imaging, 16(8), 946–956.
[17] Khan MJ, Khurshid K & Shafait F. (2019). A spatio-spectral hybrid convolutional architecture for hyperspectral document authentication. In: International Conference on Document Analysis and Recognition (ICDAR), pp. 1097–1102.
[18] Yuan Y, Chu J, Leng L, Miao J & Kim BJ. (2020). A scale-adaptive object-tracking algorithm with occlusion detection. EURASIP Journal on Image and Video Processing, 2020(1), 1–15.
[19] Ding P, Zhang Y, Deng WJ, Jia P & Kuijper A. (2018). A light and faster regional convolutional neural network for object detection in optical remote sensing images. ISPRS Journal of Photogrammetry and Remote Sensing, 141, 208–218.
[20] Zhang F, Du B, Zhang L & Xu M. (2016). Weakly supervised learning based on coupled convolutional neural networks for aircraft detection. IEEE Transactions on Geoscience and Remote Sensing, 54(9), 5553–5563.
[21] Long Y, Gong Y, Xiao Z & Liu Q. (2017). Accurate object localization in remote sensing images based on convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing, 55(5), 2486–2498.
[22] Khan MJ, Yousaf A, Javed N, Nadeem S & Khurshid K. (2017). Automatic target detection in satellite images using deep learning. Journal of Space Technology, 7(1), 44–49.
[23] Tan M, Pang R & Le QV. (2020). EfficientDet: Scalable and efficient object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10781–10790.
[24] Jiang Q, Cao L, Cheng M, Wang M & Li J. (2015). Deep neural networks-based vehicle detection in satellite images. In: International Symposium on Bioelectronics and Bioinformatics (ISBB), pp. 184–187.
[25] Redmon J, Divvala S, Girshick R & Farhadi A. (2016). You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
[26] Garg D, Goel P, Pandya S, Ganatra A & Kotecha K. (2019). A deep learning approach for face detection using YOLO. In: IEEE Punecon, pp. 1–4.
[27] Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY & Berg AC. (2016). SSD: Single shot multibox detector. In: European Conference on Computer Vision, pp. 21–37.
[28] Simonyan K & Zisserman A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[29] He K, Zhang X, Ren S & Sun J. (2016). Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[30] Everingham M, Van Gool L, Williams CKI, Winn J & Zisserman A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.
[31] Jin X & Han J. (2010). K-medoids clustering.