
sensors

Review
Deep Learning Techniques for Vehicle Detection and
Classification from Images/Videos: A Survey
Michael Abebe Berwo 1, Asad Khan 2,*, Yong Fang 1, Hamza Fahim 3,*, Shumaila Javaid 3,
Jabar Mahmood 1, Zain Ul Abideen 4 and Syam M.S. 5

1 School of Information and Engineering, Chang’an University, Xi’an 710064, China;


2019024902@chd.edu.cn (M.A.B.); fy@chd.edu.cn (Y.F.); 2019024906@chd.edu.cn (J.M.)
2 School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou 510006, China
3 School of Electronics and Information, Tongji University, Shanghai 200070, China; shumaila@tongji.edu.cn
4 Research Institute of Automotive Engineering, Jiangsu University, Zhenjiang 212013, China;
1000006198@ujs.edu.cn
5 IOT Research Center, Shenzhen University, Shenzhen 518060, China; syamms@email.szu.edu.cn
* Correspondence: asad@gzhu.edu.cn (A.K.); hamzafahim@tongji.edu.cn (H.F.)

Abstract: Detecting and classifying vehicles as objects from images and videos is challenging in
appearance-based representation, yet plays a significant role in the substantial real-time applications
of Intelligent Transportation Systems (ITSs). The rapid development of Deep Learning (DL) has
resulted in the computer-vision community demanding efficient, robust, and outstanding services to
be built in various fields. This paper covers a wide range of vehicle detection and classification ap-
proaches and the application of these in estimating traffic density, real-time targets, toll management
and other areas using DL architectures. Moreover, the paper also presents a detailed analysis of DL
techniques, benchmark datasets, and preliminaries. A survey of some vital detection and classifica-
tion applications, namely, vehicle detection and classification and performance, is conducted, with a
detailed investigation of the challenges faced. The paper also addresses the promising technological
advancements of the last few years.

Keywords: deep learning; vehicle detection and classification; CNN; activation function; loss function

1. Introduction

Object detection and classification have received a lot of attention in recent years due to the wide range of possible applications and the recent flurry of activity in computer vision research. Most applications in ITS regarding vehicle detection and classification focus a great deal of effort on traffic accident investigation, traffic flow monitoring, fleet and transport management, autonomous driving, and similar tasks. Digital image processing techniques have been aggressively employed in recent years in vehicle shape, color, speed, and pose estimation. Simultaneously, computational power has increased. Nowadays, computer vision-based [1–3] platforms are equipped with high-core processors and graphics processing units (GPUs), which detect and classify objects for real-time implementations. Deep Learning (DL) and Machine Learning (ML) have exhibited vital CV research applications. Deep ConvNets provide various DL architectures for CV topics such as image classification, object detection, object recognition, learning, vehicle tracking, object pose estimation, and others.
An image is a two-dimensional digital distribution of pixel values designated by finite numbers. The pixels are denoted on the x–y spatial coordinate axis [4]. Digital image processing is a term that describes the processing of an image or video frame, taken as input, and involving a set of phases with various parameters and experimental setups. For example, detecting a vehicle would imply that images or video frames clearly show its presence, as well as its location, in an image. Therefore, object detection can be defined as a

means to locate samples of real-world objects in images. In this context, vehicle detection is
closely related to vehicle classification, since it involves defining the presence and location
of the vehicle in an image. However, the image is useless unless it is properly analyzed to
extract useful knowledge. Hand-crafted features (namely, Histogram of Oriented Gradient
(HOG) [5], Haar [6], and LBP [7]) are the most appropriate techniques to detect vehicles,
but they fail to provide a general solution, and the classifiers require some modifications to
fit various parameters. A shallow neural network is utilized as well for vehicle detection,
though its performance has not provided the desired quality. Handling this massive amount
of data necessitates the growth of an innovative method capable of performing quickly,
precisely, and consistently. Advancing the efficiency of vehicle detection and classification
accuracy, precision, and robustness through DL techniques, such as DCNNs, RCNNs,
and DNNs, improves the robustness of schemes in detecting and classifying vehicles from
images or video frames.
Rapid improvement and innovative ideas are utilized to improve the accuracy of detec-
tion and classification of DL schemes and to reduce computational costs during the training
and testing phases of DL schemes. Among these innovative approaches are those involving
the modification of DCNNs, transferring learning (TL), hyper-parameter optimization,
and implementation of image-preprocessing techniques (enhancement, scaling, median filtering, and fuzzy filtering) and Ensemble Learning (EL) in the proposed DL architectures. For better understanding, the abbreviations are given in the Abbreviations section.
The main contributions of this survey article are as follows:
• We survey the methodologies, benchmark datasets, loss and activation functions, and opti-
mization algorithms used in vehicle identification and classification in deep learning.
• We survey the strategies for vehicle detection and classification studies in Deep Con-
volutional Neural Networks.
• We address the taxonomy of deep learning approaches and other functions in object
detection and classification tasks (as shown in Figure 1).
• We present promising technological future directions and tasks in improving deep
learning schemes for researchers.
This paper is organized into the following sections. Section 2 explains a detailed
analysis of DL techniques. Section 3 discusses the publicly available benchmark datasets
and performance evaluation metrics. Section 4 explains the application of activation and
loss functions in DL. Section 5 explains the optimization algorithms in DL. Section 6 explains
applications of DL in vehicle detection and classification and compares recently employed
techniques. Section 7 briefly discusses some promising future directions and tasks that
have been adopted to improve and optimize DL schemes and to solve the difficulties and
challenges that occur during training and testing of the models. Section 8 is the conclusion
of the survey.

Figure 1. Taxonomy of the Deep Learning Approaches in Vehicle Detection and Classification Tasks.

2. Deep Learning Techniques


Object detection, recognition, and classification in computer vision are practically
helpful but technologically challenging. There are two main categories: multi-oriented
object detection and classification and single object recognition. DL approaches for object
detection and recognition and classification of images mainly focus on accurate object
recognition (improving detection and recognition performance), speed of testing, train-
ing, computational processes, and accurate object classification (minimizing the error
rate) [8,9].
Deep Learning deals with DNN architectures, where deep refers to the number of hidden layers, and its main objective is to resolve learning problems by mimicking the functioning
of the human brain [9,10]. Schemes employing DL have been developing and improving
consistently, as have adjustments to the model structure. Depending on the scheme, tuning
may be required or setups applied to upgrade the execution of the scheme. The designs of
DCNNs often involve the following essential elements:
Convolution Layer: The convolution layer is the initial layer that receives an input
image and extracts the features from that data. It utilizes small input data and learns
the data features by sustaining the correlation between values of pixels, which involves a
filter/kernel matrix and an image matrix, and the performance of a mathematical operation
to learn the features.
Activation Function: Activation functions are applied to the outputs of layers and can be linear or non-linear, depending on the mapping they implement.
Pooling Layers: These employ subsampling and spatial pooling techniques to min-
imize some parameters without removing the critical parameter. Various methods of
pooling are employed, including average, sum, and maximum approaches.
Fully Connected (FC) Layer: The final few layers are FC layers. After the final pooling
or CNN layer, the output feature maps are mainly flattened (vectors) and used as input to
FC layers. A Deep Nets Architecture is depicted in Figure 2.
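As a rough illustration of how these elements fit together, the following NumPy sketch passes a toy single-channel image through one convolution, a ReLU activation, max pooling, and a fully connected layer. All sizes, filter values, and the 10-class output are illustrative assumptions, not parameters from any architecture discussed here.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation of a single-channel image with one filter."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(0, x)

def max_pool(x, size=2):
    h, w = x.shape
    h, w = h - h % size, w - w % size              # crop to a multiple of the pool size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((28, 28))                       # toy single-channel input image
kernel = rng.standard_normal((3, 3))               # one 3x3 filter (random, i.e., untrained)
features = max_pool(relu(conv2d(image, kernel)))   # convolution -> activation -> pooling
flat = features.reshape(-1)                        # flatten the feature map for the FC layer
w_fc = rng.standard_normal((10, flat.size))        # FC weights for 10 hypothetical classes
logits = w_fc @ flat                               # FC layer output (class scores)
print(logits.shape)                                # -> (10,)
```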

Figure 2. A Deep Nets Architecture.

2.1. Techniques
In this subsection we discuss deep learning techniques.

2.1.1. Traditional Detection Methods


In more recent years, object recognition/detection and classification have been hot
research topics in computer vision-based applications. Various objects in various envi-
ronments may be challenging to detect, and, therefore, to classify and identify, due to the
following factors: weather, lighting, illumination effects, size of the objects, inter-class varia-
tions, intra-class variations, and other factors. In recent studies, many extracted AI features
have been employed to classify objects. The traditional feature-based object recognition
and classification approaches consist of three systems (see Figure 3):
• Region selection
• Feature extraction, and
• Classification.
The most common traditional feature-based architectures in the literature for vehi-
cle detection and recognition and classification are the Histogram of Oriented Gradient
(HOG) [5], Haar [6], and LBP [7].

Figure 3. Traditional Feature-based object Recognition and Classification Architecture.

Haar features are calculated from the sums of pixel values over rectangles and the differences between them across an image patch. As this is highly efficient at capturing the symmetric structure of vehicles [11], it is well suited to real-time detection.
vector and the AdaBoost [12,13] were widely used in CV to detect objects in a variety of
feature applications, including vehicle recognition [11].
HOG features are extracted in the following phases:
• Evaluating the edge and discretizing the image;
• Removing edge sharpness.
The HOG feature vector integrated with the Support Vector Machine (SVM) classifier
has been widely employed to recognize object orientation, i.e., on-road vehicle detec-
tion [14,15]. The HOG–SVM [16] performed admirably in multi-vehicle detection tasks.
In addition, a blend of HOG [5] and Haar [6] was employed for vehicle recognition, detec-
tion, and tracking [17].
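As a hedged sketch of the HOG–SVM pipeline discussed above, the snippet below extracts HOG descriptors with scikit-image and trains a linear SVM with scikit-learn. The patches, labels, and HOG parameters are placeholders for illustration and are not the settings used in the cited works.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def extract_hog(gray_image):
    # 9 orientation bins, 8x8-pixel cells, 2x2-cell blocks: common HOG settings
    return hog(gray_image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

# Placeholder data: replace with real grayscale vehicle / non-vehicle patches.
rng = np.random.default_rng(0)
patches = rng.random((40, 64, 64))          # 40 fake 64x64 patches
labels = np.array([1] * 20 + [0] * 20)      # 1 = vehicle, 0 = background

X = np.array([extract_hog(p) for p in patches])
clf = LinearSVC(C=1.0).fit(X, labels)       # linear SVM on HOG descriptors

# Classify a new patch: slide this over an image to obtain detections.
new_patch = rng.random((64, 64))
print(clf.predict([extract_hog(new_patch)]))
```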

Local Binary Pattern (LBP) [7] features have performed better in different applications,
including texture classification, face recognition, segmentation, image retrieval, and surface
crack detection. The cascade classifier (Haar–LBP–HOG feature) [18] detects vehicles with
bounding boxes. In addition to the previously mentioned features and classifiers for vehicle
detection and classification problems, statistical architectures, based on horizontal and
vertical edge features, were proposed for vehicle detection [19], side-view car detection [20],
online vehicle detection [21], and vehicle detection in severe weather using HOG–LBP
fusion [22].

2.1.2. CNN-Based Two-Step Algorithms


A two-step object detector, or the region-based approach, comprises two steps to
process an image:
• Produce a series of candidate frames or extract region proposals from the scene;
• Classify and regress the generated candidate frames to improve the architecture’s
detection accuracy.
The region-based approach has the properties of high localization and performance,
slower speed, and high computational cost during training. Figure 4 displays the archi-
tecture of a two-step object detector. Researchers have proposed several two-step object
detector algorithms and these have been employed for vehicle detection and classification
in more recent years. They are explained as follows:

Figure 4. Basic Architecture of Two-step Detector.

R-CNN: Girshick et al. [23] proposed an R-CNN or region-based ConvNet two-step


object detector architecture. In [23,24] AlexNet was employed as the backbone model of
the detector. It can increase the detection accuracy of objects over that of traditional object
detection algorithms, such as HOG [5], Haar [6] and LBP [7] feature extraction. The R-CNN
has four systems to accomplish the tasks. The operation of the algorithm is as follows:
• Produce category-independent region proposals;
• Extract a fixed-length feature vector from each region proposal;
• Compute the confidence scores to classify the object classes using class-specific support
vector machines;
• Predict the bounding-box regressor for accurate bounding-box predictions, once the
object class has been classified.
The authors adopted a selective search approach [25] to search for parts of the image
having higher probability. Convolutional neural networks (ConvNets) were used to extract
a 4096-dimensional feature vector from each proposed region. The length of each region's feature vector had to match the input vector length of the FC layers exactly. For this reason, the authors warped every candidate region to a fixed size of 227 × 227 pixels, regardless of the region's size or aspect ratio. When using R-CNN, the final FC layer is linked to the M + 1

classification layers (where M represents the number of object classes and 1 represents the background) to perform the final object classification. The convolution parameters are optimized with SGD. A region proposal with an IoU of less than 0.5 against the ground truth is treated as a negative sample; otherwise, it is treated as a positive one.
the region proposal and classification problems are carried out independently. However,
R-CNN has problems concerning computational cost and training time for classification.
To solve the problem of too much time required in the training process, convolutional
feature maps with high resolution can be generated at a low cost using the Fast R-CNN
architecture proposed by Girshick [26].
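The IoU rule described above (proposals with IoU below 0.5 against the ground truth are treated as incorrect) can be sketched as follows; the box coordinates are illustrative and boxes are assumed to be in (x1, y1, x2, y2) format.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_proposals(proposals, gt_boxes, pos_thresh=0.5):
    """Mark each region proposal positive (1) or negative (0) by its best IoU."""
    labels = []
    for p in proposals:
        best = max(iou(p, g) for g in gt_boxes)
        labels.append(1 if best >= pos_thresh else 0)
    return np.array(labels)

gt = [(50, 50, 150, 150)]                        # one ground-truth vehicle box
props = [(60, 55, 155, 145), (200, 200, 260, 260)]
print(label_proposals(props, gt))                # -> [1 0]
```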
Fast R-CNN: The Fast R-CNN [26] network takes as input an entire image and a set of
object proposals. It follows the following specific steps:
• Generate a convolution feature by using various convolution and max-pooling layers
on the entire image;
• Extract a fixed-length feature vector from the feature map for each object proposal of
Region of Interest pooling layers;
• Feed each feature vector into a sequence of FC layers to generate softmax probability predictions over M object classes plus 1 background (M + 1). The other output layer generates four real-valued numbers (the Bbox coordinates) for each object class. Fast R-CNN utilizes a streamlined training process with a fine-tuning step that jointly optimizes a softmax classifier and Bbox regressors.
Avoiding the separate training stages for a softmax classifier, SVMs, and regressors reduces the training time compared to the standard R-CNN architecture. The entire process architecture
includes loss, the SGD optimizer, the mini-batch sampling strategy, and BP through the
RoI pooling layers. However, Fast R-CNN uses a selective search approach over the
convolution feature map to explore its pooling map, increasing its run time. Using a new
region proposal network (RPN), Shaoqing et al. [27] proposed a faster RCNN architecture
to improve the Fast RCNN network in terms of run time and detection performance in
order to better estimate the object region at various aspect ratios and scales.
Faster R-CNN: In terms of operation time and detection performance, the faster
RCNN [27] is a more advanced variant of the RCNN. The traditional selective search method is replaced by an RPN, which gives outstanding predictions of object regions at various scales and aspect ratios. Anchors are placed at each convolutional feature location to create a
variety of region proposals. The anchor box in Faster RCNN has three different aspect
ratios and three different scales.
It comprises four systems to achieve object detection tasks: candidate region produc-
ing, feature extraction, classification, and location fine-tuning. In the RPN architecture,
the feature map is computed using a sliding window of 3 × 3, which is then output to the
Bbox classification and Bbox regression layers. Each point on the feature map is traversed
by the sliding window, which places z anchor boxes where they are needed. The feature
map’s z anchor boxes are used to extract its elements.
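A minimal sketch of the anchor mechanism described above: z = 9 anchors (three scales times three aspect ratios) are enumerated at one sliding-window position. The stride, scales, and ratios below are illustrative defaults and not necessarily those of any specific implementation.

```python
import numpy as np

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate scale x ratio anchor boxes (x1, y1, x2, y2) centred at (cx, cy)."""
    boxes = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)          # width/height chosen so that w * h ~= s * s
            h = s / np.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

stride = 16                              # feature-map stride w.r.t. the input image
fx, fy = 10, 7                           # one sliding-window position on the feature map
print(anchors_at(fx * stride, fy * stride).shape)   # (9, 4): z = 9 anchors
```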
R-FCN: The two-step object detection architecture can be categorized into two distinct
groups. One group represents classification networks like GoogleNet [28], ResNet [29],
AlexNet [24], and VGGNet [30]. Their computation is shared by all RoIs, and an image is tested with a single forward computation. In the second group, no computation is shared across RoIs, since the aim is to classify each object region separately. Dai et al. [31] proposed the R-FCN architecture as an improved version of the faster RCNN and partially eliminated the problem of position sensitivity and position variance by increasing the sharing of convolutional parameters. For the RFCN algorithm, the primary goal is the creation of “position-sensitive score maps”. Whether or not an RoI belongs to an object is determined by comparing it to the RoI sub-regions, which consist of the corresponding (s × s) parts. There is a shared convolutional layer at the end of the RFCN network.
An additional convolutional layer of dimension (4 × s²) is applied to the score maps to produce class-independent Bboxes. A softmax is used to calculate the results, after averaging the s² scores, to produce (M + 1)-dimensional vectors.

A comparison study was carried out on the most widely utilized two-step object
detectors on both the COCO dataset [32] and the PASCAL VOC 07 dataset [33]. In [34], experimentation showed that RCNN achieved 66% mAP on the PASCAL VOC 07 dataset [33], while Fast RCNN achieved 66% on the same dataset. In addition, the Fast RCNN network was nine times faster than the standard RCNN network. Wang et al. [35] conducted a comparative study on three networks, namely, fast RCNN, faster RCNN, and the RFCN, on two publicly available datasets, i.e., the COCO [32] dataset and the PASCAL VOC 07 [33] dataset. On the COCO test dataset, faster RCNN improved detection accuracy by 3.2% compared to the fast RCNN. Furthermore, the performances of both RFCN and the faster RCNN on both datasets were compared. The experimental results
revealed that RFCN outperformed the faster RCNN with superior detection accuracy and
less operational run time. Table 1 displays the fundamental advantages and disadvantages
of the most widely utilized two-step object detectors.

Table 1. Summary of the Two-step Algorithms in Object Detection and Classification Applications.

Algorithms | Advantage | Disadvantage
RCNN [23] | Utilizes a selective search approach to produce regions. Extracts 2000 regions from each image rather than running the standard CNN algorithm over the whole image. | High computational time. Slow speed because of using several networks for generating predictions. Difficult to detect small-scale objects.
Fast RCNN [26] | Each image is passed only once to the CNN algorithm, and feature maps are extracted. A selective search approach is employed on these maps to produce predictions. | Requires a high volume of real-time data. High computation time.
Faster RCNN [27] | Replaces the selective search approach with the RPN algorithm, which makes the algorithm much faster. | Requires several passes over a single image to extract all the object classes. The performance of the algorithm depends on how the preceding schemes have performed.
RFCN [31] | Uses position-sensitive score maps to solve the position sensitivity problem of object classification and detection. Has less computational time compared to the rest of the algorithms, due to its property of sharing every convolutional layer. | Has a competitive mAP, but it is lower than that of Faster R-CNN.

2.1.3. CNN-Based Single-Step Algorithms


There is no region proposal phase for the classification or detection of object classes in a single-step algorithm, and the prediction results are directly obtained from the image. In this algorithm, the input image is sampled uniformly at various positions, using different aspect ratios and scales, and the CNN layer is sampled to extract features to precisely execute regression and classification. The most notable merits of these models are that they are easier to optimize, suitable for real-time applications, and faster. Figure 5 displays the framework of the Basic Architecture of the One-step Detector.

Numerous single-step object detector algorithms have been utilized in the last couple of years for various applications, such as real-time vehicle detection and vehicle recognition, among others. Some of the most widely employed algorithms are the following: SSD [36],
RetinaNet [37], YOLO [38], YOLOv2 [39], YOLOv3 [40], YOLOv4 [41], and YOLOv5 [42].

Figure 5. Basic Architecture of One-step Detector.

RetinaNet Algorithm: Lin et al. [37] proposed the RetinaNet algorithm, which uses the focal loss as a classification loss. It addresses the class imbalance between positive and negative samples, which would otherwise degrade prediction accuracy. The authors introduced the focal loss to down-weight the many easy negative samples that come from the background. The algorithm utilizes the ResNet [43] model as a backbone and FPN [44]
as feature extraction architecture. It consists of two processes: generating a set of region
proposals via FPN and classification of each candidate.
SSD Algorithm: Liu et al. [36] proposed an SSD algorithm based on a feedforward
convolutional architecture that generates a fixed-size collection of bounding boxes and scores for
existing object class samples, followed by an NMS stage to generate the detection process.
The SSD algorithm utilizes a VGG16 [43] architecture as a backbone for feature extraction
and six more convolutional layers for detection. It generates sequences of feature maps of
various scales, followed by a 3 × 3 filter on each feature map to generate default Bboxes. It
only detects at the top layers to get the best prediction Bbox and class label.
YOLO Algorithm: The YOLO algorithm [38] is a CNN-based one-step object detector that was designed after two-step object detectors such as the faster RCNN. The YOLO algorithm is most applicable for real-time image detection. It uses far fewer region proposals per image than the faster RCNN. It utilizes a grid of size (t × t) to split the image into grid cells for prediction. Each grid cell estimates B bounding boxes and C class probabilities for the C object classes. For each box, the objectness probability (P) and the IoU between the ground truth and the box are considered. The YOLO algorithm has 2 FC layers and 24 convolution layers. However, the algorithm has the problem of weak object localization, which affects the classification accuracy.
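To make the (t × t) grid prediction concrete, the hedged sketch below decodes a YOLO-style output tensor of shape (t, t, B*5 + C) into boxes, objectness scores, and class probabilities. The tensor here is random, and the grid size, box count, and class count are illustrative assumptions.

```python
import numpy as np

t, B, C = 7, 2, 20                        # 7x7 grid, 2 boxes per cell, 20 classes (illustrative)
rng = np.random.default_rng(0)
pred = rng.random((t, t, B * 5 + C))      # stand-in for the network output

boxes, scores, class_probs = [], [], []
for i in range(t):                        # grid row
    for j in range(t):                    # grid column
        cell = pred[i, j]
        cls = cell[B * 5:]                # C conditional class probabilities
        for b in range(B):
            x, y, w, h, conf = cell[b * 5: b * 5 + 5]
            # (x, y) are offsets inside cell (i, j); (w, h) are relative to the image
            boxes.append(((j + x) / t, (i + y) / t, w, h))
            scores.append(conf)           # objectness confidence
            class_probs.append(cls)

print(len(boxes), np.array(class_probs).shape)   # 98 boxes, (98, 20)
```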
YOLOv2 Algorithm: The YOLOv2 algorithm [39] improves on the YOLO algorithm in detection precision and offers higher speed than the standard YOLO algorithm.
It contains 6 consecutive tasks to efficiently perform the detection process, namely the BN,
high-resolution classifier, convolution with anchor box, various aspect ratios and scales of
the anchor box, fine-grained feature techniques, and multi-scale training.
The training process of the YOLOv2 algorithm [39] is carried out through the SGD optimizer, which employs mini-batches. The mini-batch mean and variance are calculated and utilized for normalizing the activations. Then, every mini-batch activation is normalized to zero mean and a standard deviation of 1. In the end, all elements in every mini-batch are sampled using a uniform distribution. This process is carried out through batch normalization (BN) techniques [45].

It generates activations with a uniform distribution, which speeds up its operation to obtain conver-


gence. The YOLOv2 model uses a high-resolution classifier as a backbone, increasing the input resolution to 448 × 448, and classification fine-tuning is implemented at this resolution for 10 epochs to improve its mAP by 4%.
Moreover, techniques of convolution anchor box are also utilized to generate region
proposals to predict the object-class score and class for each estimated Bbox, leading to an
improvement of its recall by 7%. Furthermore, the model uses the anchor box’s size and
aspect ratio prediction technique with K-means clustering. Fine-grained features for small
objects and multi-scale training with image sizes of 320, 352, ..., 608 improve the detection
of objects of different sizes.
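The mini-batch normalization step described above (zero mean and unit standard deviation per mini-batch, as in BN [45]) can be sketched as follows. The learnable scale (gamma) and shift (beta) parameters and the batch sizes are illustrative; this is a minimal sketch rather than any framework's implementation.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch of activations to zero mean / unit std, then scale and shift."""
    mean = x.mean(axis=0)                 # per-feature mini-batch mean
    var = x.var(axis=0)                   # per-feature mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
activations = rng.standard_normal((32, 64)) * 3 + 5   # mini-batch of 32, 64 features
out = batch_norm(activations, gamma=np.ones(64), beta=np.zeros(64))
print(out.mean(axis=0).round(3)[:4], out.std(axis=0).round(3)[:4])   # ~0 and ~1
```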
YOLOv3 Algorithm: The YOLOv3 Algorithm [40] is another improved version of the
YOLO Algorithm. It utilizes the DarkNet53 model for feature extraction and employs a
multi-label classification with overlapping patterns for the training process. It is primarily
notable for object detection in complex scenes. In addition, in the YOLOv3 Algorithm, vari-
ous sizes of three feature maps are utilized to predict the Bbox. The last convolution layer
is used to produce a three-dimensional tensor that consists of objectness, class predictions,
and Bbox.
YOLOv4 Algorithm: Single-step object detection algorithms, such as the YOLOv4
Algorithm [41], combine the properties of YOLO, YOLOv2, and YOLOv3 and achieve the
current optimum in terms of both accuracy and speed. The residual system receives the
feature layer and outputs higher-level feature information. Algorithms like YOLOv4 are composed of three sections, called the “Backbone”, “Neck”, and “Prediction” sections. The SPPNet and PANet form the neck. In the SPPNet, features in the feature layer are max-pooled by kernels of various scales and then concatenated. To increase the receptive field of the architecture, the pooled result is appended and convolved three times, and the concatenated feature layers are up-sampled after being concatenated with the outputs of the SPPNet and the Backbone (CSPDarkNet53). The process cycles through up-sampling and down-sampling of feature layers for feature fusion and for compression of height and width. The resulting layers are then stacked on top of each other to create new combinations of features. The features extracted from the model are used to make predictions according to the prediction scheme. Prediction results from the network are filtered using the efficient Non-maximal Suppression (NMS) [46] technique.
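A minimal sketch of the NMS filtering step referenced above: boxes are processed in descending score order, and any box that overlaps an already kept box by more than a threshold IoU is suppressed. The box format, scores, and threshold are illustrative.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; boxes are (x1, y1, x2, y2)."""
    boxes, scores = np.asarray(boxes, float), np.asarray(scores, float)
    order = scores.argsort()[::-1]               # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]     # drop boxes that overlap too much
    return keep

dets = [(10, 10, 60, 60), (12, 12, 62, 62), (100, 100, 150, 150)]
print(nms(dets, [0.9, 0.8, 0.7]))                # -> [0, 2]
```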
YOLOv5 Algorithm: The YOLOv5 algorithm utilizes CSPDarkNet as a backbone for
the feature extraction model to extract feature information from the input data. Compared
to the other variants of the YOLO algorithm, it has better capability to detect small objects,
excellent detection accuracy, and is more adaptable and faster. It has 4 modules. The CSPNet
architecture eliminates the gradient information duplication problem of model optimization
in massive models and combines the gradient variation from the previous to the final into
feature maps. Consequently, decreasing the FLOPS and parameter volume of the architecture improves the accuracy and speed of the model and also decreases the size of the architecture. The detection efficiency depends on the computation of the frame-selection area; to improve the model in this respect, the FCOS approach [47] has been proposed.
The model employs the CSPDarkNet feature extraction model to extract image features
competently and utilizes Bottleneck CSP instead of a residual shortcut link to strengthen
the description of the image features. The neck system is mainly employed to produce a
feature pyramid. Feature pyramids help the network find objects of different scales and sizes.
The CNN-based object detector has been applied to many DL-based applications.
Its purpose is commonly illustrated as an effective, efficient object detection, recognition,
and classification application with fewer error rates. The detector has been applied to face
mask recognition [48,49], real-time vehicle detection [50], vehicle classification [51], off-
road quad-bike detection [52], pedestrian detection [53], medical image classification [54],
automotive engine crack detection [55] and so on.

Recent studies show that the CNN-based object detection algorithms (single-step
and two-step object detectors) are gaining momentum in vehicle detection/recognition
and classification. The algorithms are employed to detect and classify object classes from
images and videos. Kausa et al. [56] utilized both single and two-step object detector
approaches for two-wheeled and four-wheeled vehicle detection from publicly available
datasets. Vasavi et al. [57] also applied integrated YOLO and RCNN algorithms for vehicle
detection and classification from high-resolution images. YOLOv3 and a faster RCNN algorithm for detecting vehicles at night using tail-light images were implemented by [58].
It is essential to understand some of the object detection algorithms’ strengths and
limitations (see Tables 1 and 2). The detection and classification performance of the model is
affected by various factors. Many studies have aimed to fix or decrease errors in predicting
the exact object class and to ensure the algorithms work better.

Table 2. Summary of the Single-step Algorithms in Object Detection and Classification Applications.

Networks | Advantage | Disadvantage
SSD | Simple neural network. Low computational expense. | Low detection accuracy in complex scenarios.
RetinaNet | Enhanced detection precision on small objects. Suitable for class-imbalanced training. | Requires real-time detection.
YOLOv1 | Fast compared to the two-step object detectors. Single globally trainable module. Offers higher generalization when evaluated on another dataset. | Poor performance for a set of small object classes, due to its grid set-up. High localization error.
YOLOv2 | Dramatically enhances the speed and accuracy of object detection. Easily detects objects with grids and boundary predictions, which also helps in predicting tiny objects or objects that are very far away in the image. | Complex training.
YOLOv3 | Fast, robust predictions of objects in real time. Computationally inexpensive. | Worst at detecting medium and large objects.
YOLOv4 | Excellent detection accuracy. Better training optimization. | Poor small-target detection accuracy.
YOLOv5 | Outstanding detection/recognition accuracy. Low false detection rate. Works efficiently. Low computational cost. Easy to set up. | Has both global maxima and local minima.

We summarized the performance of the one-step and two-step object detectors on the
COCO dataset and PASCAL VOC. The performance of deep learning-based object detection
is affected by a series of elements, such as the following: feature extraction classifiers, type
of backbone, image size and scale, training strategy, loss function, activation function,
number of region proposals, etc. These elements make it challenging to compare several
algorithms without a shared benchmark background. Table 3 shows the performance of
the various algorithms employed in object detection tasks. The algorithms were compared
using various performance evaluation metrics, such as frames per second (FPS) and average precision (AP) at inference time. The AP@0.5 represents the average precision of the object classes when the estimated Bbox has IoU > 0.5 with the ground truth, and the AP@0.5–0.95 averages the AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05. The
performances of the selected models were assessed on the same-sized input, where possible,
to offer flexibility between inference time and detection accuracy.

Table 3. The summary of Performances of the Various Algorithms Employed in Object Detection.

Networks | Backbone | Dataset | Image Size | AP@0.5:0.95 | AP@0.5 | FPS
RCNN | AlexNet | PASCAL VOC 12 | 224 | - | 58.50 | 0.02
Fast RCNN | VGG-16 | PASCAL VOC 12 | variable | - | 65.70 | 0.43
Faster RCNN | VGG-16 | PASCAL VOC 12 | 600 | - | 67.00 | 5
R-FCN | ResNet-101 | COCO | 600 | 31.50 | 53.20 | 3
RetinaNet | ResNet-101-FPN | COCO | 400 | 31.90 | 49.50 | 12
SSD | VGG-16 | COCO | 300 | 23.20 | 41.20 | 46
YOLOv1 | GoogleNet | PASCAL VOC 12 | 448 | - | 57.90 | 45
YOLOv2 | DarkNet-19 | COCO | 352 | 21.60 | 44.00 | 81
YOLOv3 | DarkNet-53 | COCO | 320 | 28.20 | 51.50 | 45
YOLOv4 | CSPDarkNet-53 | COCO | 512 | 43.00 | 64.90 | 31

3. Benchmark Datasets and Performance Evaluation Metrics


In this section, we describe the different benchmark datasets and performance evalua-
tion metrics.

3.1. Benchmark Datasets


This section provides an overview of the common publicly available vehicle datasets
utilized in vehicle detection, classification, and recognition tasks. Creating a large dataset
volume under different lighting and weather conditions is challenging in vision-based ar-
chitectures. The most famous vehicle datasets and benchmarks have been available for the
last ten years, including the BIT vehicle dataset, comprehensive car datasets, KITTI bench-
mark datasets, Stanford car dataset, Tsinghua-Tencent Traffic Sign dataset, MotorBike7500,
Tsinghua-Daimler Cyclist benchmark, etc.
BIT Vehicle Dataset: The BIT Vehicle Dataset addresses the time-consuming effort required to speed up the growth of intelligent transportation system (ITS) vehicle type classification (VTC). In appearance-based tasks, it has been utilized in several applications, such as speed estimation, illegal vehicle detection, traffic flow, fleet management, and incident detection. It contains six object classes with 150 vehicles each, providing 900 vehicles: buses, microbuses, minibuses, SUVs, sedans, and trucks. Various conditions
of illumination, time, color, viewpoint, and scale are applied. It introduced a classification
accuracy of 93.8% and assessed the performance of the proposed model with an unlabeled
vehicle over random values to capture rich discriminative information about vehicles
for VTC.
CompCars Dataset: The Comprehensive Cars Dataset is one of the publicly available
datasets. Images of both web and surveillance nature are included in the data set. It
was launched in 2015, and its popularity has improved in the real-world application of
appearance-based tasks. The web-nature scenario consists of 136,727 images that capture
the entire part of the car and 27,618 car parts with labels and viewpoint. At the same time,
the surveillance-nature data contains 44,481 images captured from the front view and
annotated with Bbox, model, and color. The CompCar Dataset introduces four unique
features compared to the other currently available datasets, such as car hierarchy, viewpoint,
car attributes, and car parts.
KITTI Benchmark Dataset: The KITTI Benchmark Dataset [59,60] is one of the most
widespread datasets used in autonomous traffic scenarios, consisting of various modalities,
namely, high-resolution RGB, 3D laser scanner, and grayscale stereo cameras. Despite its
popularity, the dataset does not have ground truth for segmentation purposes. However,
many researchers have manually labeled the images to fit their needs for experimentation.
Alvarez et al. [61,62] provided the ground truth of the dataset for 323 images from road de-

tection challenges with three object classes: road, sky, and vertical. Further, Zhang et al. [63]
labeled 252 captured RGB images from Velodyne scans and the tracking challenges for
ten object classes: sky, car, building, vegetation, fence, cyclist, sidewalk, road, pedestrian,
and sign pole. Ros et al. [64] also labeled 216 images from two odometer challenges from
eleven object classes: sky, car, road, fence, bicyclist, sign, building, sidewalk, pedestrian,
pole, and tree.
Stanford Car Dataset: The Stanford Car Dataset [65] is one of the publicly available
car datasets for extensive research purposes. It contains 8144 training sample images and
8041 unseen images with object classes of 196 car types. It was launched in 2013, and its
popularity has increased in object class detection and scene understanding. The authors extensively researched 3D object representations that outperform their 2D counterparts for fine-grained categorization, and illustrated their effectiveness for estimating 3D geometry from images.
MotorBike7500 Dataset: The MotorBike7500 Dataset [66] is one of the benchmark
motorcycle image datasets. It contains 7500 annotated images captured under real-time
road traffic scenes with 60% occlusion rate. The images were resized to 640 × 364 pixels
with 41,040 region of interest-annotated objects. The ground truth describes the frames
covered by the objects, class, name, height, and width of the Bbox surrounding the ob-
ject and provides an ID; the reported schemes achieve a performance of 92% on the benchmark dataset.
MotorBike10000 Dataset: The MotorBike10000 Dataset [66] is the extension of Motor-
Bike7500 benchmark motorcycle image dataset. It contains a range of 10,000 annotated
images captured under windy conditions with 60% occlusion rate. The images were resized
to 640 × 364 pixels with 56,975 RoI annotated objects. The ground truth produced describes
the frames covered by the objects, class, name, height, and width of the Bbox surrounding
the object and provides an ID; the reported schemes achieve a performance of 92% on the benchmark dataset.
Tsinghua–Tencent Traffic Sign Dataset: The Tsinghua–Tencent Traffic Sign (TTTS)
Dataset [67] consists of 30,000 samples of traffic signs and 100,000 images. The pictures are
captured under diverse climatic conditions and lighting.
Tsinghua–Daimler Cyclist Benchmark: The Tsinghua–Daimler Cyclist Benchmark
(TDCB) [68] provides a benchmark dataset for cyclist detection with six object classes:
Mopedrider, pedestrian, Tricyclist, Cyclist, Wheelchair user, and Motorcyclist. It consists of
Bbox of training, testing, and validation datasets of 16,202, 13,163, and 3045, respectively.
Experimental results show an average precision of 89% for the easy case, which gradually
reduces when the difficulty increases.
Cityscapes Dataset: The Cityscapes dataset [69] includes collections of street scenes from 50 different cities captured across diverse seasons, with 20,000 weakly (coarsely) annotated and 5000 finely annotated pictures, respectively.
GRAM Road-Traffic Monitoring (GRAM–RTM) Dataset: The GRAM–RTM Dataset [70]
consists of video clips recorded under diverse conditions and on several platforms using
surveillance cameras. It is widely utilized to evaluate the architecture of tracking several
vehicles labeled in different classes, such as large trucks, cars, trucks, and vans. Each video
clip contains around 240 distinct annotated objects.
MIO–TCD Dataset: The MIO–TCD Dataset [71] is a dataset widely utilized for mo-
torized traffic analysis. It consists of 11 object categories, such as motorcycles, bicycles,
pedestrians, cars, buses, and trucks, with 786,702 labeled images captured under various
times, seasons, and periods using traffic surveillance cameras.
UA–DETRAC Benchmark Dataset: The UA–DETRAC Benchmark Dataset [72]
contains 100 video clips recorded at 24 diverse locations with diverse traffic patterns
and conditions, such as traffic crossings, highways, and T-junctions, using a Canon EOS
550D camera.
LSVH Dataset: The LSVH Benchmark Dataset [73] consists of 16 video clips of vehicles
with large-scale variations captured using surveillance cameras under diverse weather,
scene, time, and resolution conditions.

COCO Dataset: The Microsoft COCO Benchmark Dataset [32] consists of 91 object
classes across 328,000 images with 2,500,000 labeled instances. It also has significantly more samples per class than PASCAL VOC [33].
PASCAL VOC Dataset: The PASCAL VOC Benchmark Dataset [33] is a publicly avail-
able dataset that contains annotated images collected from the Flickr photo-sharing website.
It is a widely utilized dataset in object detection and classification to evaluate architectures.
ImageNet Dataset: The ImageNet Benchmark Dataset [74] is built on roughly 80,000 synsets of WordNet with an average of 500–1000 clean, full-resolution images per synset; it comprises 12 subtrees with 5247 synsets and 3.2 million images.
Caltech101 Dataset: The Caltech101 Benchmark Dataset [75] consists of images of
101 object classes. It is widely utilized in object recognition tasks.
Caltech256 Dataset: The Caltech256 Benchmark Dataset [76] is a successor of the Caltech101 benchmark dataset that extends the object classes to 256 to improve the performance of multi-class object recognition with few training samples.
DAWN Dataset: The purpose of the DAWN Dataset [77] is to explore the effectiveness
of vehicle detection and classification approaches of a wide range of natural images for
traffic situations under cross-generalization to adverse environmental conditions. It varies substantially in terms of vehicle category, size, orientation, pose, illumination, position, and occlusion. Furthermore, this dataset deliberately emphasizes traffic scenes during bad winter weather, heavy snowfall, sleet and rain, hazy weather, and sand and dust storms.

3.2. Performance Evaluation Metrics


Several performance measures are used to quantify the performance of object detectors and classifiers, namely, Precision (P), Frames per Second (FPS),
Recall (R), True Positive Rate (TPR), False Positive Rate (FPR), Average mean Precision
(AmP), intersection over union (IoU), average precision (AP), Accuracy, F1-Score, and Area
Under Curve (AUC). The existing vehicle detection and classification approaches, as well as
their corresponding performance measures, are shown in Table 4. Table 5 presents the various types of performance evaluation metrics and their mathematical equations.

Table 4. Existing Works’ Performance outcomes.

References | Approach | Dataset | Evaluation Metrics
Zuraim et al. [78] | YOLOv4 + DeepSORT | Own dataset | 82.08% average precision.
Xu et al. [79] | Modified YOLOv3 classifier | VEDAI dataset | 91.72% average precision.
Liu et al. [80] | BFEN + SLPN + PNW | DETRAC benchmark dataset | 88.71% mAP.
Nguyen et al. [81] | Soft NMS algorithm + Faster RCNN classifier | KITTI dataset; LSVH dataset | 83.92% average precision on KITTI; 64.72% average precision on LSVH.
Dai et al. [82] | Faster RCNN + SSD classifier | KITTI dataset; PASCAL2007 car dataset | 85.22% average precision on KITTI; 64.83% average precision on the PASCAL2007 car dataset.
Nguyen et al. [83] | Faster RCNN with FPN backbone | KITTI dataset; PASCAL2007 car dataset | 88.95% average precision on KITTI; 78.84% average precision on the PASCAL2007 car dataset.
Fan et al. [84] | Faster RCNN classifier | KITTI dataset | 83.36% average precision.

Table 5. Summary of Various Performance Evaluation Metrics.

Evaluation Metrics | Mathematical Formulae
Precision | $P = \frac{TP}{TP + FP}$
Recall | $R = \frac{TP}{TP + FN}$
Frame Per Second | $FPS[n] = \frac{n \cdot FPS[n-1]}{FPS[n-1] + n - 1}$
Intersection over Union (IoU) | $J(Bbox_p, Bbox_g) = \frac{\mathrm{area}(Bbox_p \cap Bbox_g)}{\mathrm{area}(Bbox_p \cup Bbox_g)}$
Average mean Precision | $mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i$
Average Precision | $AvP = \sum_{n}\left(Re_{n+1} - Re_n\right) \max_{\hat{Re} \ge Re_{n+1}} P(\hat{Re})$
True Positive Rate | $TPR = \frac{TP}{TP + FN}$
False Positive Rate | $FPR = \frac{FP}{FP + TN}$
Accuracy | $Accuracy = \frac{TN + TP}{TP + FP + TN + FN}$
F1-Score | $F1\text{-}Score = \frac{2 \times R \times P}{R + P}$
Area Under Curve | $AUC = \frac{1}{2} - \frac{FPR}{2} + \frac{TPR}{2}$
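The snippet below evaluates several of the metrics in Table 5 directly from hypothetical TP/FP/TN/FN counts; the counts are made up purely for illustration.

```python
def detection_metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)               # equals the true positive rate (TPR)
    fpr = fp / (fp + tn)                  # false positive rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    auc = 0.5 - fpr / 2 + recall / 2      # the simplified AUC formula of Table 5
    return dict(precision=precision, recall=recall, fpr=fpr,
                accuracy=accuracy, f1=f1, auc=auc)

# Hypothetical confusion counts from a vehicle-detector evaluation.
print(detection_metrics(tp=80, fp=10, tn=95, fn=20))
```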

4. Activation Functions in Deep Learning


This section presents the various types of activation functions and recent advances
in existing activation functions employed in DL and ML applications. It highlights recent
trends in utilizing the activation functions for deep learning-based vehicle detection, classi-
fication, and recognition. The most common activation functions used in Deep Learning
architectures are shown in Figure 6.
Activation functions can be linear or non-linear, depending on the function they
convey when monitoring the results of networks. This technique can be used for a variety
of purposes. As an example of how it can be deployed, consider image classification, image
segmentation, and machine translation, as well as finding objects such as cars and other
vehicle types.
Most of the time, the affine transformation is used to conduct linear mapping from an
input function to an output function in the hidden layers of the linear net architecture. The
data x transformation is described in the following way, as shown in Equation (1).

$f(x_i) = w^{T} x_i + b_i$ (1)

Data input, weight, and biases are all represented by xi , w, and bi , respectively. Ad-
ditional computation is then necessary to translate these linear outputs into non-linear
outputs for the AF, notably to learn patterns in data from the mapping from Equation (2).
These net architectures produce the following results:

$Y = (w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_d x_d + b_i)$ (2)

Each layer’s output is fed into a subsequent layer until the final output is achieved,
but, by default, they are linear. For each net, the anticipated output determines the type

of AF deployed. Since these outputs are linear by default, they cannot express non-linear relationships on their own. Transfer functions (TF) are therefore applied to the outputs of linear net architectures to perform the additional computation that converts them into non-linear outputs. Mathematically, it is defined in
Equation (3).

$Y = \psi(w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_d x_d + b_i)$ (3)

where ψ is the activation function.


The requirements for these activation functions include transforming the linear input signals of the net architectures into non-linear output signals, which helps deeper nets learn high-order polynomials beyond degree one. Generally, the activation function should keep the gradients from dying, while exploding gradients arise because of the derivative terms. These are achieved using various mathematical functions employed
for network computing.
Table 6 presents a summary of the most popular activation functions used in DL
applications, such as object detection, image classification, and object type recognition, and
their positions in DL models, as shown in Table 7.

Figure 6. Pictorial Representation of Activation Function Responses.

4.1. Loss Function in Deep Learning


Developing proper cost functions for CV-based tasks has been a long-standing re-
search direction to improve the ability of the present schemes. Its primary purpose is to
evaluate the difference between the actual value of the samples and the estimated value.
The robustness and convergence of the recommended system mainly depends on the value
of the cost function.
The CV community has witnessed progress in image classification and object detection in recent years. Improvements to framework design, for instance, in single-step and two-step deep detectors, have advanced the state of the art (SOTA) considerably.
Recently, several innovative approaches have been introduced in the cost function design
and the loss-based training schemes for deep architectures. Liu et al. [85] proposed a
powerful convergence simulation-driven evolutionary search approach (CSE–Autoloss) to
speed up searches by regularizing the rationality of the loss candidates using two modules
(convergence property verification (CPV) and model optimization simulation (MOS)).

Table 6. Summary of the Activation Functions in DL Applications.

Functions | Formula | Advantage | Disadvantage
Sigmoid | $f(x) = \frac{1}{1 + e^{-x}}$ | Suitable for light networks. Used in feedforward NNs. Bounded and differentiable actual function. | Dramatically declines gradients during back-propagation. Has the nature of gradient saturation. Slow convergence, and the non-zero-centered output leads the gradient updates to propagate in various directions.
Tanh | $f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ | Presents outstanding training performance for MLP NNs. Generates zero-centered output to assist the back-propagation process. | Generates dead neurons during computation. High computational complexity.
ReLU | $f(x) = \max(0, x)$ | Faster learning activation compared to others. Most successful and widely employed function. Presents outstanding performance and generalization in DL architectures compared to the sigmoid and Tanh functions. Simple to optimize. No gradient saturation problems. Low computational cost. | Tends to over-fit compared to the sigmoid function. Insubstantial during the training process, leading some of the gradients to die. It is not a zero-centered function.
ELU | $f_{ELU}(x) = x$ if $x > 0$; $\Gamma(e^{x} - 1)$ if $x \le 0$ | Can solve the problem of gradient vanishing using identity values. Improves the ability to learn characteristics of DL systems. Can minimize the computational complexity via the mean unit activation. | A high degree of computational complexity.
Softmax | $f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$ | Used for multivariate classification tasks. | Not suitable for binary classification problems.
Softplus | $f(x) = \log(1 + e^{x})$ | Has smoothing and non-zero gradient properties that improve the stabilization and performance of DL, with fewer epochs to convergence during the training process. Can handle the vanishing gradient problem. | A high degree of computational complexity.
Swish | $f(x) = \frac{x}{1 + e^{-x}}$ | Uses automatic search approaches to compute the function. Presents outstanding optimization and generalization outcomes. Does not suffer from gradient vanishing. Requires simple scalar inputs. | A high computational complexity.
ELiSH | $f(x) = \frac{x}{1 + e^{-x}}$ if $x > 0$; $\frac{e^{x} - 1}{1 + e^{-x}}$ if $x < 0$ | Presents excellent optimization and generalization outcomes. Does not suffer from gradient vanishing. Requires simple scalar inputs. Reduces the problem of gradient vanishing to improve information flow. | —
Maxout | $f(x) = \max(w_1^{T} x + b_1, \dots, w_n^{T} x + b_n)$ | Easy to generalize. | A high computational complexity.
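For reference, a few of the activation functions in Table 6 written out in NumPy. These are direct transcriptions of the standard definitions and are meant only as a sketch, not as any framework's implementation.

```python
import numpy as np

def sigmoid(x):  return 1.0 / (1.0 + np.exp(-x))
def tanh(x):     return np.tanh(x)
def relu(x):     return np.maximum(0.0, x)
def elu(x, a=1.0): return np.where(x > 0, x, a * (np.exp(x) - 1.0))
def swish(x):    return x * sigmoid(x)
def softplus(x): return np.log1p(np.exp(x))

def softmax(x):
    e = np.exp(x - np.max(x))             # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, elu, swish, softplus):
    print(f.__name__, f(z).round(3))
print("softmax", softmax(z).round(3))
```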

Table 7. Types and Positions of Activation Functions in DL Models.

Models | Hidden Layers | Output Layers
SENet | ReLU | Sigmoid
ResNeXt | ReLU | Softmax
AlexNet | ReLU | Softmax
DenseNet | ReLU | Softmax
GoogleNet | ReLU | Softmax
EfficientNet | ReLU | Softmax
MobileNet | ReLU | Softmax
ResNet | ReLU | Softmax
ImageNet | ReLU | Softmax
SqueezeNet | ReLU | Softmax
VGGNet | ReLU | Softmax
Inception | ReLU | Softmax

The loss function consists of classification loss (Cls) and location loss (Lls). The deep two-step object detector algorithms employ a hybrid of the L1 loss and Cross-Entropy [86] for regression and Bbox classification. In contrast, the deep single-step object detector algorithms suffer from severe positive–negative instance imbalance, due to dense sampling of possible object locations. Lin et al. [37] proposed the Focal Loss to solve the imbalance problem. However, optimizing object detectors with traditional approaches to detection loss functions may result in sub-optimal solutions due to limited connections with performance evaluation metrics. Therefore, Jiang et al. [87] predicted the IoU during training; the IoU loss series includes the IoU loss, the bounded IoU loss, and the generalized IoU loss. The IoU loss and the distance-IoU loss are used to directly optimize the IoU between estimated and actual boxes. This work epitomizes the essence of developing practical loss functions with better alignment to the performance evaluation metrics of object detection tasks.

4.2. Classification Loss Functions in Deep Learning


This section explains the most common loss functions employed in Deep learning for
classification tasks. Table 8 presents a summary of classification loss function formulae.

Table 8. Summary of Classification Loss Functions in Deep Learning.

Loss Functions | Mathematical Formula
Hinge Loss | $L(z) = \max(0, 1 - t \cdot z)$, with $z = w \cdot X + b$
Squared Hinge Loss | $L(Q, \hat{Q}) = \sum_{j \ne i}^{n} \left(\max(0, 1 - Q_i \cdot \hat{Q}_i)\right)^2$
Kullback–Leibler Divergence | $D_{KL}(E \,\|\, B) = \sum_i E(i)\log\frac{E(i)}{B(i)} = \sum_i E(i)\left(\log E(i) - \log B(i)\right) = \sum_i E(i)\log E(i) - \sum_i E(i)\log B(i)$
Cross Entropy Loss | $L(P, \gamma) = -\sum_{i=0}^{n} \gamma_i \log P_i$
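A short sketch of two of the classification losses in Table 8, cross-entropy and hinge loss, evaluated on a toy example; the predicted probabilities, scores, and label encodings are illustrative.

```python
import numpy as np

def cross_entropy(probs, onehot, eps=1e-12):
    """Cross-entropy between predicted class probabilities and a one-hot target."""
    return -np.sum(onehot * np.log(probs + eps))

def hinge(score, target):
    """Binary hinge loss; target is +1 or -1, score is the raw classifier output w.x + b."""
    return max(0.0, 1.0 - target * score)

probs = np.array([0.7, 0.2, 0.1])          # softmax output for 3 classes
onehot = np.array([1.0, 0.0, 0.0])         # true class is class 0
print(cross_entropy(probs, onehot))        # ~0.357
print(hinge(score=0.3, target=+1))         # 0.7: inside the margin, so penalized
```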

4.3. Location Loss Functions in Deep Learning


This section explains the most common loss functions employed in Deep Learning for localization (regression) tasks. Table 9 presents a summary of location loss function formulae.

Table 9. Summary of Location Loss Functions in Deep Learning.

Location Loss Functions | Mathematical Formula
Absolute Loss | $L(Y, f(X)) = |y - f(x)|$
Sum of Absolute Differences | $L(Y, f(X)) = \sum_{i=1}^{n} |y_i - f(x_i)|$
Mean Absolute Error | $L(Y, f(X)) = \frac{1}{n}\sum_{i=1}^{n} |y_i - f(x_i)|$
Mean Square Error | $L(Y, f(X)) = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2$
Huber Loss | $L(Y, f(X)) = \frac{1}{2}\left(y - f(x)\right)^2$ for $|y - f(x)| \le \lambda$; $\lambda|y - f(x)| - \frac{1}{2}\lambda^2$ otherwise

Regression-based problems using loss functions have merit and limitations. Table 10
shows some of the pros and limitations of commonly used loss functions in regression-
based problems.

Table 10. Summary of the Loss Functions in Regression-based Problems.

Loss Functions | Advantage | Disadvantage
Mean Square Error Loss | The GD has only a global minimum; no local minima. Penalizes the network architecture for making large mistakes. | Not robust if the samples contain outliers.
Mean Absolute Error Loss | More robust compared to MSE. | High computational cost. Has local minima. Large gradient even for small loss values.
Huber Loss | Outliers are handled wisely. No local minima. It is differentiable at zero. | Requires extra hyperparameter optimization techniques.
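The regression losses of Tables 9 and 10 in a single NumPy sketch, including the Huber loss with its λ threshold; the target and prediction values are made up, with the last target acting as an outlier to show the robustness difference.

```python
import numpy as np

def mse(y, y_hat):   return np.mean((y - y_hat) ** 2)
def mae(y, y_hat):   return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, lam=1.0):
    """Quadratic for small residuals, linear for large ones (robust to outliers)."""
    r = np.abs(y - y_hat)
    quad = 0.5 * r ** 2
    lin = lam * r - 0.5 * lam ** 2
    return np.mean(np.where(r <= lam, quad, lin))

y     = np.array([1.0, 2.0, 3.0, 10.0])    # the last target acts as an outlier
y_hat = np.array([1.1, 1.9, 3.2, 4.0])
print(mse(y, y_hat), mae(y, y_hat), huber(y, y_hat))
```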

5. Optimization Algorithms in Deep Learning


Optimization Algorithms (OAs) are vital approaches for updating DL/ML parameters
and reducing the value of the loss function [88,89]. Understanding the principles of various
OAs and their roles in hyperparameter tuning improves the performance of DL/ML architectures. This is carried out by rapidly adjusting the weights and other parameters until the objective function converges.
However, optimization provides a means to reduce the cost function for DL architec-
tures. The aims of OA and DL are different. Essentially, optimization approaches explore a suitable architecture and reduce errors with less computational cost within the given
dataset samples. Furthermore, several researchers have conducted experiments to solve
the noticeable challenges using analytical and numerical solutions. The most common
tricky optimization challenges in Deep Learning are vanishing gradient, local minima,
and saddle points.
Back-Propagation (BP) is an approach to training nets. The approach repeats two
process cycles, propagation and updating weights. Training errors from the output layer
propagate to the other nodes backwards. Errors are utilized to compute the cost function’s
gradient concerning the parameter in the net. Then, the gradient is fed to the optimization
approach, which utilizes it to update the weights to diminish the cost function. Moreover,
the gradient of the objective function is mainly dependent on the dataset samples utilized
and the gradient descent approach employed [89].
The most well-known OAs, implemented in various methods to decrease the cost
function and fasten the learning of the architectures, are the following: Gradient Descent
(GD) [90], Stochastic Gradient Descent (SGD) [91], Nesterov Momentum (NM) [92], Ada-
grad [93], Adadelta [94], RMSProp [95], Adaptive Momentum (Adam) [96], and Adapg [88].

Gradient Descent (GD): GD is a well-known optimization algorithm [90]. It is a
technique for decreasing an objective function F(δ), parametrized by an architecture's
parameters δ ∈ R^d, by updating the parameters in the direction opposite to the gradient of
the objective function F(δ). The learning rate, φ_t, determines the size of the steps taken to
reach a local minimum. Mathematically, it is defined in Equation (4).

δ_{t+1} = δ_t − φ_t ∇F(δ_t)    (4)

Hence, φ_t is the LR, and ∇F(δ_t) is the gradient of the cost function at the t-th iterate.
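A minimal sketch of this update rule (our own illustration; grad_F and the quadratic example are hypothetical) is:

```python
import numpy as np

def gradient_descent(grad_F, delta0, lr=0.1, num_iters=100):
    """Plain gradient descent, Equation (4): delta <- delta - lr * grad F(delta)."""
    delta = np.asarray(delta0, dtype=float)
    for _ in range(num_iters):
        delta = delta - lr * grad_F(delta)
    return delta

# Example: minimize F(delta) = ||delta||^2, whose gradient is 2 * delta.
print(gradient_descent(lambda d: 2.0 * d, delta0=[3.0, -2.0]))
```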
Stochastic Gradient Descent (SGD): this updates the parameters (δ_t) frequently, so
the objective function is subject to wild swings, due to the SGD [91] algorithm's rapid
gradient computations over single samples. Nevertheless, SGD can be improved with a slowly
decaying learning rate, which results in a lengthy training period. In addition, the architecture's
speed is hampered by the frequent transfer of data between GPU memory and local memory.
The mathematical process of the SGD algorithm is depicted in Equation (5).

δ_{t+1} = δ_t − φ_t ∇F_i(δ_t)    (5)

Hence, F_i(δ) = l(y_i, f_δ(x_i)); at the t-th iteration, a sample i is picked at random and the parameters are updated.
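The per-sample variant in Equation (5) can be sketched as follows (an illustrative NumPy snippet of ours; grad_Fi stands for the gradient of the loss on a single sample and is an assumption):

```python
import numpy as np

def sgd(grad_Fi, delta0, num_samples, lr=0.01, num_iters=1000, seed=0):
    """SGD, Equation (5): at each step pick a random sample i and follow -grad F_i."""
    rng = np.random.default_rng(seed)
    delta = np.asarray(delta0, dtype=float)
    for _ in range(num_iters):
        i = int(rng.integers(num_samples))     # randomly picked training sample
        delta = delta - lr * grad_Fi(delta, i)
    return delta
```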
Nesterov Momentum (NM): In this method, the gradient is calculated based on fu-
ture positions of the parameters rather than the current positions of the parameters [92].
An increase in momentum alone does not indicate where the parameters will end up. A mathematical
representation of the NM algorithm can be found in Equation (6).

m_t = β m_{t−1} + (1 − β) ∇F_i(δ_t)
δ_{t+1} = δ_t − α_t m_t    (6)

where β is the momentum coefficient and m_t is the momentum term at the t-th iteration.
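The update rule of Equation (6) can be sketched as below (our illustration of the formula exactly as written, with lr playing the role of α_t; the classical Nesterov variant additionally evaluates the gradient at the look-ahead point δ_t − α β m_{t−1}):

```python
def momentum_update(delta, m, grad, beta=0.9, lr=0.01):
    """One step of Equation (6): m_t = beta*m_{t-1} + (1-beta)*grad; delta_{t+1} = delta_t - lr*m_t."""
    m = beta * m + (1.0 - beta) * grad
    delta = delta - lr * m
    return delta, m
```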


Adagrad: The Adagrad is a well-known OA utilized in DL architectures [93]. It is an
approach that adapts the LR (φ) to the situation. Since the gradient and LR values
are inversely proportional, it is well suited to dealing with sparse data. Dean et al. [97]
showed that Adagrad significantly enhanced the robustness of SGD, and they utilized
it for training large-scale frameworks at Google to detect cats. It scales the LR (φ) for
each parameter according to the history of the gradients for that parameter (δ), which is
done by dividing the current gradient in the update rule by the square root of the sum of
the squared past gradients. Mathematically, it is defined in Equation (7).

G_t = G_{t−1} + ∇F(δ_t)²
δ_{t+1} = δ_t − (φ / √(G_t + e)) ∇F(δ_t)    (7)

where G_t is the accumulated sum of the squared past gradients and e is a small value for numerical stability.
However, the Adagrad approach has the disadvantages of treating all the past gradients
equally and of requiring a manually selected global LR. The Adadelta algorithm, which instead
applies an exponentially weighted decay to the gradient history, is suggested to address these limitations.
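One Adagrad step following Equation (7) can be sketched as (an illustrative snippet of ours, with variable names chosen for readability):

```python
import numpy as np

def adagrad_update(delta, G, grad, lr=0.01, eps=1e-8):
    """Equation (7): accumulate squared gradients in G and scale the LR per parameter."""
    G = G + grad ** 2
    delta = delta - lr * grad / np.sqrt(G + eps)
    return delta, G
```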
Adadelta: The Adadelta optimization approach was derived from the Adagrad ap-
proach so as to address the following limitations of Adagrad [94]:
• The continual decay of the learning rate (φ) throughout the training phase;
• The requirement for a manually selected global learning rate.
Thus, it combines the merits of the Adagrad and Momentum approaches. Mainly, it scales
the LR based on the past gradients. Nevertheless, it only utilizes a recent time window
instead of the whole history, as is the case for Adagrad. It also employs a component that
serves as a momentum term, which accumulates historical updates. A mathematical
representation of the Adadelta algorithm can be found in Equation (8).

E[∇F(δ)]_t = η E[∇F(δ)]_{t−1} + (1 − η) ∇F(δ_t)
E[∇F(δ)²]_t = η E[∇F(δ)²]_{t−1} + (1 − η) ∇F(δ_t)²
δ̂_t = − ( √(E[δ̂²]_{t−1} + e) / √(E[∇F(δ)²]_t + e) ) ∇F(δ_t)    (8)
E[δ̂²]_t = η E[δ̂²]_{t−1} + (1 − η) δ̂_t²
δ_{t+1} = δ_t + δ̂_t

where η is the decay rate of the running averages and e is a small value for numerical stability.
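A step of the learning-rate-free update in Equation (8) can be sketched as follows (our own illustration; rho plays the role of the decay rate η):

```python
import numpy as np

def adadelta_update(delta, Eg2, Edx2, grad, rho=0.95, eps=1e-6):
    """Equation (8): scale updates by running averages of squared gradients and squared steps."""
    Eg2 = rho * Eg2 + (1.0 - rho) * grad ** 2                  # E[grad^2]_t
    step = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad    # delta-hat_t
    Edx2 = rho * Edx2 + (1.0 - rho) * step ** 2                # E[delta-hat^2]_t
    return delta + step, Eg2, Edx2
```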


RMSProp: Tieleman et al. [95] proposed an RMSProp algorithm to solve the problem
of the LR vanishing in the Adagrad approach. It makes use of an exponentially decaying mean
of the previous squared gradients [98]. A mathematical representation of the RMSProp algorithm
can be found in Equation (9).

E[∇F(δ)²]_t = η E[∇F(δ)²]_{t−1} + (1 − η) ∇F(δ_t)²
δ_{t+1} = δ_t − (φ / √(E[∇F(δ)²]_t + e)) ∇F(δ_t)    (9)

where η is the decay rate, e is a small value for numerical stability, and φ is the learning rate.
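An RMSProp step following Equation (9) can be sketched as (illustrative code of ours):

```python
import numpy as np

def rmsprop_update(delta, Eg2, grad, rho=0.9, lr=0.001, eps=1e-8):
    """Equation (9): divide the LR by a decaying average of squared gradients."""
    Eg2 = rho * Eg2 + (1.0 - rho) * grad ** 2
    delta = delta - lr * grad / np.sqrt(Eg2 + eps)
    return delta, Eg2
```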
Adaptive Momentum Estimation: The Adaptive Momentum Estimation (Adam) [96] is an
alternative method that calculates adaptive LRs for each parameter. Furthermore, it stores
exponentially decaying averages of both the historical gradients and the historical squared
gradients, combining the RMSProp and momentum approaches with a bias-correction mechanism.
Adam's update rule consists of the following steps and, mathematically, it is defined in Equation (10).

m_t = β_1 m_{t−1} + (1 − β_1) ∇F(δ_t)
v_t = β_2 v_{t−1} + (1 − β_2) ∇F(δ_t)²
m̂_t = m_t / (1 − β_1^t)    (10)
v̂_t = v_t / (1 − β_2^t)
δ_{t+1} = δ_t − (φ / (√v̂_t + e)) m̂_t

Hence, β_1 can be 0.9, β_2 can be 0.999, and e is a small value for numerical stability; m_t is
the mean of the gradients and v_t is the uncentered variance of the gradients.
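An Adam step following Equation (10), including the bias correction (with the iteration counter t starting at 1), can be sketched as (our own illustration):

```python
import numpy as np

def adam_update(delta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Equation (10): momentum plus RMSProp with bias-corrected moment estimates."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    delta = delta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return delta, m, v
```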
Adapg: The Adapg is also a new optimization algorithm, which combines both the
Adadelta and Adam optimizers [88]. Mathematically, it is defined in Equation (11).

E[∇F(δ)]_t = η E[∇F(δ)]_{t−1} + (1 − η) ∇F(δ_t)
E[∇F(δ)²]_t = η E[∇F(δ)²]_{t−1} + (1 − η) ∇F(δ_t)²
δ̂_t = − ( √(E[δ̂²]_{t−1} + e) / √(E[∇F(δ)²]_t + e) ) E[∇F(δ)]_t    (11)
E[δ̂²]_t = η E[δ̂²]_{t−1} + (1 − η) δ̂_t²
δ_{t+1} = δ_t + δ̂_t

where η is the decay rate and e is a small value for numerical stability.



The optimization algorithms have been widely utilized to reduce errors and accelerate
architecture processing time with less computational cost by updating the parameters on the
dataset samples. A comparison study [90] of optimization approaches for DL architectures
using four publicly available datasets was conducted to investigate the efficiency of the
approaches. The datasets were Labeled Faces in the Wild (LFW), MNIST, Kaggle Flowers,
and CIFAR10, whose various attributes were evaluated against the SGD, NM, Adagrad, Adadelta,
RMSProp, and Adam OAs. Zaheer et al. [99] conducted a study of OAs for training DL
architectures, in which the parameters are learned so as to reduce the loss function during the
training phase. They employed six methods, SGD, NM, Adagrad, Adadelta, RMSProp,
and Adam, on different datasets: MNIST, CIFAR10, FASHIONMNIST, and CIFAR100.
They achieved optimal training accuracies of 1.0 on FASHIONMNIST with RMSProp and Adam
at 400 epochs, on MNIST with RMSProp and Adam at 200 epochs, on CIFAR100 with RMSProp
and Adam at 100 epochs, and on CIFAR10 with RMSProp and Adam at 200 epochs. Their
experimental results illustrated that the Adam optimizer performed outstandingly at the testing
stage and that RMSProp and Adam performed best at the training stage.
To summarize, RMSProp is an extension of Adagrad designed to alleviate its rapidly
diminishing LR. It is similar to Adadelta, except that Adadelta utilizes the RMS of the parameter
updates in the numerator of its update rule. Finally, Adam adds bias correction and
momentum to RMSProp. RMSProp, Adam, and Adadelta are similar approaches that
perform comparably in related settings. According to Zaheer et al. [99], its bias correction helps
the Adam optimizer to outperform RMSProp during testing, while RMSProp and Adam perform
best during training. Across various studies and papers, Adam may be the best overall choice of
optimization algorithm [100].

6. Application of DCNN for Vehicle Detection and Classification


This section discusses various difficulties and challenges in vehicle detection and
classification, the application of DCNN, and a review of related works.

6.1. Difficulties and Challenges


This section discusses the difficulties and challenges of detecting, recognizing, and clas-
sifying vehicular objects.
Research communities have, for a long time, focused on the question, “What are
the difficulties and challenges in vehicle object detection, classification, and recognition?”
This question is not an easy one to answer, being a question that addresses other areas of
object detection tasks, such as pedestrian detection and traffic sign detection and recog-
nition. Various constraints, difficulties, and challenges arise in attempting to answer the
question, depending on objectives and assignments [101]. However, the following
challenges and difficulties are frequently seen in appearance-based object detection
and classification tasks: weather conditions, various camera viewpoints, vehicle size, vehicle
color, vehicle inter-class variation, the need to speed up classification and detection, correct
vehicle localization, and dense and occluded vehicle detection and classification. Weather
conditions, such as heavy fog, snowing, rain, snowstorms, dusty blasts, and low light
conditions have a significant impact on detection accuracy and processing time. As a result
of these conditions, visibility is inadequate for accurate detection of vehicles on the roads,
resulting in traffic accidents. A clear view can be achieved by developing successful image
enhancement techniques to gain good visuals. Providing clear images to detection systems
can, thus, improve the performance of vehicle detection and tracking in intelligent visual
surveillance systems and autonomous vehicle applications. Furthermore, by utilizing effi-
cient image processing techniques [77], various vehicle detection approaches, such as Deep
learning, ensemble learning, and other real-time-based vehicle detection using camera
sensors, have grown in importance in autonomous vehicles due to their high detection
accuracy, and have, thus, become significant in self-driving applications.

6.2. DL in Vehicle Detection


This section summarizes related works and their findings on vehicle detection using
various DL approaches.
The rapid growth in digital image processing and computing systems has enabled
the robust, accurate, and efficient employment of CV-based vehicle detection techniques.
However, the framework efficiency mainly depends on the type of vehicles, illumination
and light, size of vehicles, inter-class and intra-class variations, environment, and occlusion
and blurred conditions. Considering these challenges and the difficulty of vehicle detection,
directly utilizing generic detection networks is not an optimal solution. There may be
some priors that can be used to improve vehicle detection. Table A1 summarizes related
comparisons of real-time DL architectures from the literature review. The reason for the
different reported results can be attributed to various factors: the type of loss function
utilized, the different datasets used, various hyperparameters, the framework of the model,
and the type of hardware used.
In the early stages of research, before the DL era, vehicle detection was mainly based
on sliding windows, developed by Viola and Jones. Dense image grids were encoded by
handcrafted features followed by a training classifier to explore and locate objects [102].
Haselhoff and Kummert [103] proposed a cascade of boosted classifiers using Haar and triangle
features with a Kalman filter for vehicle detection, and achieved good performance in
determining the vehicle’s position accurately. After the rapid growth of DL in image
classification, vehicle detectors based on DL significantly outperformed traditional vehicle
object detectors.
The current vehicle detection networks based on DL are extended from generic sys-
tems, such as YOLO, SPPNet, SSD, Faster RCNN, and Fast RCNN. Multi-scale learning
methods are widely used in vehicle detection because they can handle objects of many different
sizes and scales.
Kim et al. [104] proposed a YOLOv3-based architecture that combined prediction
layers using SPPNet to complement the detection accuracy for multi-scale variations in
traffic surveillance data. Chen et al. [105] proposed an inception–SSD algorithm for small
vehicle detection, which was found to be more suitable for vehicle detection on various
aspect ratios and scales of default bounding boxes. They made predictions on the KITTI and
UVD datasets. They developed a trade-off between speed and vehicle detection accuracy,
based on the SSD algorithm. To improve multi-scale detection, Zhao et al. [106] proposed
the feature pyramid enhancement strategy (FPES) [44], based on semantic information,
detailed features, and receptive fields [106]. Cascade detection and adaptive threshold
acquisition approaches for the object detection module (ODM) stage were also presented to
improve network accuracy.
Zhang et al. [107] developed an enhanced version of the RetinaNet technique to
improve the representation of feature maps using octave convolution and to reduce gradient
propagation in the extraction of multi-scale features by employing a weighted feature
pyramid network (WFPN). Their approach effectively handled gradient propagation at
various levels and low-resolution problems, but the performance gain was minor. Unlike this
approach, Wang et al. [108] proposed a focal loss-based RetinaNet algorithm, which was
utilized to resolve issues of critical class imbalance in the standard one-step object detector,
so as to improve performance.
Moreover, some algorithms focus on contextual information for multi-scale feature
learning. Vehicle objects have a relationship with the surrounding context, namely, color,
shadows, the structure of vehicles, and size and shape, which have become an effective
means to improve detection performance. Hu et al. [73] proposed SINet, based on a
scale-insensitive ConvNet for fast detection of vehicles with a significant variance in scales.
They utilized context-aware RoI pooling to handle the contextual information of the original
structure of small objects. In addition, they proposed a multi-branch detection algorithm
to reduce the intra-class distance features. Luo et al. [109] developed a state-of-the-art
architecture that can be used to effectively detect multi-scale vehicle targets in traffic scenes.
They improved the architecture in the following ways: NAS optimization
and feature enrichment. There are several steps in this process. First, they implemented a
Retinex-based adaptive image correction algorithm to improve image quality and minimize
shadow and illumination effects. Then, they utilized a backbone model, NAS, for feature
extraction in order to produce the best cross-layer connection for extracting multiple layers
of features. Finally, they used object feature enrichment to integrate the multiple layers of
features and contextual data.
Beyond designing robust or context-assisted object detectors, several studies have
been conducted on various approaches. Nguyen et al. [81] proposed an improved system
based on faster RCNN for fast vehicle detection. They replaced the NMS algorithm with the
Soft-NMS algorithm to solve the problem of duplicate proposals, and a contextual-aware
RoI pooling layer was adopted to adjust the proposals to a specified size without losing
crucial contextual information. At the end of the MobileNet algorithm, the framework of
depth-wise separable convolution is used to generate a classifier for each identified vehicle.
Wang et al. [22] proposed an R-FCN algorithm equipped with deformable convolution
and RoI pooling for vehicle detection. It has a better detection time and more precision.
Wang et al. [35] conducted comparative studies on the most widely employed algorithms,
Faster RCNN, RetinaNet, YOLOv3, RFCN, and SSD. They showed that RFCN is very
powerful for generalizing real scenes and has outstanding detection on rainy days and at
nighttime. Moreover, the SSD network also has good generalization ability and can detect
most target vehicles in an environment with poor lighting conditions.
Arora et al. [110] recommended a fast RCNN architecture to detect vehicles under
various environmental conditions. The proposed model obtained an average of recall,
accuracy, and precision of 98.44%, 94.20%, and 90%, respectively. Charouh et al. [111]
suggested a resource-efficient CNN-based model for detecting moving vehicles on large-
scale datasets. Rajput et al. [112] proposed a toll management system, using Yolov3
architecture, for vehicle identification and classification. Amrouche and his colleagues
proposed a Yolov4 architecture for a real-time vehicle detection and tracking system [113].
Wang et al. [114] introduced an integrated part-aware refinement network, which combines
multi-scale training and component confidence generation strategies in vehicle detection.
This system improves detection accuracy and time taken in detecting various vehicles on
publicly available datasets.
Faris et al. [115] proposed a Yolo-v5 architecture vehicle detector using the techniques
of transfer learning on publicly available datasets, namely, PKU, COCO, and DAWN.
The experimental results showed that the proposed model achieved state-of-the-art performance in the
detection of various vehicles. Huang et al. [116] introduced an embedded system of Yolov4,
K-means and TensorRT to detect the real-time target from UAV images. They achieved a
confidence and miss detection rate of 89.6% and 3.8%, respectively. Furthermore, to balance
the architecture’s detection accuracy and computational complexity, Qiu et al. [117] intro-
duced a linear transform approach, increasing the detection accuracy and the detection
frame using simple operations over the input image. However, the road and the various
shapes and sizes of vehicles affect the system’s detection accuracy and detection frame in
the detecting and recognizing scheme. Yolov7-RAR was proposed to minimize the miss
detection of non-linear features and speed up the architecture in [118].
To further improve detection accuracy, some researchers implemented an ensemble
learning technique on pre-trained models. Mittal et al. [119] proposed an EnsembleNet
model for vehicle detection and estimation of traffic density with a detection accuracy of
98%. Figure 7 is a sample block diagram of the vehicle detection process, using multi-type
vehicle images, and based on fine-tuned DNN models.

Figure 7. Vehicle Detection Process Based on Fine-tuned DNN
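As a hedged illustration of the fine-tuning step in such a pipeline (a generic sketch assuming a recent torchvision release, not the exact procedure of any cited work; num_classes, the class list, and the dummy batch are placeholders to be replaced by a real vehicle dataset loader):

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 5  # background + four illustrative vehicle classes (car, bus, truck, motorcycle)

# Start from a detector pre-trained on COCO and replace its box-prediction head.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Tiny dummy batch so the sketch runs end-to-end (replace with a real annotated loader).
images = [torch.rand(3, 480, 640)]
targets = [{"boxes": torch.tensor([[100.0, 120.0, 300.0, 260.0]]),
            "labels": torch.tensor([1])}]
data_loader = [(images, targets)]

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=5e-4)
model.train()
for batch_images, batch_targets in data_loader:
    loss_dict = model(batch_images, batch_targets)   # classification + box-regression losses
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```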

6.3. DL in Vehicle Classification


This section summarizes related works and their findings in vehicle classification
using various approaches.
Vehicle classification is a crucial part of the ITS and has several applications: intelligent
parking systems, driver assistance, fleet management, maintenance systems, traffic flow
statistics, automatic toll collection, accident analysis, investigation, and transportation
system design and monitoring. With the rapid growth of image classification in recent
years, much research has been done on computer vision-based vehicle classification using
traditional object classifiers, such as SVM, and CNN-based object classifiers to train classifi-
cation networks. However, the efficiency of the traditional approach is not robust due to
unstable feature extraction from various changes, such as occlusion, blurring, illumination
and lighting effects, environment, size and shape of vehicles, and diverse poses. Consider-
ing these problems in vehicle classification, directly employing the traditional approach
is not an acceptable solution to classify vehicle categories/types in various conditions
with a lower error rate. Further improvement in vehicle classifiers should be considered a
core task.
Several kinds of research have been utilized in vehicle object classification tasks,
namely vehicle type classification, vehicle damage type classification and detection, ve-
hicle target classification and recognition, vehicle model, type, and manufacturer, color
recognition, and vehicle counting. In recent years, diverse classifiers of model-based and
vision-based approaches have been utilized. The model-based approaches recover the
vehicle’s length, height, and width from various view images for vehicle classification.
In contrast, the vision-based approaches extract appearance features from either vehicle
side view, rear view, or front view images to classify vehicle types. Gupte et al. [120]
proposed a non-rigid model-based approach to classifying vehicles by comparing the
projection with the vehicle image to determine the class of the vehicle. Petrovic et al. [121]
proposed integrating Sobel edge responses, direct normalized gradients, edge orientation, locally
normalized gradients, and Harris features to classify vehicle types. Psyl-
los et al. [122] proposed SIFT features to recognize the model, logo, and manufacturer of a
vehicle. Peng et al. [123] introduced a system to designate a vehicle by vehicle front, color,
type, and width for vehicle type classification. However, this approach utilizes handcrafted
features, which struggle to describe vehicle appearance well enough. To handle these problems, Dong et al. [124]
proposed a semi-supervised ConvNet algorithm for vehicle type classification on the BIT-
vehicle dataset. They used sparse filtering to capture rich and discriminative information
about vehicles. To improve the vehicle type classification of the model, Awang et al. [125]
proposed an enhanced sparse-filtered ConvNet algorithm with a layer-skipping strategy
(SF-ConvNetLS) to classify vehicle types. They employed three channels of SF–ConvNetLS
as the feature extraction approach.
The DL outperformed conventional object classifiers after the rapid development of DL
applications in image classification. The current vehicle object classifier based on DL has
dramatically shifted from the model-based approach to the vision-based approach to improve
classification accuracy and to resolve the challenges faced during real-time classification.
Several DL studies have been conducted to address classification problems since the
excellent performance exhibited by Krizhevsky et al. [24] in the ImageNet ILSVRC [126]
using DConvNets. Szegedy et al. [28] introduced a novel DNN using Inception networks that maximize
the depth of architectures without increasing the number of parameters. Simonyan and
Zisserman [30] demonstrated that 3 × 3 receptive fields in the first conv layers were more
effective than 11 × 11 receptive fields with stride four or 7 × 7 with a stride of 2, which
improved the performance on ILSVRC.
Manugmai and Nuthong [127] proposed a DL-based vehicle classification approach to
classify vehicle type and color. They showed that the ConvNet architecture outperformed
the conventional machine learning approaches in classification. Wang et al. [128] proposed
AVC using center-strengthened ConvNet to extract more features from a central image
by ROI pooling, based on the VGG model joined with the ROI pooling layer to obtain
elaborate feature maps. Awang and Azmi [129] presented a ConvNet architecture with a
skipping strategy model to classify vehicles with identical sizes of different object classes,
and Jahan et al. [130] proposed real-time vehicle classification using ConvNet. They used
two ways to find features and classify different types of vehicles.
Lee and Chung [131] proposed a DL-based vehicle classification using an ensemble
of K local experts and global networks. They used multi-crop testing, network training
of k local experts, and global networks with an ensemble of AlexNet [126], ResNet [29],
and GoogleNet [28] to classify various vehicles. They achieved outstanding performance
on the MIO-TCD classification challenges. In order to improve the mean precision of the
models, Liu et al. [132] proposed a two-step approach of DA and an ensemble of ConvNet
algorithms to solve the dataset-imbalance problem, combined with hyperparameter
optimization. They showed that the ensemble technique with DA improved the
precision. Liu et al. [80] presented a semi-supervised network motivated by a combination
of various DNNs with DA techniques based on GAN. It includes several steps to improve
classification accuracy on the MIO–TCD dataset.
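A score-level ensemble of this kind can be sketched as follows (our own simplified illustration using ImageNet-pretrained torchvision backbones; in practice each member would first be fine-tuned on the vehicle dataset and the class count adapted accordingly):

```python
import torch
import torchvision.models as models

# Three heterogeneous backbones, in the spirit of an AlexNet/ResNet/GoogleNet ensemble.
members = [models.alexnet(weights="DEFAULT"),
           models.resnet18(weights="DEFAULT"),
           models.googlenet(weights="DEFAULT")]

def ensemble_predict(image_batch):
    """Average the softmax scores of all members and return the arg-max class per image."""
    probs = []
    for net in members:
        net.eval()
        with torch.no_grad():
            probs.append(torch.softmax(net(image_batch), dim=1))
    return torch.stack(probs).mean(dim=0).argmax(dim=1)

# Example usage on a dummy batch of two 224x224 RGB images.
print(ensemble_predict(torch.rand(2, 3, 224, 224)))
```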
Furthermore, Jagannathan et al. [133] proposed a GMM and ensemble DL approach to
detect and classify various moving vehicles on both the BIT-vehicle dataset and the MIO-
TCD dataset. They utilized adaptive histogram equalization and GMM to improve image
quality, and a steerable pyramid transform and Weber local descriptor (WLD) were used
to extract feature vectors. Then, the extracted feature vectors were fed into the ensemble
DL approach for the vehicle classification task. They showed that the proposed model
outperformed the benchmark models on both datasets.
Table A2 summarizes the comparison of DL-based vehicle classification architectures
from the literature review. The type of loss function used, the different datasets used,
different hyperparameters, the framework of the model, and the type of hardware used all
lead to different results.

7. Future Directions
Despite the rapid growth and promising object detection and classification processing
in DL applications, there are still several open issues for future work.
Various methods for detecting and classifying small vehicles in publicly available
datasets have been developed. To enhance the classification and localization accuracy of
small vehicle objects under several occlusions, inter-class variation, intra-class variation,
illumination, light, environment, etc., it is necessary to modify the model architecture in the
following aspects:
Multi-task joint optimization and Multi-model information combination: Due to the
relationship between several tasks in vehicle object classification and detection, Multi-task
joint optimization has been studied by several researchers, such as the following: in person
re-identification [134], human action grouping and recognition [135], dangerous object detection [134],
fast object detection [136], multi-task vehicle recognition and tracking [137],
multi-task vehicle pose estimation [138]. Moreover, several approaches have been integrated
to improve the performance of the architectures.
Scale and size alteration: Objects typically appear in a variety of scales and sizes, which
is more noticeable in small objects. For scale- or size-variant objects, multi-scale object
classifiers and detectors are required to maximize the robustness to scale and size changes.
Powerful backbone algorithms, such as ResNet, Inception, MobileNet, and AlexNet, can
be utilized for scale-/size-invariant detection and classification tasks. FPN generates
multi-scale feature maps, and GAN-based approaches narrow the representation gap between small
and large objects at lower computational complexity for the multi-scale detectors and
classifiers. The network offers insights into producing a meaningful feature pyramid
for scale-adaptive detectors. It is necessary to integrate cascade architecture and scale
distribution estimation to identify objects adaptively.
Spatial Correlations and Contextual Modeling: Spatial distribution plays an essential
role in object detection and image classification. Therefore, region proposal generation and
grid regression are employed to obtain probable object locations. However, the correlations
between several proposals and object classes are disregarded. In addition, the global
structure information is not captured by the position-sensitive score maps in RFCN. To solve
these problems, use of various techniques, such as sequential reasoning tasks and subset
selection, in a collaborative way is advocated.
Cascade Architecture: In the cascade network, a cascade of detectors is built in several
phases. However, the existing cascade architectures are built greedily, where previous
phases in cascades are fixed when training a new phase. So, the optimization of different
ConvNets cannot be accomplished, which makes the need for end-to-end optimization for
the ConvNet cascade architecture even more important.
Weakly supervised and Unsupervised Learning: Practically, it is inefficient and labor-
intensive to label a large volume of bounding boxes manually. To address this issue,
different architectures can be combined to perform exceptionally well by utilizing image-
level supervision to assign object classes to match object regions and object boundaries.
This technique leads to improved detection flexibility and minimized labor costs.
Model Optimization: A technique of model optimization in DL applications and
schemes is essential to balance accuracy, speed, and memory, by choosing an optimal
detector and classifier.
Detection or Classification in Videos: Real-time object classification and detection in
videos is a significant issue for video surveillance and autonomous driving. Conventional
object classifiers or detectors are usually designed for image-wise detection and classifica-
tion, while simply ignoring the correlations between video frames. An essential direction
of research is to enhance detection or classification performance by searching for spatial
and temporal correlations.
Lightweight Classification or Detection: Lightweight architectures are still compromised by
the classification errors that develop in the models, and there remains a shortage of detection
accuracy. Although great efforts have been made in recent years, detection and classification
speed are not yet well balanced with accuracy.

8. Conclusions
In this paper, a comprehensive survey of some of the significant growth, successes,
and demerits associated with applying DL techniques in vehicle (object) detection and
classification is presented. To prove the efficiency of applying DL techniques in vehicle
(object) detection and classification, benchmark datasets, loss functions, activation func-
tions, and various experiments and studies recently implemented and completed in vehicle
detection and classification are reviewed. A detailed analysis of deep learning techniques,
reviews of some significant detection and classification applications in vehicle detection
and classification, an in-depth analysis of their challenges, and promising technical
improvements in recent years are addressed. Finally, we suggest many future directions for
thoroughly understanding the object detection and classification landscape. This survey
is also meaningful for the growth of Nets and related learning frameworks, which offer
valuable insights and guidelines for future progress.

Author Contributions: Conceptualization, M.A.B., Y.F. and H.F.; investigation, M.A.B., J.M., S.J. and
H.F.; writing-original draft preparation, M.A.B., J.M. and S.J.; experiments, M.A.B., Z.U.A. and Y.F.;
Review and editing, M.A.B., J.M. and A.K.; Supervision, Y.F.; Funding acquisition, A.K., H.F., S.J.
and S.M.S. All authors have read and agreed to the published version of the manuscript.
Funding: This research was sponsored by the Guangzhou Government Project under Grant No.
62216235 and the National Natural Science Foundation of China (Grant No. 622260-1).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare that they have no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Adam Adaptive Momentum


AI Artificial Intelligence
BP Back-propagation
CV Computer-vision
DCNNs Deep Convolutional Neural Networks
DL Deep Learning
DNNs Deep Neural Networks
EL Ensemble Learning
FC Fully Connected
GD Gradient Descent
GPUs Graphic Processing Units
HOG Histogram of Oriented Gradient
ITS Intelligence Transportation System
LBP Local Binary Pattern
LR Learning Rate
ML Machine Learning
OAs Optimization Algorithms
RCNNs Regional Convolutional Neural Networks
SGD Stochastic Gradient Descent
TL Transfer Learning

Appendix A
Appendix A.1

Table A1. Summary of Various Algorithms and Datasets utilized in Vehicle Detection.

Reference | Dataset Used | Network(s) | Findings
Sang et al. [139] | BIT-vehicle dataset (training: 7880; validation: 1970); CompCar dataset (testing: 800) | YOLOv2; Model-Comp; YOLOv2-vehicle | YOLOv2-vehicle has higher precision and average IOU than YOLOv2 and Model-Comp; Model-Comp has a higher average IOU than YOLOv2.
Xu et al. [79] | COCO dataset | YOLOv3; improved YOLOv3; Faster RER-CNN; modified YOLOv3 | The modified YOLOv3 has higher average precision than the improved YOLOv3, YOLOv3, and Faster RER-CNN.
Liu et al. [80] | DETRAC dataset | Faster RCNN; EB; BFEN; BFEN+2FC; BFEN+SLPN; BFEN+SLPN+PNW | BFEN+SLPN+PNW outperforms Faster RCNN, EB, BFEN, BFEN+2FC, and BFEN+SLPN.
Mansour et al. [140] | Imagery from the JF-2 and WORLD-VIEW satellites | Faster RCNN + Inceptionv2; SSD + Inceptionv2 | Faster RCNN with Inceptionv2 has higher mAP than SSD with Inceptionv2, but a higher operation time.
Sowmya et al. [141] | COCO test set; PASCAL VOC 07 test set | ResNet101; VGG16; RCNN (AlexNet); RCNN (VGG16); SPPNet; YOLOv4 + DA + TL | YOLOv4 + DA + TL has higher mAP than ResNet101, VGG16, RCNN (AlexNet), RCNN (VGG16), and SPPNet.
Nguyen [81] | KITTI test set; LSVH test set | Faster RCNN; SSD; MSCNN; YOLO; YOLOv2; improved Faster RCNN | The improved Faster RCNN has higher AP than the original Faster RCNN, SSD, MSCNN, YOLO, and YOLOv2 on the KITTI test set; MS-CNN has higher AP than the improved Faster RCNN on the LSVH test set.
Wang et al. [142] | DETRAC dataset | Faster RCNN; PN+FTN+Fusion; PN+FTN+Concant; PN+FTN+Fusion+Concant | PN+FTN+Fusion+Concant has higher overall mAP than Faster RCNN and PN+FTN+Fusion.
Nguyen [83] | KITTI benchmark; PASCAL VOC 07 | DPM; Fast RCNN; Faster RCNN; YOLOv2; Faster RCNN with FPN backbone; MS-CNN; improved Faster RCNN; SINet; multitask CNN; Faster RCNN with FPN + improved RPN + multilayer enhancement module + adaptive RoI pooling | Faster RCNN with FPN + improved RPN + multilayer enhancement module + adaptive RoI pooling has higher AP than DPM, Fast RCNN, Faster RCNN, YOLOv2, Faster RCNN with SPP, the improved Faster RCNN, SINet, and the multitask CNN on both datasets.
Kim et al. [143] | DETRAC test set | DPM; RCNN; ACF; Faster RCNN2; SA-FRCNN; NANO; CompACT; MSVD-SPP | MSVD-SPP has higher mAP than DPM, RCNN, ACF, Faster RCNN2, SA-FRCNN, NANO, and CompACT.
Wang et al. [144] | KITTI test set | YOLOv2; tiny YOLOv2; tiny YOLOv3; SPPNet-YOLOv3 | SPPNet-YOLOv3 has higher mean average precision than YOLOv2, tiny YOLOv2, and tiny YOLOv3.

Appendix A.2

Table A2. Summary of Various Algorithms and Datasets utilized in Vehicle Classification.

Reference | Dataset Used | Network(s) | Findings
Manugmai and Nuthong [127] | Own dataset (training: 686; testing: 228) | CNN; decision tree; random forest; DNN (densely connected) | The CNN architecture has higher classification accuracy than the DNN (densely connected), decision tree, and random forest.
Wang et al. [128] | Caltech256 dataset | VGG-s; VGG-verydeep-16; CS-CNN | CS-CNN has higher accuracy than VGG-s and VGG-verydeep-16.
Jahan et al. [130] | Own dataset (training: 2240; testing: 560) | YOLOv3; improved YOLOv3; Faster RER-CNN; modified YOLOv3 | The modified YOLOv3 has higher average precision than the improved YOLOv3, YOLOv3, and Faster RER-CNN.
Lee and Chung [131] | MIO-TCD dataset | AlexNet; ResNet18; GoogleNet; ensemble learning (AlexNet + ResNet18 + GoogleNet) | The ensemble of AlexNet, ResNet18, and GoogleNet has lower error rates than the benchmark models.
Liu et al. [132] | MIO-TCD dataset | ResNet50; ResNet50-BS; ResNet101; ResNet101-BS; ResNet152; ResNet152-BS; DCEM; DCEM-BS | DCEM-BS has higher precision than ResNet50, ResNet50-BS, ResNet101, ResNet101-BS, ResNet152, ResNet152-BS, and DCEM; ResNet152-BS has higher mean recall than ResNet50, ResNet50-BS, ResNet101, ResNet101-BS, ResNet152, and DCEM.
Liu et al. [80] | MIO-TCD dataset | ResNet50; ResNet101; ResNet152; Inceptionv4; Inceptionv3; GEM-OE; GEM-AP | GEM-AP has higher precision than the baseline networks and GEM-OE; GEM-OE has higher precision than the baseline architectures.
Jagannathan et al. [133] | MIO-TCD dataset; BIT-vehicle dataset | GAN-based deep ensemble approach; tiny YOLO with SVM; semi-supervised CNN; PCN with Softmax; TC-SF-CNNLS | The ensemble deep learning approach has higher recall than tiny YOLO with SVM, the semi-supervised CNN, PCN with Softmax, and TC-SF-CNNLS; TC-SF-CNNLS has higher recall than tiny YOLO with SVM, the semi-supervised CNN, and PCN with Softmax.

References
1. Szeliski, R. Computer Vision: Algorithms and Applications; Springer Nature: Berlin, Germany, 2022.
2. Hassaballah, M.; Hosny, K.M. Recent advances in computer vision. Stud. Comput. Intell. 2019, 804, 1–84.
3. Javaid, S.; Zeadally, S.; Fahim, H.; He, B. Medical Sensors and Their Integration in Wireless Body Area Networks for Pervasive
Healthcare Delivery: A Review. IEEE Sens. J. 2022, 22, 3860–3877. [CrossRef]
4. Berwo, M.A.; Fang, Y.; Mahmood, J.; Retta, E.A. Automotive engine cylinder head crack detection: Canny edge detection with
morphological dilation. In Proceedings of the 2021 Asia-Pacific Signal and Information Processing Association Annual Summit
and Conference (APSIPA ASC), Tokyo, Japan, 14–17 December 2021; pp. 1519–1527.
5. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20– 25 June 2005; Volume 1,
pp. 886–893.
6. Mita, T.; Kaneko, T.; Hori, O. Joint haar-like features for face detection. In Proceedings of the Tenth IEEE International Conference
on Computer Vision (ICCV’05), Beijing, China, 17–21 October 2005; Volume 2, pp. 1619–1626.
7. Zhang, G.; Huang, X.; Li, S.Z.; Wang, Y.; Wu, X. Boosting local binary pattern (LBP)-based face recognition. In Proceedings of the
Chinese Conference on Biometric Recognition, Guangzhou, China, 13–14 December 2004; Springer: Berlin/Heidelberg, Germany,
2004; pp. 179–186.
8. Javaid, S.; Saeed, N.; Qadir, Z.; Fahim, H.; He, B.; Song, H.; Bilal, M. Communication and Control in Collaborative UAVs: Recent
Advances and Future Trends. IEEE Trans. Intell. Transp. Syst. 2023, 1–21. [CrossRef]
9. Fahim, H.; Li, W.; Javaid, S.; Sadiq Fareed, M.M.; Ahmed, G.; Khattak, M.K. Fuzzy Logic and Bio-Inspired Firefly Algorithm
Based Routing Scheme in Intrabody Nanonetworks. Sensors 2019, 19, 5526. [CrossRef] [PubMed]
10. Javaid, S.; Fahim, H.; Zeadally, S.; He, B. Self-powered Sensors: Applications, Challenges, and Solutions. IEEE Sens. J. 2023, 1.
[CrossRef]
11. Wen, X.; Zheng, Y. An improved algorithm based on AdaBoost for vehicle recognition. In Proceedings of the 2nd International
Conference on Information Science and Engineering, Wuhan, China, 25–26 December 2010; pp. 981–984.
12. Broggi, A.; Cardarelli, E.; Cattani, S.; Medici, P.; Sabbatelli, M. Vehicle detection for autonomous parking using a soft-cascade
AdaBoost classifier. In Proceedings of the 2014 IEEE Intelligent Vehicles Symposium Proceedings, Ypsilanti, MI, USA, 8–11 June
2014; pp. 912–917.
13. Tang, Y.; Zhang, C.; Gu, R.; Li, P.; Yang, B. Vehicle detection and recognition for intelligent traffic surveillance system. Multimed.
Tools Appl. 2017, 76, 5817–5832. [CrossRef]
14. Ali, A.M.; Eltarhouni, W.I.; Bozed, K.A. On-Road Vehicle Detection using Support Vector Machine and Decision Tree Clas-
sifications. In Proceedings of the 6th International Conference on Engineering & MIS 2020, Istanbul, Turkey, 4–6 July 2020;
pp. 1–5.
15. Javaid, S.; Wu, Z.; Fahim, H.; Fareed, M.M.S.; Javed, F. Exploiting Temporal Correlation Mechanism for Designing Temperature-
Aware Energy-Efficient Routing Protocol for Intrabody Nanonetworks. IEEE Access 2020, 8, 75906–75924. [CrossRef]
16. Wei, Y.; Tian, Q.; Guo, J.; Huang, W.; Cao, J. Multi-vehicle detection algorithm through combining Harr and HOG features. Math.
Comput. Simul. 2019, 155, 130–145. [CrossRef]
17. Shobha, B.; Deepu, R. A review on video based vehicle detection, recognition and tracking. In Proceedings of the 2018 3rd
International Conference on Computational Systems and Information Technology for Sustainable Solutions (CSITSS), Bengaluru,
India, 20–22 December 2018; pp. 183–186.
18. Ren, H.; Li, Z.N. Object detection using generalization and efficiency balanced co-occurrence features. In Proceedings of the IEEE
International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 46–54.
19. Sun, Z.; Bebis, G.; Miller, R. On-road vehicle detection: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 694–711.
20. Ren, H. Boosted Object Detection Based on Local Features. Ph.D. Thesis, Applied Sciences, School of Computing Science,
Burnaby, BC, Canada, 2016.
21. Neumann, D.; Langner, T.; Ulbrich, F.; Spitta, D.; Goehring, D. Online vehicle detection using Haar-like, LBP and HOG feature
based image classifiers with stereo vision preselection. In Proceedings of the 2017 IEEE Intelligent Vehicles Symposium (IV), Los
Angeles, CA, USA, 11–14 June 2017; pp. 773–778.
22. Wang, Z.; Zhan, J.; Duan, C.; Guan, X.; Yang, K. Vehicle detection in severe weather based on pseudo-visual search and HOG–LBP
feature fusion. Proc. Inst. Mech. Eng. Part J. Automob. Eng. 2022, 7, 1607–1618. [CrossRef]
23. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014;
pp. 580–587.
24. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf.
Process. Syst. 2017, 60 , 84–90. [CrossRef]
25. Uijlings, J.R.; Van De Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective search for object recognition. Int. J. Comput. Vis. 2013,
104, 154–171. [CrossRef]
26. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December
2015; pp. 1440–1448.
27. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural
Inf. Process. Syst. 2015, 28 , 1137–1149. [CrossRef]
28. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with
convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June
2015; pp. 1–9.
29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
30. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
31. Dai, J.; Li, Y.; He, K.; Sun, J. R-fcn: Object detection via region-based fully convolutional networks. Adv. Neural Inf. Process. Syst.
2016, 29. Available online: https://proceedings.neurips.cc/paper_files/paper/2016/file/577ef1154f3240ad5b9b413aa7346a1e-
Paper.pdf (accessed on 25 April 2023).
32. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in
context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer:
Berlin/Heidelberg, Germany, 2014; pp. 740–755.
33. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J.
Comput. Vis. 2010, 88, 303–338. [CrossRef]
34. Pal, S.K.; Pramanik, A.; Maiti, J.; Mitra, P. Deep learning in multi-object detection and tracking: State of the art. Appl. Intell. 2021,
51, 6400–6429. [CrossRef]
35. Wang, H.; Yu, Y.; Cai, Y.; Chen, X.; Chen, L.; Liu, Q. A comparative study of state-of-the-art deep learning algorithms for vehicle
detection. IEEE Intell. Transp. Syst. Mag. 2019, 11, 82–95. [CrossRef]
36. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of
the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg,
Germany, 2016; pp. 21–37.
37. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International
Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
38. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788.
39. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
40. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
41. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
42. Wen, H.; Dai, F. A Study of YOLO Algorithm for Multi-target Detection. J. Adv. Artif. Life Robot. 2021, 2, 70–73.
43. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, inception-resnet and the impact of residual connections
on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA,
4–9 February 2017.
44. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
45. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings
of the International Conference on Machine Learning, PMLR, Lille, France, 6 July–1 July 2015; pp. 448–456.
46. Neubeck, A.; Van Gool, L. Efficient non-maximum suppression. In Proceedings of the 18th International Conference on Pattern
Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; Volume 3, pp. 850–855.
47. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636.
48. Yang, G.; Feng, W.; Jin, J.; Lei, Q.; Li, X.; Gui, G.; Wang, W. Face mask recognition system with YOLOV5 based on image
recognition. In Proceedings of the 2020 IEEE 6th International Conference on Computer and Communications (ICCC), Chengdu,
China, 11–14 December 2020; pp. 1398–1404.
49. Javaid, S.; Wu, Z.; Hamid, Z.; Zeadally, S.; Fahim, H. Temperature-aware routing protocol for Intrabody Nanonetworks. J. Netw.
Comput. Appl. 2021, 183–184, 103057. [CrossRef]
50. Song, X.; Gu, W. Multi-objective real-time vehicle detection method based on yolov5. In Proceedings of the 2021 International
Symposium on Artificial Intelligence and its Application on Media (ISAIAM), Xi’an, China, 21–23 May 2021; pp. 142–145.
51. Snegireva, D.; Kataev, G. Vehicle Classification Application on Video Using Yolov5 Architecture. In Proceedings of the 2021
International Russian Automation Conference (RusAutoCon), Sochi, Russia, 5–11 September 2021; pp. 1008–1013.
52. Berwo, M.A.; Wang, Z.; Fang, Y.; Mahmood, J.; Yang, N. Off-road Quad-Bike Detection Using CNN Models. In Proceedings of
the Journal of Physics: Conference Series, Nanjing, China, 25-27 November 2022; IOP Publishing: Bristol, UK, 2022; Volume 2356,
p. 012026.
53. Jin, X.; Li, Z.; Yang, H. Pedestrian Detection with YOLOv5 in Autonomous Driving Scenario. In Proceedings of the 2021 5th CAA
International Conference on Vehicular Control and Intelligence (CVCI), Tianjin, China, 29–31 October 2021; pp. 1–5.
54. Li, Y.; He, X. COVID-19 Detection in Chest Radiograph Based on YOLO v5. In Proceedings of the 2021 IEEE International
Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology (CEI), Fuzhou, China,
24–26 September 2021; pp. 344–347.
55. Berwo, M.A.; Fang, Y.; Mahmood, J.; Yang, N.; Liu, Z.; Li, Y. FAECCD-CNet: Fast Automotive Engine Components Crack
Detection and Classification Using ConvNet on Images. Appl. Sci. 2022, 12, 9713. [CrossRef]
56. Kausar, A.; Jamil, A.; Nida, N.; Yousaf, M.H. Two-wheeled vehicle detection using two-step and single-step deep learning models.
Arab. J. Sci. Eng. 2020, 45, 10755–10773. [CrossRef]
57. Vasavi, S.; Priyadarshini, N.K.; Harshavaradhan, K. Invariant feature-based darknet architecture for moving object classification.
IEEE Sens. J. 2020, 21, 11417–11426. [CrossRef]
58. Li, Q.; Garg, S.; Nie, J.; Li, X.; Liu, R.W.; Cao, Z.; Hossain, M.S. A highly efficient vehicle taillight detection approach based on
deep learning. IEEE Trans. Intell. Transp. Syst. 2020, 22, 4716–4726. [CrossRef]
59. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceedings of the
2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361.
60. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The kitti dataset. Int. J. Robot. Res. 2013, 32, 1231–1237.
[CrossRef]
61. Alvarez, J.M.; Gevers, T.; LeCun, Y.; Lopez, A.M. Road scene segmentation from a single image. In Proceedings of the European
Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 376–389.
62. Ros, G.; Alvarez, J.M. Unsupervised image transformation for outdoor semantic labelling. In Proceedings of the 2015 IEEE
Intelligent Vehicles Symposium (IV), Seoul, Republic of Korea, 28 June–1 July 2015; pp. 537–542.
63. Zhang, R.; Candra, S.A.; Vetter, K.; Zakhor, A. Sensor fusion for semantic segmentation of urban scenes. In Proceedings of the
2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 1850–1857.
64. Ros, G.; Ramos, S.; Granados, M.; Bakhtiary, A.; Vazquez, D.; Lopez, A.M. Vision-based offline-online perception paradigm for
autonomous driving. In Proceedings of the 2015 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI,
USA, 5–9 January 2015; pp. 231–238.
65. Krause, J.; Stark, M.; Deng, J.; Fei-Fei, L. 3D Object Representations for Finet-Grained Categorization. In Proceedings of the 4th
International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 8 December 2013.
66. Espinosa, J.E.; Velastin, S.A.; Branch, J.W. Motorcycle detection and classification in urban Scenarios using a model based on
Faster R-CNN. arXiv 2018, arXiv:1808.02299.
67. Zhu, Z.; Liang, D.; Zhang, S.; Huang, X.; Li, B.; Hu, S. Traffic-sign detection and classification in the wild. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2110–2118.
68. Li, X.; Flohr, F.; Yang, Y.; Xiong, H.; Braun, M.; Pan, S.; Li, K.; Gavrila, D.M. A new benchmark for vision-based cyclist detection.
In Proceedings of the 2016 IEEE Intelligent Vehicles Symposium (IV), Gotenburg, Sweden, 19–22 June 2016; pp. 1028–1033.
69. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset
for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
70. Guerrero-Gómez-Olmedo, R.; López-Sastre, R.J.; Maldonado-Bascón, S.; Fernández-Caballero, A. Vehicle tracking by simultaneous
detection and viewpoint estimation. In Proceedings of the International Work-Conference on the Interplay Between Natural and
Artificial Computation, Mallorca, Spain, 10–14 June 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 306–316.
71. Luo, Z.; Branchaud-Charron, F.; Lemaire, C.; Konrad, J.; Li, S.; Mishra, A.; Achkar, A.; Eichel, J.; Jodoin, P.M. MIO-TCD: A new
benchmark dataset for vehicle classification and localization. IEEE Trans. Image Process. 2018, 27, 5129–5141.
72. Wen, L.; Du, D.; Cai, Z.; Lei, Z.; Chang, M.C.; Qi, H.; Lim, J.; Yang, M.H.; Lyu, S. UA-DETRAC: A new benchmark and protocol
for multi-object detection and tracking. Comput. Vis. Image Underst. 2020, 193, 102907. [CrossRef]
73. Hu, X.; Xu, X.; Xiao, Y.; Chen, H.; He, S.; Qin, J.; Heng, P.A. SINet: A scale-insensitive convolutional neural network for fast
vehicle detection. IEEE Trans. Intell. Transp. Syst. 2018, 20, 1010–1019. [CrossRef]
74. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of
the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
75. Li, F.F.; Fergus, R.; Perona, P. One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 594–611.
76. Griffin, G.; Holub, A.; Perona, P. Caltech-256 object category dataset. 2007. Available online: https://authors.library.caltech.edu/
7694/?ref=https://githubhelp.com (accessed on 25 April 2023).
77. Kenk, M.A.; Hassaballah, M. DAWN: Vehicle detection in adverse weather nature dataset. arXiv 2020, arXiv:2008.05402.
78. Zuraimi, M.A.B.; Zaman, F.H.K. Vehicle Detection and Tracking using YOLO and DeepSORT. In Proceedings of the 2021 IEEE
11th IEEE Symposium on Computer Applications & Industrial Electronics (ISCAIE), Penang, Malaysia, 3–4 April 2021; pp. 23–29.
79. Xu, B.; Wang, B.; Gu, Y. Vehicle detection in aerial images using modified yolo. In Proceedings of the 2019 IEEE 19th International
Conference on Communication Technology (ICCT), Xi’an, China, 16–19 October 2019; pp. 1669–1672.
80. Liu, W.; Liao, S.; Hu, W.; Liang, X.; Zhang, Y. Improving tiny vehicle detection in complex scenes. In Proceedings of the 2018
IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 July 2018; pp. 1–6.
81. Nguyen, H. Improving faster R-CNN framework for fast vehicle detection. Math. Probl. Eng. 2019, 2019, 3808064. [CrossRef]
82. Dai, X. HybridNet: A fast vehicle detection system for autonomous driving. Signal Process. Image Commun. 2019, 70, 79–88.
[CrossRef]
83. Nguyen, H. Multiscale Feature Learning Based on Enhanced Feature Pyramid for Vehicle Detection. Complexity 2021,
2021, 5555121. [CrossRef]
84. Fan, Q.; Brown, L.; Smith, J. A closer look at Faster R-CNN for vehicle detection. In Proceedings of the 2016 IEEE intelligent
vehicles symposium (IV), Gotenburg, Sweden, 19–22 June 2016; pp. 124–129.
85. Liu, P.; Zhang, G.; Wang, B.; Xu, H.; Liang, X.; Jiang, Y.; Li, Z. Loss function discovery for object detection via convergence-
simulation driven search. arXiv 2021, arXiv:2102.04700.
86. Muthukumar, V.; Narang, A.; Subramanian, V.; Belkin, M.; Hsu, D.; Sahai, A. Classification vs regression in overparameterized
regimes: Does the loss function matter? J. Mach. Learn. Res. 2021, 22, 1–69.
87. Jiang, B.; Luo, R.; Mao, J.; Xiao, T.; Jiang, Y. Acquisition of localization confidence for accurate object detection. In Proceedings of
the European conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 784–799.
88. Sun, R. Optimization for deep learning: Theory and algorithms. arXiv 2019, arXiv:1912.08957.
89. Li, P. Optimization Algorithms for Deep Learning; Department of Systems Engineering and Engineering Management, The Chinese
University of Hong Kong: Hong Kong, 2017.
90. Soydaner, D. A comparison of optimization algorithms for deep learning. Int. J. Pattern Recognit. Artif. Intell. 2020, 34, 2052013.
[CrossRef]
91. Darken, C.; Chang, J.; Moody, J. Learning rate schedules for faster stochastic gradient search. In Proceedings of the Neural
Networks for Signal Processing, Citeseer, 1992; Volume 2. Available online: https://citeseerx.ist.psu.edu/document?repid=rep1
&type=pdf&doi=9db554243d7588589569aea127d676c9644d069a (accessed on 25 April 2023).
92. Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady AN USSR
1983, 269, 543–547.
93. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res.
2011, 12, 2121–2159.
94. Zeiler, M.D. Adadelta: An adaptive learning rate method. arXiv 2012, arXiv:1212.5701.
95. Tieleman, T.; Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA
Neural Netw. Mach. Learn. 2012, 4, 26–31.
96. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
97. Dean, J.; Corrado, G.; Monga, R.; Chen, K.; Devin, M.; Mao, M.; Ranzato, M.; Senior, A.; Tucker, P.; Yang, K.; et al. Large scale
distributed deep networks. Adv. Neural Inf. Process. Syst. 2012, 25. Available online: https://proceedings.neurips.cc/paper_files/
paper/2012/file/6aca97005c68f1206823815f66102863-Paper.pdf (accessed on 25 April 2023).
98. Mukkamala, M.C.; Hein, M. Variants of RMSProp and Adagrad with logarithmic regret bounds. In Proceedings of the International
Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 2545–2553.
99. Zaheer, R.; Shaziya, H. A study of the optimization algorithms in deep learning. In Proceedings of the 2019 Third International
Conference on Inventive Systems and Control (ICISC), Coimbatore, India, 10–11 January 2019; pp. 536–539.
100. Javaid, S.; Wu, Z.; Fahim, H.; Mabrouk, I.B.; Al-Hasan, M.; Rasheed, M.B. Feedforward Neural Network-Based Data Aggregation
Scheme for Intrabody Area Nanonetworks. IEEE Syst. J. 2022, 16, 1796–1807. [CrossRef]
101. Zou, Z.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. arXiv 2019, arXiv:1905.05055.
102. Viola, P.; Jones, M. Rapid Object Detection using a Boosted Cascade of Simple Features. In Proceedings of the 2001 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition, CVPR, Kauai, HI, USA, 8–14 December 2001.
103. Haselhoff, A.; Kummert, A. A vehicle detection system based on Haar and triangle features. In Proceedings of the 2009 IEEE
Intelligent Vehicles Symposium, Xi’an, China, 3–5 June 2009; pp. 261–266.
104. Kim, K.J.; Kim, P.K.; Chung, Y.S.; Choi, D.H. Multi-scale detector for accurate vehicle detection in traffic surveillance data. IEEE
Access 2019, 7, 78311–78319. [CrossRef]
105. Chen, W.; Qiao, Y.; Li, Y. Inception-SSD: An improved single shot detector for vehicle detection. J. Ambient. Intell. Humaniz.
Comput. 2020, 13, 5047–5053. [CrossRef]
106. Zhao, M.; Zhong, Y.; Sun, D.; Chen, Y. Accurate and efficient vehicle detection framework based on SSD algorithm. IET Image
Process. 2021, 15, 3094–3104. [CrossRef]
107. Zhang, L.; Wang, H.; Wang, X.; Chen, S.; Wang, H.; Zheng, K. Vehicle object detection based on improved retinanet. In
Proceedings of the Journal of Physics: Conference Series, Nanchang, China, 26–28 October 2021; IOP Publishing: Bristol, UK,
2021; Volume 1757, p. 012070.
108. Wang, X.; Cheng, P.; Liu, X.; Uzochukwu, B. Focal loss dense detector for vehicle surveillance. In Proceedings of the 2018
International Conference on Intelligent Systems and Computer Vision (ISCV), Fez, Morocco, 2–4 April 2018; pp. 1–5.
109. Luo, J.Q.; Fang, H.S.; Shao, F.M.; Zhong, Y.; Hua, X. Multi-scale traffic vehicle detection based on Faster R-CNN with NAS
optimization and feature enrichment. Def. Technol. 2021, 17, 1542–1554. [CrossRef]
110. Arora, N.; Kumar, Y.; Karkra, R.; Kumar, M. Automatic vehicle detection system in different environment conditions using fast
R-CNN. Multimed. Tools Appl. 2022, 81, 18715–18735. [CrossRef]
111. Charouh, Z.; Ezzouhri, A.; Ghogho, M.; Guennoun, Z. A resource-efficient CNN-based method for moving vehicle detection.
Sensors 2022, 22, 1193. [CrossRef] [PubMed]
112. Rajput, S.K.; Patni, J.C.; Alshamrani, S.S.; Chaudhari, V.; Dumka, A.; Singh, R.; Rashid, M.; Gehlot, A.; AlGhamdi, A.S. Automatic
Vehicle Identification and Classification Model Using the YOLOv3 Algorithm for a Toll Management System. Sustainability 2022,
14, 9163. [CrossRef]
113. Amrouche, A.; Bentrcia, Y.; Abed, A.; Hezil, N. Vehicle Detection and Tracking in Real-time using YOLOv4-tiny. In Proceedings
of the 2022 7th International Conference on Image and Signal Processing and their Applications (ISPA), Mostaganem, Algeria,
8–9 May 2022; pp. 1–5.
114. Wang, Q.; Xu, N.; Huang, B.; Wang, G. Part-Aware Refinement Network for Occlusion Vehicle Detection. Electronics 2022, 11, 1375.
[CrossRef]
115. Farid, A.; Hussain, F.; Khan, K.; Shahzad, M.; Khan, U.; Mahmood, Z. A Fast and Accurate Real-Time Vehicle Detection Method
Using Deep Learning for Unconstrained Environments. Appl. Sci. 2023, 13, 3059. [CrossRef]
116. Huang, F.; Chen, S.; Wang, Q.; Chen, Y.; Zhang, D. Using deep learning in an embedded system for real-time target detection
based on images from an unmanned aerial vehicle: Vehicle detection as a case study. Int. J. Digit. Earth 2023, 16, 910–936.
[CrossRef]
117. Qiu, Z.; Bai, H.; Chen, T. Special Vehicle Detection from UAV Perspective via YOLO-GNS Based Deep Learning Network. Drones
2023, 7, 117. [CrossRef]
118. Zhang, Y.; Sun, Y.; Wang, Z.; Jiang, Y. YOLOv7-RAR for Urban Vehicle Detection. Sensors 2023, 23, 1801. [CrossRef]
119. Mittal, U.; Chawla, P.; Tiwari, R. EnsembleNet: A hybrid approach for vehicle detection and estimation of traffic density based on
faster R-CNN and YOLO models. Neural Comput. Appl. 2023, 35, 4755–4774. [CrossRef]
120. Gupte, S.; Masoud, O.; Martin, R.F.; Papanikolopoulos, N.P. Detection and classification of vehicles. IEEE Trans. Intell. Transp.
Syst. 2002, 3, 37–47. [CrossRef]
121. Petrovic, V.S.; Cootes, T.F. Analysis of Features for Rigid Structure Vehicle Type Recognition. In Proceedings of the BMVC,
Kingston, UK, 7–9 September 2004; Kingston University: London, UK, 2004; Volume 2, pp. 587–596.
122. Psyllos, A.; Anagnostopoulos, C.N.; Kayafas, E. Vehicle model recognition from frontal view image measurements. Comput.
Stand. Interfaces 2011, 33, 142–151. [CrossRef]
123. Peng, Y.; Jin, J.S.; Luo, S.; Xu, M.; Au, S.; Zhang, Z.; Cui, Y. Vehicle type classification using data mining techniques. In The Era of
Interactive Media; Springer: Berlin/Heidelberg, Germany, 2013; pp. 325–335.
124. Dong, Z.; Wu, Y.; Pei, M.; Jia, Y. Vehicle type classification using a semisupervised convolutional neural network. IEEE Trans.
Intell. Transp. Syst. 2015, 16, 2247–2256. [CrossRef]
125. Awang, S.; Azmi, N.M.A.N.; Rahman, M.A. Vehicle type classification using an enhanced sparse-filtered convolutional neural
network with layer-skipping strategy. IEEE Access 2020, 8, 14265–14277. [CrossRef]
126. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al.
ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [CrossRef]
127. Maungmai, W.; Nuthong, C. Vehicle classification with deep learning. In Proceedings of the 2019 IEEE 4th International
Conference on Computer and Communication Systems (ICCCS), Singapore, 23–25 February 2019; pp. 294–298.
128. Wang, K.C.; Pranata, Y.D.; Wang, J.C. Automatic vehicle classification using center strengthened convolutional neural network.
In Proceedings of the 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA
ASC), Kuala Lumpur, Malaysia, 12–15 December 2017; pp. 1075–1078.
129. Fahim, H.; Javaid, S.; Li, W.; Mabrouk, I.B.; Hasan, M.A.; Rasheed, M.B.B. An Efficient Routing Scheme for Intrabody
Nanonetworks Using Artificial Bee Colony Algorithm. IEEE Access 2020, 8, 98946–98957. [CrossRef]
130. Jahan, N.; Islam, S.; Foysal, M.F.A. Real-Time Vehicle Classification Using CNN. In Proceedings of the 2020 11th International
Conference on Computing, Communication and Networking Technologies (ICCCNT), Kharagpur, India, 1–3 July 2020; pp. 1–6.
131. Lee, J.T.; Chung, Y. Deep learning-based vehicle classification using an ensemble of local expert and global networks. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July
2017; pp. 47–52.
132. Liu, W.; Zhang, M.; Luo, Z.; Cai, Y. An ensemble deep learning method for vehicle type classification on visual traffic surveillance
sensors. IEEE Access 2017, 5, 24417–24425. [CrossRef]
133. Jagannathan, P.; Rajkumar, S.; Frnda, J.; Divakarachari, P.B.; Subramani, P. Moving vehicle detection and classification using
Gaussian mixture model and ensemble deep learning technique. Wirel. Commun. Mob. Comput. 2021, 2021, 5590894. [CrossRef]
134. Chen, W.; Chen, X.; Zhang, J.; Huang, K. A multi-task deep network for person re-identification. In Proceedings of the AAAI
Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
135. Liu, A.A.; Su, Y.T.; Nie, W.Z.; Kankanhalli, M. Hierarchical clustering multi-task learning for joint human action grouping and
recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 102–114. [CrossRef]
136. Cai, Z.; Fan, Q.; Feris, R.S.; Vasconcelos, N. A unified multi-scale deep convolutional neural network for fast object detection.
In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer:
Berlin/Heidelberg, Germany, 2016; pp. 354–370.
137. Kanacı, A.; Li, M.; Gong, S.; Rajamanoharan, G. Multi-task mutual learning for vehicle re-identification. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
138. Phillips, J.; Martinez, J.; Bârsan, I.A.; Casas, S.; Sadat, A.; Urtasun, R. Deep multi-task learning for joint localization, perception,
and prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June
2021; pp. 4679–4689.
139. Sang, J.; Wu, Z.; Guo, P.; Hu, H.; Xiang, H.; Zhang, Q.; Cai, B. An improved YOLOv2 for vehicle detection. Sensors 2018, 18, 4272.
[CrossRef] [PubMed]
140. Mansour, A.; Hassan, A.; Hussein, W.M.; Said, E. Automated vehicle detection in satellite images using deep learning. In
Proceedings of the International Conference on Aerospace Sciences and Aviation Technology, Cairo, Egypt, 9–11 April 2019; The
Military Technical College: Cairo, Egypt, 2019; Volume 18, pp. 1–8.
141. Sowmya, V.; Radha, R. Heavy-Vehicle Detection Based on YOLOv4 featuring Data Augmentation and Transfer-Learning
Techniques. In Proceedings of the Journal of Physics: Conference Series, Nanchang, China, 26–28 October 2021; IOP Publishing:
Bristol, UK, 2021; Volume 1911, p. 012029.
142. Wang, L.; Lu, Y.; Wang, H.; Zheng, Y.; Ye, H.; Xue, X. Evolving boxes for fast vehicle detection. In Proceedings of the 2017 IEEE
International Conference on Multimedia and Expo (ICME), Hong Kong, China, 10–14 July 2017; pp. 1135–1140.
143. Kim, K.J.; Kim, P.K.; Chung, Y.S.; Choi, D.H. Performance enhancement of YOLOv3 by adding prediction layers with spatial
pyramid pooling for vehicle detection. In Proceedings of the 2018 15th IEEE International Conference on Advanced Video and
Signal Based Surveillance (AVSS), Auckland, New Zealand, 27–30 November 2018; pp. 1–6.
144. Wang, X.; Wang, S.; Cao, J.; Wang, Y. Data-driven based tiny-YOLOv3 method for front vehicle detection inducing SPP-net. IEEE
Access 2020, 8, 110227–110236. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.