Smart Cities
Article
A Comparative Analysis of Multi-Label Deep Learning
Classifiers for Real-Time Vehicle Detection to Support
Intelligent Transportation Systems
Danesh Shokri 1,2, *, Christian Larouche 1,2 and Saeid Homayouni 3
1 Département des Sciences Géomatiques, Université Laval, Québec, QC G1V 0A6, Canada;
christian.larouche@scg.ulaval.ca
2 Centre de Recherche en Données et Intelligence Géospatiales (CRDIG), Université Laval,
Québec, QC G1V 0A6, Canada
3 Centre Eau Terre Environnement, Institut National de la Recherche Scientifique,
Québec, QC G1V 0A6, Canada; saeid.homayouni@inrs.ca
* Correspondence: danesh.shokri.1@ulaval.ca
Abstract: An Intelligent Transportation System (ITS) is a vital component of smart cities due to
the growing number of vehicles year after year. In the last decade, vehicle detection, as a primary
component of ITS, has attracted scientific attention because, by knowing vehicle information (i.e., type,
size, number, location, speed, etc.), the ITS parameters can be acquired. This has led to developing
and deploying numerous deep learning algorithms for vehicle detection. Single Shot Detector (SSD),
Region Convolutional Neural Network (RCNN), and You Only Look Once (YOLO) are three popular
deep structures for object detection, including vehicles. This study evaluated these methodologies on
nine highly challenging datasets to assess their performance in diverse environments. Overall, the YOLO
versions had the best performance in detecting and localizing vehicles compared to SSD and RCNN.
Among the YOLO versions (YOLOv8, v7, v6, and v5), YOLOv7 showed better detection and classification (car, truck, bus) performance, but a slower computation time. The YOLO versions achieved more than 95% accuracy in detection and 90% Overall Accuracy (OA) in the classification of vehicles into cars, trucks and buses. The computation time on a CPU processor was between 150 milliseconds (YOLOv8, v6, and v5) and around 800 milliseconds (YOLOv7).

Keywords: Intelligent Transportation System (ITS); road traffic surveillance; vehicle detection and localization; deep neural network structures; highway cameras; smart cities

Citation: Shokri, D.; Larouche, C.; Homayouni, S. A Comparative Analysis of Multi-Label Deep Learning Classifiers for Real-Time Vehicle Detection to Support Intelligent Transportation Systems. Smart Cities 2023, 6, 2982–3004. https://doi.org/10.3390/smartcities6050134
2023). Consequently, individuals face a higher risk of being badly injured in a crash, which
raises problems such as harm to mental and physical health and a heavier financial burden.
Recently, the objective of obtaining an efficient ITS has moved closer to reality due to advancements
in data transmission enabled by fifth Generation (5G) wireless technology [8,9]. This technology
transmits information at speeds of 15 to 20 Gbps (gigabits per second), with a latency roughly
10 times lower than that of 4G [10]. In addition, advancements in cloud
computing systems and Graphics Processing Units (GPUs) have attracted researchers' attention and
enable them to monitor urban dynamics in real time [11]. Therefore, these developments
in hardware and data transmission provide an opportunity to enhance ITS structures and
implement state-of-the-art methodologies such as deep learning neural networks for vehicle
detection [12]. Vehicle detection is a prominent stage of ITS construction [1,13]. Deep
learning algorithms have shown very impressive performances in object detection and
classification using a great variety of resources, such as radiometric images [14,15], Light
Detection and Ranging (LiDAR) point clouds [16,17], and one-dimensional signals [18].
Face recognition, self-driving vehicles, and language translation are three popular applica-
tions of deep learning algorithms [19]. On the other hand, these algorithms are supervised
and need huge training datasets to detect objects.
Despite the challenges faced by deep learning algorithms, they offer a considerable benefit
for vehicle detection, the most important parameter of ITS [20]. Knowing vehicles'
locations enables the measurement of any relevant component of a smart city or of traffic
(i.e., density, traffic flow, speed). Consequently, various deep learning structures have
been proposed for vehicle detection, based on the four popular ITS methodologies of
cameras [21], LiDAR point clouds [22], wireless magnetics [23] and radar detectors [13].
In ITS, cameras have shown the most efficient and accurate performance due to their
ability to record contextual information, cover a larger area, and remain affordable; notably,
they are the only sensor that can perform license plate recognition when traffic violations occur [2,24].
Most importantly, deep learning structures work on two-dimensional camera images, and
do not require any transformation space between image sequences and deep learning
layers [25]. This adaptability between camera images and deep learning layers results
in the development of numerous algorithms for vehicle detection. Although several
morphological procedures, such as opening, have been suggested for vehicle detection,
their high sensitivity to illumination changes and weather conditions has made them
obsolete [24].
Generally, the deep learning algorithms that have been used most widely in vehicle
detection are (i) Single Shot MultiBox Detector (SSD) [26], (ii) Region-Based Convolutional
Neural Network (RCNN) [27], and lastly (iii), You Only Look Once (YOLO) [28]. Each of
these algorithms has its own benefits and drawbacks in object detection and localization.
For example, Faster RCNN, which is a branch of RCNN, shows a better performance in
the detection of small-scale objects, while it is not suitable for real-time object detection,
unlike SSD and YOLO (www.towardsdatascience.com, accessed on 12 September 2023).
Developers have released eight versions of YOLO (e.g., YOLOv1), four versions of RCNN
(i.e., RCNN, Mask, Fast, and Faster RCNN) and two SSD versions (i.e., SSD 512 and SSD
300) since 2015 [24,29,30]. For training these methodologies, various free-access benchmark
datasets such as COCO, ImageNet Large Scale Visual Recognition Challenge (ILSVRC), and
PASCAL VOC, comprising more than 300,000 images, were used. A great variety of objects,
ranging from vehicles to animals, have been covered in recognition applications. Since
these algorithms have shown their excellent performance in object detection, researchers
have used the acquired weights of these deep learning algorithms in other fields, such as
the detection of sidewalk cracks [31], fish [31], and weeds [32]. This has been referred to as
transfer learning.
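Where labeled data for a new domain are scarce, this reuse of pretrained weights can be sketched roughly as follows. The snippet is a minimal illustration only: it assumes the ultralytics Python package, and the dataset file my_vehicles.yaml is a hypothetical placeholder describing a custom set of annotated images, not part of any work cited here.

```python
# Minimal transfer-learning sketch (assumes the `ultralytics` package is installed).
# "my_vehicles.yaml" is a hypothetical dataset description file, used here only
# to illustrate fine-tuning a pretrained detector on a new domain.
from ultralytics import YOLO

# Start from COCO-pretrained weights instead of training from scratch.
model = YOLO("yolov8n.pt")

# Fine-tune the pretrained detector on the new domain (e.g., cracks, weeds, fish).
model.train(data="my_vehicles.yaml", epochs=50, imgsz=640)
```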
Kim, Sung [30] made a comparison between SSD, Faster RCNN, and YOLOv4 in
the context of detecting vehicles on road surfaces. Private cars, mini-vans, big vans, mini-
trucks, trucks and compact cars were set as the vehicle classes. The weights of deep learning
algorithms were adjusted by their training data. They concluded that the SSD had the
fastest performance, at 105 FPS (frames per second), while YOLOv4 reached the highest
classification accuracy, at around 98%. In a creative application, Li, Zhang [33] used SSD, RCNN
and YOLOv3 for transfer learning in the field of agricultural greenhouse detection. They
used high-resolution satellite images provided by Gaofen-2 with a spatial resolution of 2 m.
In this case, YOLOv3 achieved the highest performance in terms of both computational
time and acquired accuracy. Azimjonov and Özmen [34] suggested that if YOLO is
combined with machine learning classifiers such as a Support Vector Machine (SVM), the
accuracy of YOLO on highway video cameras would increase sharply from about 57% to
around 95%.
Similarly, Han, Chang [35] increased YOLO’s accuracy by adding low and high fea-
tures to the YOLO network. The new network was called O-YOLOv2 and was evaluated
via application to the KITTI dataset, achieving around 94% accuracy. The studies of [7,36,37]
used SSD and RCNN in multi-object recognition, including vehicles.
In summary, previous studies have tried to assess the abovementioned deep learning
algorithms in various fields, particularly vehicle recognition, but there are still significant
gaps in their application for transportation system purposes. This means that these deep
learning structures have rarely been applied to highway cameras in challenging situations
such as nighttime. Indeed, this was the main motivation of this study, because vehicles can
hardly be seen on nighttime images. Previous works did not consider occlusion situations,
wherein parts of vehicles were not recorded by cameras. As occlusion may frequently
occur on busy roads, the algorithms should be robust in this context. Also, huge amounts
of training data and cloud computing systems were used, which is neither time-efficient
nor affordable. We have shown that there is no need to use and collect training data in
order to improve the efficiency of the mentioned deep learning algorithms. In addition,
illumination changes are a main challenge that has been mostly ignored by previous studies.
Since ITS should be robust and usable 24 h a day, the algorithms should show acceptable
performance at any time of the day and night. Finally, previous studies have rarely tested
their methodologies, including YOLO, SSD and RCNN, on diverse weather conditions
such as snowy and rainy days.
This study evaluates the state-of-the-art YOLO, RCNN, and SSD methodologies by
application to image sequences captured by highway cameras to detect and classify vehicles.
The algorithms must be able to detect any vehicle’s state, whether this be shape, color,
or even size. Noticeably, video cameras also record information during both night and
daytime. Consequently, the key contribution of this study is in making a clear comparison
between SSD, RCNN, and YOLO when applied to highway cameras in the following ways:
• Providing the most challenging highway videos. To make an acceptable assessment,
they must cover various states of vehicles, such as occlusion, weather conditions (i.e.,
rainy), low- to high-quality video frames, and different resolutions and illuminations
(images collected during the day and at night). Also, the videos must be recorded
from diverse viewing angles with cameras installed on top of road infrastructures, in
order to determine the best locations. Section 2 covers this first contribution;
• Making a comprehensive comparison between the deep learning algorithms in terms
of acquiring accuracy in both vehicle detection and classification. The vehicles are
categorized into the three classes of car, truck, and bus. The computation time of the
algorithms is also assessed to determine which one presents a better potential usability
in real-time situations. Section 3 covers this second contribution.
Figure 1. Samples of highway cameras located in Quebec, Canada.
In order to perform a comprehensive assessment of the methodologies, the processed highway videos should cover as many road challenges as possible. Therefore, the following nine high-quality image datasets were downloaded from the YouTube and KAGGLE (www.kaggle.com, accessed on 12 September 2023) platforms (Figure 2). As can be seen in Figure 2, datasets IV and V include numerous vehicles, ranging from cars to buses, shown at night time. The main attribute of these two datasets is that the vehicles' headlights are on, and the vehicles are moving fast in both directions. In the other datasets, the cameras are relatively close to the cars on the roads, which results in the collection of more contextual information. Also, the angle of view of some of the cameras (i.e., dataset VII) is not perpendicular to the road infrastructure. This means that vehicles are recorded from a side view, and this increases the complexity of the environments considered. Shawon [38] released a video on KAGGLE, which is an online community platform for data scientists, in order to monitor traffic flow (dataset II). These cameras produce images with a resolution of 1364 × 768 pixels at a frequency of 25 FPS. This high-quality video includes various British vehicles ranging from commercial trucks to private cars. Another positive side of this video is that the vehicles therein move on two separate roads in opposite directions, meaning that the algorithms evaluate both front and rear vehicle views. This video also provides more of the contextual information of on-road vehicles, which may enable better vehicle detection and classification. Another British highway dataset was released by Shah [39], including frequent challenging traffic flow types (dataset III). Table 1 provides a summary of the information of the selected datasets in terms of pixel resolution, time of recording, etc. In Section 5.1, we provide precise and in-depth explanations about why these datasets have been chosen.
Figure 2. Views of the selected video datasets used for the evaluation of the deep learning structures.
Table 1. Specification attributes of the selected datasets.

Dataset | Day/Night | Frames | FPS | Height | Width | Rear/Front | Quality | Angle of View (AoV) | Link (accessed on 12 September 2023)
Dataset I | Both | 2250 | 15 | 352 | 240 | Both | Low | Vertical-Low-High | https://www.quebec511.info/fr/Carte/Default.aspx
Dataset II | Day | 1525 | 25 | 1364 | 768 | Both | Medium | Low | https://www.kaggle.com/datasets/shawon10/road-traffic-video-monitoring
Dataset III | Day | 250 | 10 | 320 | 240 | Rear | Low | Vertical | https://www.kaggle.com/datasets/aryashah2k/highway-traffic-videos-dataset
Dataset IV | Night | 61,840 | 30 | 1280 | 720 | Both | Very High | Vertical | https://www.youtube.com/watch?v=xEtM1I1Afhc
Dataset V | Night | 178,125 | 25 | 1280 | 720 | Both | Low | Vertical | https://www.youtube.com/watch?v=iA0Tgng9v9U
Dataset VI | Day | 62,727 | 30 | 854 | 480 | Rear | Medium | Vertical | https://youtu.be/QuUxHIVUoaY
Dataset VII | Day | 9180 | 30 | 1920 | 1080 | Front | Very High | Low | https://www.youtube.com/watch?v=MNn9qKG2UFI&t=7s
Dataset VIII | Day | 107,922 | 30 | 1280 | 720 | Front | High | High | https://youtu.be/TW3EH4cnFZo
Dataset IX | Day | 1525 | 25 | 1280 | 720 | Both | High | Vertical | https://www.youtube.com/watch?v=wqctLW0Hb_0&t=10s
3. Deep Learning Methodologies Applied to Vehicle Detection
The deep neural network structures generally comprise training data, region proposals, feature extraction, layer selection, and classifiers. Figure 3 presents an overview of a deep learning structure used for detecting and identifying a pattern or object in an image.
Figure 4. Training data in order to feed them into deep learning structures [42]; (a) bounding boxes around vehicles, (b) samples from the COCO dataset used as training data (www.coco.org, accessed on 12 September 2023).
Region Proposals
Traditional algorithms have sought to assess individual pixels from the inputted images for the sake of object detection and localization. This process was shown to be time-consuming due to the required analysis of thousands of pixels by deep layer networks. The state-of-the-art methodologies represent several solutions to finding the candidate pixels, instead of evaluating pixel-by-pixel. For instance, the two SSD and YOLO methods divide the input images into grids of the same length. This process reduces the computation time sharply, as it analyzes only a few cells instead of thousands of pixels. Assume image I has dimensions of 300 × 300, and the grid of SSD is 8 × 8. The computation process will be decreased from approximately 9 × 10^4 (300 × 300) to 64.
Feature Extraction
Convolutional layers (Conv) represent a popular feature extraction process because they can be used for calculating features without any human supervision. In addition, the convolutional layers prevent overfitting, which is a noticeable problem in machine learning algorithms, as they introduce flexibility in feature learning. A sample feature extraction model, VGG16, is shown in Figure 5a. Via trial and error, researchers have found the optimal convolution size, for example, 3 × 3 in SSD. Since these layers may feature negative values, an activation function is considered. Various activation functions (Equations (1)–(3)) have been proposed, and ReLU (Equation (1)) is an example thereof. The ReLU converts the negative values of the convolutional layers into zero. To reduce the huge volume of the convolution layers, the two steps of max pooling and stride are required, as shown in Figure 5b.

$$\text{ReLU:}\quad f(x) = \begin{cases} x & x \ge 0 \\ 0 & x < 0 \end{cases} \qquad (1)$$

$$\text{Sigmoid:}\quad \sigma(x) = \frac{1}{1 + e^{-x}} \qquad (2)$$

$$\text{Hyperbolic Tangent:}\quad \tanh(x) = \frac{\sinh x}{\cosh x} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (3)$$

where x is the value of the convolution layer's outputs.
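As a small numerical illustration of Equations (1)–(3), the following NumPy sketch evaluates the three activation functions on a vector of hypothetical convolution outputs; it is a generic example, not code taken from any of the detectors compared in this study.

```python
import numpy as np

def relu(x):      # Equation (1): negative convolution outputs are clipped to zero
    return np.maximum(x, 0.0)

def sigmoid(x):   # Equation (2)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):      # Equation (3): sinh(x) / cosh(x)
    return np.tanh(x)

x = np.array([-2.0, -0.5, 0.0, 1.5])   # hypothetical convolution outputs
print(relu(x))
print(sigmoid(x))
print(tanh(x))
```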
Figure 5. Feature extraction: (a) the VGG16 model used for feature extraction (www.towardsdatascience.com, accessed on 12 September 2023); (b) applying pooling and stride to a 4 × 4 image (www.geeksforgeeks.org, accessed on 12 September 2023).
Layer Selection and Classifier
After feature extraction and pooling, the remaining features are flattened and fed into the deep-layer neurons. For each neuron, a feature is assigned. Afterward, a fully connected (FC) neural network, which links a neuron to all the neurons in the adjacent layer, is considered. Next, a classifier, which can operate by machine learning, such as a Support Vector Machine (SVM) or a probabilistic model, is required to determine the object type. This classifier assigns a value of {0, 1} for each object of interest, whereby the image class has the highest score. To achieve the best performance in object detection and localization, a Back Propagation (BP) step is needed, which measures the weights between neurons and the loss functions. The most popular loss functions are maximum likelihood, cross entropy, and Mean Squared Error (MSE) [43]. The loss function determines the difference between the predicted value of an image and its actual class. When the loss function reaches the minimum rate of difference, it is said that the deep learning algorithm works properly; in any other case, the algorithm's structure should be changed and adjusted to ensure higher accuracy.
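As a quick numerical illustration of the cross-entropy and MSE losses mentioned above, the generic sketch below compares a predicted class-probability vector with its one-hot label; it is illustrative only and not the exact loss used by SSD, RCNN, or YOLO.

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    """Cross-entropy between a one-hot label and predicted class probabilities."""
    eps = 1e-12                               # avoid log(0)
    return float(-np.sum(y_true * np.log(y_pred + eps)))

def mse(y_true, y_pred):
    """Mean Squared Error between label and prediction."""
    return float(np.mean((y_true - y_pred) ** 2))

# One image, three classes (car, truck, bus); the true class is "car".
y_true = np.array([1.0, 0.0, 0.0])
y_pred = np.array([0.8, 0.15, 0.05])          # hypothetical predicted probabilities
print(cross_entropy(y_true, y_pred), mse(y_true, y_pred))
```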
After setting out how a deep structure works, Figure 6 illustrates the flowcharts of
SSD [26], RCNN [44], and YOLO [45] as the most popular vehicle detection methodologies.
Each of these procedures features certain stages in the vehicle localization process, described
as follows.
Figure 6. Deep learning structures of (a) SSD [26], (b) RCNN [44], and (c) YOLO [45].
3.1. Single Shot Multi-Box Detector (SSD)
This algorithm was trained and evaluated on the two large free-access datasets Pascal VOC (Pattern Analysis, Statistical Modeling, and Computational Learning—Visual Object Classes) and COCO, and gained an mAP (mean average precision) score of more than 0.74. SSD initially converts the inputted images, whether they comprise training or test
data, into a feature map (grid) with a size of m × n (generally, the grid has dimensions
of 8 × 8). Then, multiple boxes of different sizes are placed around each cell. The sizes
and directions of these boxes are known. This is why it is called a multibox detector
algorithm. Afterwards, features are measured with the help of VGG16 as the base network,
due to its exceptional performance in classification and possession of several auxiliary
convolutional layers. These features help in measuring multiple boxes’ scores between
grids, and collecting ground truth data for each SSD class (i.e., vehicle, pedestrians).
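To make the grid idea concrete, the sketch below splits a dummy 300 × 300 frame into an 8 × 8 grid of cells, reducing the number of candidate regions from roughly 9 × 10^4 pixels to 64 cells; it is an illustrative simplification of the feature-map grid described above, not the SSD implementation.

```python
import numpy as np

def split_into_grid(image: np.ndarray, grid: int = 8):
    """Split an image into grid x grid cells, mimicking the SSD/YOLO feature-map grid."""
    h, w = image.shape[:2]
    cell_h, cell_w = h // grid, w // grid
    return [image[r * cell_h:(r + 1) * cell_h, c * cell_w:(c + 1) * cell_w]
            for r in range(grid) for c in range(grid)]

frame = np.zeros((300, 300, 3), dtype=np.uint8)   # dummy 300 x 300 frame
cells = split_into_grid(frame, grid=8)
print(len(cells))   # 64 candidate cells instead of 300 * 300 = 90,000 individual pixels
```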
The following equations (Equations (4)–(6)) show the process of score calculation for both ground truth boxes (d) and estimated ones (l). Here, l refers to the predicted boxes around each cell. The ground truth boxes corresponding most closely with l are identified by a matching strategy. The parameter c is the class (i.e., vehicle, dog, cat, etc.), N is the number of boxes matched with l, and α is set equal to one by cross-validation. Each box, whether ground truth or estimated, has four parameters: {center of x (cx), center of y (cy), width (w) and height (h)}.

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right) \qquad (4)$$

$$L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\!\left(l_{i}^{m} - \hat{g}_{j}^{m}\right), \qquad
\hat{g}_{j}^{cx} = \frac{g_{j}^{cx} - d_{i}^{cx}}{d_{i}^{w}}, \quad
\hat{g}_{j}^{cy} = \frac{g_{j}^{cy} - d_{i}^{cy}}{d_{i}^{h}}, \quad
\hat{g}_{j}^{w} = \log\frac{g_{j}^{w}}{d_{i}^{w}}, \quad
\hat{g}_{j}^{h} = \log\frac{g_{j}^{h}}{d_{i}^{h}} \qquad (5)$$

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_{i}^{p}\right), \quad \text{where} \quad \hat{c}_{i}^{p} = \frac{\exp\left(c_{i}^{p}\right)}{\sum_{p}\exp\left(c_{i}^{p}\right)} \qquad (6)$$
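A minimal sketch of the offset encoding in Equation (5), under the assumption of one already-matched pair of default box d and ground truth box g, could look as follows; the box values are hypothetical and serve only to illustrate the ĝ terms.

```python
import math

def encode_offsets(g, d):
    """Encode a ground-truth box g against a matched default box d, as in Equation (5).
    Boxes are dicts with keys cx, cy, w, h (centre coordinates, width, height)."""
    return {
        "cx": (g["cx"] - d["cx"]) / d["w"],
        "cy": (g["cy"] - d["cy"]) / d["h"],
        "w": math.log(g["w"] / d["w"]),
        "h": math.log(g["h"] / d["h"]),
    }

# Hypothetical matched pair: default box d and ground-truth vehicle box g (in pixels).
d = {"cx": 150.0, "cy": 150.0, "w": 40.0, "h": 40.0}
g = {"cx": 158.0, "cy": 146.0, "w": 52.0, "h": 36.0}
print(encode_offsets(g, d))
```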
feature extraction step, with 53 convolutional layers, which is a deep neural network archi-
tecture commonly used for object detection and classification tasks. Improving both detection
accuracy and speed was the main aim of YOLOv4 [48]. It verified the effects
of state-of-the-art Bag-of-Freebies and Bag-of-Specials methods by use of COCO as the
training dataset. YOLOv5 has a lower volume, around 27 MB, in comparison with YOLOv4
(277 MB), both of which were released in 2020 [49]. YOLOv6 showed that if an anchor-free
procedure with VariFocal Loss (VFL) is used throughout the training steps, the algorithm
can run 51% faster than other anchor-based methods [50]. YOLOv7 mainly focused on
generating accurate bounding boxes for detecting objects more precisely [51]. Recognizing
objects quickly was the foremost goal of YOLOv8, which employed a cutting-edge SOTA
model (www.ultralytics.com, accessed on 12 September 2023). This study will evaluate the
four last versions of YOLO, i.e., 5, 6, 7, and 8, for use in vehicle detection from highway
videos, because these versions are robust in detecting small objects and have optimized
computation times.
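As a rough illustration of how such an off-the-shelf model can be applied to a single highway frame without additional training, the snippet below assumes the ultralytics package and a hypothetical frame.jpg (not part of the released datasets); the class indices 2, 5, and 7 are the COCO labels for car, bus, and truck.

```python
# Illustrative inference sketch (assumes the `ultralytics` package); "frame.jpg" is a
# hypothetical single highway frame.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")        # COCO-pretrained weights, no additional training
results = model("frame.jpg")      # run detection on one frame

# COCO class indices: 2 = car, 5 = bus, 7 = truck
for box in results[0].boxes:
    cls_id = int(box.cls)
    if cls_id in (2, 5, 7):
        print(model.names[cls_id], box.xyxy.tolist(), float(box.conf))
```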
4. Experimental Results
4.1. Accuracy Evaluation
This step provides numerical information on how many vehicles have been correctly
detected. The Precision, Recall, and F1 Score accuracies are the most common aspects of
algorithm evaluation [54]. The three parameters of True Positive (TP), False Positive (FP),
and False Negative (FN) are required to measure accuracy. TP indicates the number of
vehicles detected correctly by the algorithms, while FP shows the number of non-vehicles
falsely detected as vehicles. FN specifies the number of vehicles that have not been detected.
Precision, based on the equations below (Equations (7)–(9)), refers to what percentage of the
algorithm's output was actually vehicles, while Recall refers to how many of the vehicles in the
datasets were detected properly. F1 Score is a performance metric that balances
Precision and Recall.
In this stage, TP, FP, and FN are calculated for each individual frame, regardless
of whether a vehicle appears in various adjacent frames. Noticeably, as images often
include remote areas of a road wherein vehicles are rarely detected, a Region of Interest
(RoI) selection stage is needed. The vehicles in dataset II yield acceptable contextual
information, but the vehicles located in remote areas of dataset IV, for example, feature less
appropriate information. Therefore, ROI is a suitable tool that can be used to assess the
real performances of the deep learning algorithms in the context of vehicle detection and
classification; thus, the sections of the videos wherein vehicles represent less than 5% of the
frame size are not considered in the accuracy calculation because they offer little contextual
information. As shown in Table 2, which summarizes the acquired results, YOLOv7 showed
the best overall performance on nine datasets in terms of vehicle detection, with around
98% accuracy. The SSD and RCNN have not shown acceptable performances in the context
of vehicle detection, with about 58% and less than 2%, respectively.
$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} = \frac{TP}{TP + FP} \times 100 \qquad (7)$$

$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} = \frac{TP}{TP + FN} \times 100 \qquad (8)$$

$$\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (9)$$
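A direct translation of Equations (7)–(9) into code, with hypothetical TP/FP/FN counts, might look like this (Precision and Recall are already expressed as percentages here):

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, Recall, and F1 score in percent, following Equations (7)-(9)."""
    precision = tp / (tp + fp) * 100
    recall = tp / (tp + fn) * 100
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 950 correctly detected vehicles, 20 false alarms, 30 missed vehicles.
print(detection_metrics(tp=950, fp=20, fn=30))
```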
Table 2. Results acquired by the deep learning structures applied to nine datasets.
user-friendly. In total, 200 vehicles were randomly selected, and we observed that the
YOLO versions achieved the best performance when used for localization estimation, with
RMSE values lower than 30 pixels. This value was more than 500 pixels for Faster RCNN.
Between the YOLO versions, YOLOv8 showed a weaker performance in localization esti-
mation. In Section 5.3, we clearly compare the acquired and estimated localization
accuracies between the methods.
$$RMSE = \sum_{i=1}^{n} \sqrt{\left(G_{cx} - P_{cx}\right)_{i}^{2} + \left(G_{cy} - P_{cy}\right)_{i}^{2} + \left(G_{w} - P_{w}\right)_{i}^{2} + \left(G_{h} - P_{h}\right)_{i}^{2}} \qquad (10)$$
where n is the number of bounding boxes considered in calculating the different localization
accuracies between the estimations of the deep learning algorithms (P) and the ground
truth (G). Gcx and Gcy are, respectively, the centers of the bounding box on the x-axis and
y-axis, and Gw , and Gh are the width and height of the ground truth bounding boxes
(G). This is also true for the bounding boxes estimated by the algorithms. Pcx and Pcy are,
respectively, the centers of the bounding boxes on the x-axis and y-axis, and Pw and Ph are
the width and height of the estimated bounding boxes (P).
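Under these definitions, a minimal sketch of the localization error of Equation (10) for matched ground-truth and predicted boxes is given below; the box values are hypothetical.

```python
import math

def localization_error(gt_boxes, pred_boxes):
    """Summed per-box localization error, following Equation (10).
    Each box is (cx, cy, w, h) in pixels; the two lists contain matched pairs."""
    total = 0.0
    for (gcx, gcy, gw, gh), (pcx, pcy, pw, ph) in zip(gt_boxes, pred_boxes):
        total += math.sqrt((gcx - pcx) ** 2 + (gcy - pcy) ** 2
                           + (gw - pw) ** 2 + (gh - ph) ** 2)
    return total

# Hypothetical matched ground-truth / predicted boxes for two vehicles.
gt = [(120, 80, 60, 40), (300, 210, 90, 55)]
pred = [(123, 78, 58, 43), (305, 214, 84, 50)]
print(localization_error(gt, pred))
```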
Figure 7. The computation times of YOLO versions and Faster RCNN over 1000 frames.
$$CE_{C} = \frac{P_{CT} + P_{CB}}{P_{CC} + P_{CT} + P_{CB}} \times 100 \qquad (12)$$

$$CE_{T} = \frac{P_{TC} + P_{TB}}{P_{TC} + P_{TT} + P_{TB}} \times 100 \qquad (13)$$

$$CE_{B} = \frac{P_{BC} + P_{BT}}{P_{BC} + P_{BT} + P_{BB}} \times 100 \qquad (14)$$

$$OE_{C} = \frac{P_{TC} + P_{BC}}{P_{CC} + P_{TC} + P_{BC}} \times 100 \qquad (15)$$

$$OE_{T} = \frac{P_{CT} + P_{BT}}{P_{CT} + P_{TT} + P_{BT}} \times 100 \qquad (16)$$

$$OE_{B} = \frac{P_{CB} + P_{TB}}{P_{CB} + P_{TB} + P_{BB}} \times 100 \qquad (17)$$
                              Actual
                     Car      Truck     Bus      Commission Error
Predicted   Car      PCC      PCT       PCB      CEC
            Truck    PTC      PTT       PTB      CET
            Bus      PBC      PBT       PBB      CEB
Omission Error       OEC      OET       OEB      Overall Accuracy
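A small sketch of how the commission and omission errors of Equations (12)–(17) follow from such a confusion matrix is given below; the counts are hypothetical, and the Overall Accuracy is assumed here to be the standard ratio of correctly classified vehicles to all classified vehicles.

```python
import numpy as np

def confusion_summary(cm):
    """Commission/omission errors (%) and Overall Accuracy (%) for a 3 x 3 confusion
    matrix with rows = predicted class and columns = actual class (car, truck, bus),
    following Equations (12)-(17)."""
    cm = np.asarray(cm, dtype=float)
    commission = (cm.sum(axis=1) - np.diag(cm)) / cm.sum(axis=1) * 100   # per predicted class
    omission = (cm.sum(axis=0) - np.diag(cm)) / cm.sum(axis=0) * 100     # per actual class
    overall_accuracy = np.trace(cm) / cm.sum() * 100                     # assumed standard OA
    return commission, omission, overall_accuracy

# Hypothetical counts for one dataset.
cm = [[480, 12, 3],    # predicted car
      [15, 160, 8],    # predicted truck
      [2, 6, 70]]      # predicted bus
print(confusion_summary(cm))
```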
Table 4. Confusion matrix of YOLO versions used for the evaluation of vehicle classification. A grey
color is used to highlight the OA.
5. Discussion
5.1. Datasets Challenges and Advantages
This section addresses the challenges the selected video datasets met in covering an
important portion of the possible scenarios that could arise in traffic flow monitoring. These
various and representative datasets can be used to evaluate cutting-edge deep learning
vehicle detection algorithms more completely. First, illumination changes and shadow,
as the most important challenge met by radiometric cameras due to their sensitivity to
brightness, appear in datasets IV and V (Figure 8a,b). Secondly, a large variety of vehicles,
ranging from private cars with diverse sizes and colors to large heavy vehicles such as
buses, can be found in dataset VIII (Figure 8c) and dataset IV (Figure 8a). These vehicles
can be found in several countries, such as Canada and England. Notably, a range of fields
of view, set by different relations between the cameras and road surfaces (i.e., vertical,
low oblique, high oblique), were considered as they provide different types of contextual
information. For example, more vehicle bodies can be recorded by high oblique cameras
(dataset VIII), while the top parts of vehicles can only be recorded by cameras with a
vertical view (dataset VII) (Figure 8c,d).
Figure 8. Datasets' challenges; (a,b) illumination and shadow; (c,d) high oblique and low oblique angles of view; (e,f) scale variation of vehicles; (g,h) weather conditions such as foggy and rainy; (i) detection of vehicles that were occluded.
conditions such as rainy and foggy to assess whether the algorithms can still detect vehicles
(Figure 8g,h). Finally, vehicle occlusion, shown in Figure 8i, is another parameter that
was covered by the datasets. The YOLOv7 algorithm seemed to work properly in such
situations of unclear or inaccurate information.
Model | Size (Pixels) | mAPval 50–95 | Speed CPU ONNX (ms) | Speed A100 TensorRT (ms) | Params (M) | FLOPs (B)
YOLOv8n | 640 | 37.3 | 80.4 | 0.99 | 3.2 | 8.7
YOLOv8s | 640 | 44.9 | 128.4 | 1.20 | 11.2 | 28.6
YOLOv8m | 640 | 50.2 | 234.7 | 1.83 | 25.9 | 78.9
YOLOv8l | 640 | 52.9 | 375.2 | 2.39 | 43.7 | 165.2
YOLOv8x | 640 | 53.9 | 479.1 | 3.53 | 68.2 | 257.8
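As a rough idea of how the per-frame CPU latency of such a model could be measured (this is not the authors' exact protocol), the sketch below assumes the ultralytics package and a hypothetical frame.jpg:

```python
# Rough per-frame CPU timing sketch (assumes the `ultralytics` package); it only
# illustrates how a per-frame latency could be measured on a single hypothetical frame.
import time
from ultralytics import YOLO

model = YOLO("yolov8n.pt")            # smallest variant from the table above
frame = "frame.jpg"                   # hypothetical single highway frame

model(frame)                          # warm-up run (first call includes model setup)
start = time.perf_counter()
model(frame)
print(f"CPU inference time: {(time.perf_counter() - start) * 1000:.1f} ms")
```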
YOLO versions (at about 98%). Neupane and Horanont [60] used the models produced by
the YOLO versions as the base for transfer learning when enhancing training data, similarly
to the previous works. In this case, they did not consider various cameras with different
resolutions, or even illumination changes. Also, the enhanced YOLOs were not assessed in
both night and daytime. A couple of studies on vehicle on-board cameras have been published
in the context of the evaluation of deep learning in vehicle detection [35,36]. Since these
cameras have completely different structures and fields of view from highway cameras, this
is not an effective way to compare the algorithms.
In conclusion, there is no need for additional training data to enhance the performance
of YOLO versions in vehicle detection. The released versions of YOLO work effectively
in vehicle detection and classification, without any considerable errors in localization.
Heavy trucks are detected more accurately when the camera’s angle of view is oblique,
while private cars are detectable with precision from any direction of view. Noticeably, the
algorithm can be run in real-time situations if a GPU processor is used.
Author Contributions: Conceptualization, D.S. and C.L.; methodology, D.S., C.L. and S.H.; software,
D.S., C.L. and S.H.; validation, C.L. and S.H.; formal analysis, C.L. and S.H.; investigation, D.S.;
resources, D.S.; data curation, D.S.; writing—original draft preparation, D.S.; writing—review and
editing, C.L. and S.H.; visualization, D.S.; supervision, C.L. and S.H.; project administration, C.L.;
funding acquisition, C.L. and S.H. All authors have read and agreed to the published version of
the manuscript.
Funding: This research was funded by Mitacs grant number IT30935 and Semaphor.ai.
Data Availability Statement: Data sharing is not applicable to this paper.
Acknowledgments: The authors would like to thank all the individuals and organizations who
made these datasets and algorithms available. In particular, we want to express our sincere appreci-
ation and gratitude to the Semaphor.ai team and Mitacs for their funding support in making this
project possible.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Lv, Z.; Shang, W. Impacts of intelligent transportation systems on energy conservation and emission reduction of transport
systems: A comprehensive review. Green Technol. Sustain. 2023, 1, 100002. [CrossRef]
2. Pompigna, A.; Mauro, R. Smart roads: A state of the art of highways innovations in the Smart Age. Eng. Sci. Technol. Int. J. 2022,
25, 100986. [CrossRef]
3. Regragui, Y.; Moussa, N. A real-time path planning for reducing vehicles traveling time in cooperative-intelligent transportation
systems. Simul. Model. Pract. Theory 2023, 123, 102710. [CrossRef]
4. Wu, Y.; Wu, L.; Cai, H. A deep learning approach to secure vehicle to road side unit communications in intelligent transportation
system. Comput. Electr. Eng. 2023, 105, 108542. [CrossRef]
5. Zuo, J.; Dong, L.; Yang, F.; Guo, Z.; Wang, T.; Zuo, L. Energy harvesting solutions for railway transportation: A comprehensive
review. Renew. Energy 2023, 202, 56–87. [CrossRef]
6. Yang, Z.; Peng, J.; Wu, L.; Ma, C.; Zou, C.; Wei, N.; Zhang, Y.; Liu, Y.; Andre, M.; Li, D.; et al. Speed-guided intelligent
transportation system helps achieve low-carbon and green traffic: Evidence from real-world measurements. J. Clean. Prod. 2020,
268, 122230. [CrossRef]
7. Chen, Z.; Guo, H.; Yang, J.; Jiao, H.; Feng, Z.; Chen, L.; Gao, T. Fast vehicle detection algorithm in traffic scene based on improved
SSD. Measurement 2022, 201, 111655. [CrossRef]
8. Ribeiro, D.A.; Melgarejo, D.C.; Saadi, M.; Rosa, R.L.; Rodríguez, D.Z. A novel deep deterministic policy gradient model applied to
intelligent transportation system security problems in 5G and 6G network scenarios. Phys. Commun. 2023, 56, 101938. [CrossRef]
9. Sirohi, D.; Kumar, N.; Rana, P.S. Convolutional neural networks for 5G-enabled Intelligent Transportation System: A systematic
review. Comput. Commun. 2020, 153, 459–498. [CrossRef]
10. Lackner, T.; Hermann, J.; Dietrich, F.; Kuhn, C.; Angos, M.; Jooste, J.L.; Palm, D. Measurement and comparison of data rate
and time delay of end-devices in licensed sub-6 GHz 5G standalone non-public networks. Procedia CIRP 2022, 107, 1132–1137.
[CrossRef]
11. Wang, Y.; Cao, G.; Pan, L. Multiple-GPU accelerated high-order gas-kinetic scheme for direct numerical simulation of compressible
turbulence. J. Comput. Phys. 2023, 476, 111899. [CrossRef]
12. Sharma, H.; Kumar, N. Deep learning based physical layer security for terrestrial communications in 5G and beyond networks:
A survey. Phys. Commun. 2023, 57, 102002. [CrossRef]
13. Ounoughi, C.; Ben Yahia, S. Data fusion for ITS: A systematic literature review. Inf. Fusion 2023, 89, 267–291. [CrossRef]
14. Afat, S.; Herrmann, J.; Almansour, H.; Benkert, T.; Weiland, E.; Hölldobler, T.; Nikolaou, K.; Gassenmaier, S. Acquisition time
reduction of diffusion-weighted liver imaging using deep learning image reconstruction. Diagn. Interv. Imaging 2023, 104, 178–184.
[CrossRef] [PubMed]
15. Xu, M.; Yoon, S.; Fuentes, A.; Park, D.S. A Comprehensive Survey of Image Augmentation Techniques for Deep Learning. Pattern
Recognit. 2023, 137, 109347. [CrossRef]
16. Zhou, Y.; Ji, A.; Zhang, L.; Xue, X. Sampling-attention deep learning network with transfer learning for large-scale urban point
cloud semantic segmentation. Eng. Appl. Artif. Intell. 2023, 117, 105554. [CrossRef]
17. Yu, C.; Zhang, Z.; Li, H.; Sun, J.; Xu, Z. Meta-learning-based adversarial training for deep 3D face recognition on point clouds.
Pattern Recognit. 2023, 134, 109065. [CrossRef]
18. Kim, C.; Ahn, S.; Chae, K.; Hooker, J.; Rogachev, G. Noise signal identification in time projection chamber data using deep
learning model. Nucl. Instrum. Methods Phys. Res. Sect. A Accel. Spectrometers Detect. Assoc. Equip. 2023, 1048, 168025. [CrossRef]
19. Zhang, X.; Zhai, D.; Li, T.; Zhou, Y.; Lin, Y. Image inpainting based on deep learning: A review. Inf. Fusion 2023, 90, 74–94.
[CrossRef]
20. Mo, W.; Zhang, W.; Wei, H.; Cao, R.; Ke, Y.; Luo, Y. PVDet: Towards pedestrian and vehicle detection on gigapixel-level images.
Eng. Appl. Artif. Intell. 2023, 118, 105705. [CrossRef]
21. Bie, M.; Liu, Y.; Li, G.; Hong, J.; Li, J. Real-time vehicle detection algorithm based on a lightweight You-Only-Look-Once
(YOLOv5n-L) approach. Expert Syst. Appl. 2023, 213, 119108. [CrossRef]
22. Liang, Z.; Huang, Y.; Liu, Z. Efficient graph attentional network for 3D object detection from Frustum-based LiDAR point clouds.
J. Vis. Commun. Image Represent. 2022, 89, 103667. [CrossRef]
23. Tian, Y.; Guan, W.; Li, G.; Mehran, K.; Tian, J.; Xiang, L. A review on foreign object detection for magnetic coupling-based electric
vehicle wireless charging. Green Energy Intell. Transp. 2022, 1, 100007. [CrossRef]
24. Yang, Z.; Pun-Cheng, L.S. Vehicle detection in intelligent transportation systems and its applications under varying environments:
A review. Image Vis. Comput. 2018, 69, 143–154. [CrossRef]
25. Wang, Z.; Ma, Y.; Zhang, Y. Review of pixel-level remote sensing image fusion based on deep learning. Inf. Fusion 2023, 90, 36–58.
[CrossRef]
26. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision
and Pattern Recognition; Springer International Publishing: Cham, Switzerland, 2016.
27. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-Based Convolutional Networks for Accurate Object Detection and
Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 142–158. [CrossRef]
28. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo Algorithm Developments. Procedia Comput. Sci. 2022, 199, 1066–1073.
[CrossRef]
29. Ramachandran, A.; Sangaiah, A.K. A review on object detection in unmanned aerial vehicle surveillance. Int. J. Cogn. Comput.
Eng. 2021, 2, 215–228. [CrossRef]
30. Kim, J.A.; Sung, J.Y.; Park, S.H. Comparison of Faster-RCNN, YOLO, and SSD for Real-Time Vehicle Type Recognition. In
Proceedings of the 2020 IEEE International Conference on Consumer Electronics—Asia (ICCE-Asia), Seoul, Republic of Korea,
1–3 November 2020.
31. Qiu, Q.; Lau, D. Real-time detection of cracks in tiled sidewalks using YOLO-based method applied to unmanned aerial vehicle
(UAV) images. Autom. Constr. 2023, 147, 104745. [CrossRef]
32. Dang, F.; Chen, D.; Lu, Y.; Li, Z. YOLOWeeds: A novel benchmark of YOLO object detectors for multi-class weed detection in
cotton production systems. Comput. Electron. Agric. 2023, 205, 107655. [CrossRef]
33. Li, M.; Zhang, Z.; Lei, L.; Wang, X.; Guo, X. Agricultural Greenhouses Detection in High-Resolution Satellite Images Based on
Convolutional Neural Networks: Comparison of Faster R-CNN, YOLO v3 and SSD. Sensors 2020, 20, 4938. [CrossRef] [PubMed]
34. Azimjonov, J.; Özmen, A. A real-time vehicle detection and a novel vehicle tracking systems for estimating and monitoring traffic
flow on highways. Adv. Eng. Inform. 2021, 50, 101393. [CrossRef]
35. Han, X.; Chang, J.; Wang, K. Real-time object detection based on YOLO-v2 for tiny vehicle object. Procedia Comput. Sci. 2021,
183, 61–72. [CrossRef]
36. Tao, C.; He, H.; Xu, F.; Cao, J. Stereo priori RCNN based car detection on point level for autonomous driving. Knowl.-Based Syst.
2021, 229, 107346. [CrossRef]
37. Zhang, Q.; Hu, X.; Yue, Y.; Gu, Y.; Sun, Y. Multi-object detection at night for traffic investigations based on improved SSD
framework. Heliyon 2022, 8, e11570. [CrossRef]
38. Shawon, A. Road Traffic Video Monitoring. 2020. Available online: https://www.kaggle.com/datasets/shawon10/road-traffic-video-monitoring?select=traffic_detection.mp4 (accessed on 1 January 2021).
39. Shah, A. Highway Traffic Videos Dataset. 2020. Available online: https://www.kaggle.com/datasets/aryashah2k/highway-traffic-videos-dataset (accessed on 1 March 2020).
40. Saha, S. A Comprehensive Guide to Convolutional Neural Networks—The ELI5 Way. 2018. Available online: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53 (accessed on 15 December 2018).
41. Ding, J.; Li, X.; Kang, X.; Gudivada, V.N. Augmentation and evaluation of training data for deep learning. In Proceedings of the
2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 11–14 December 2017.
42. Phuong, T.M.; Diep, N.N. Speeding Up Convolutional Object Detection for Traffic Surveillance Videos. In Proceedings of the 2018
10th International Conference on Knowledge and Systems Engineering (KSE), Ho Chi Minh City, Vietnam, 1–3 November 2018.
43. Tian, Y.; Su, D.; Lauria, S.; Liu, X. Recent advances on loss functions in deep learning for computer vision. Neurocomputing 2022,
497, 129–158. [CrossRef]
44. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
45. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
46. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
47. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
48. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
49. Jocher, G. Yolov5. Code Repository. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 1 July 2020).
50. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection
framework for industrial applications. arXiv 2022, arXiv:2209.02976.
51. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object
detectors. arXiv 2022, arXiv:2207.02696.
52. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13
December 2015.
53. Chen, X.; Gupta, A. An implementation of faster rcnn with study for region sampling. arXiv 2017, arXiv:1702.02138.
54. Powers, D.M. Evaluation: From precision, Recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2020,
arXiv:2010.16061.
55. Vostrikov, A.; Chernyshev, S. Training sample generation software. In Intelligent Decision Technologies 2019, Proceedings of the
11th KES International Conference on Intelligent Decision Technologies (KES-IDT 2019), St. Julians, Malta, 17–19 June 2019; Springer:
Berlin/Heidelberg, Germany, 2019; Volume 2.
56. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in
context. In Computer Vision–ECCV 2014, Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014, Part V
13; Springer: Berlin/Heidelberg, Germany, 2014.
57. Bathija, A.; Sharma, G. Visual object detection and tracking using Yolo and sort. Int. J. Eng. Res. Technol. 2019, 8, 705–708.
58. Terven, J.; Cordova-Esparza, D. A comprehensive review of YOLO: From YOLOv1 to YOLOv8 and beyond. arXiv 2023,
arXiv:2304.00501.
59. Song, H.; Liang, H.; Li, H.; Dai, Z.; Yun, X. Vision-based vehicle detection and counting system using deep learning in highway
scenes. Eur. Transp. Res. Rev. 2019, 11, 51. [CrossRef]
60. Neupane, B.; Horanont, T.; Aryal, J. Real-Time Vehicle Classification and Tracking Using a Transfer Learning-Improved Deep
Learning Network. Sensors 2022, 22, 3813. [CrossRef] [PubMed]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.