Article
A Comparative Analysis of Multi-Label Deep Learning
Classifiers for Real-Time Vehicle Detection to Support
Intelligent Transportation Systems
Danesh Shokri 1,2, *, Christian Larouche 1,2 and Saeid Homayouni 3

1 Département des Sciences Géomatiques, Université Laval, Québec, QC G1V 0A6, Canada;
christian.larouche@scg.ulaval.ca
2 Centre de Recherche en Données et Intelligence Géospatiales (CRDIG), Université Laval,
Québec, QC G1V 0A6, Canada
3 Centre Eau Terre Environnement, Institut National de la Recherche Scientifique,
Québec, QC G1V 0A6, Canada; saeid.homayouni@inrs.ca
* Correspondence: danesh.shokri.1@ulaval.ca

Abstract: An Intelligent Transportation System (ITS) is a vital component of smart cities due to
the growing number of vehicles year after year. In the last decade, vehicle detection, as a primary
component of ITS, has attracted scientific attention because, by knowing vehicle information (i.e., type,
size, numbers, location, speed, etc.), the ITS parameters can be acquired. This has led to developing
and deploying numerous deep learning algorithms for vehicle detection. Single Shot Detector (SSD),
Region Convolutional Neural Network (RCNN), and You Only Look Once (YOLO) are three popular
deep structures for object detection, including vehicles. This study evaluated these methodologies on
nine highly challenging datasets to assess their performance in diverse environments. Generally, YOLO
versions had the best performance in detecting and localizing vehicles compared to SSD and RCNN.
Among the YOLO versions (YOLOv8, v7, v6, and v5), YOLOv7 showed better detection and classification
(car, truck, bus) performance, at the cost of a slower computation time. The YOLO versions
have achieved more than 95% accuracy in detection and 90% Overall Accuracy (OA) for the
classification of vehicles, including cars, trucks and buses. The computation time on the CPU processor
was between 150 milliseconds (YOLOv8, v6, and v5) and around 800 milliseconds (YOLOv7).

Keywords: Intelligent Transportation System (ITS); road traffic surveillance; vehicle detection and
localization; deep neural network structures; highway cameras; smart cities

1. Introduction

Nowadays, having a robust and efficient Intelligent Transportation System (ITS) is a
key component of metropolises in relation to (1) reducing fuel consumption, (2) monitoring
carbon dioxide emissions, (3) reducing time wasted behind stop lights, (4) optimizing
traffic flow, (5) extending the life of road infrastructures, and, noticeably, (6) rationalizing
parking space [1,2]. ITS proposes solutions not only for environmental issues like air
pollution, but also for issues such as wasted time and maintenance costs; therefore, urban
planners use ITS mainly for environmental, traffic congestion, and economic purposes [3–5].
In addressing environmental concerns, air quality will improve if vehicles, especially
private cars, spend less time waiting at red lights, as vehicles tend to produce more
emissions when stopped. This reduces carbon dioxide emissions, the largest factor in
producing greenhouse gases [6]. By counting and tracking vehicles, ITS can predict
intersection density for controlling traffic light systems and easing traffic jams [3,7].
Therefore, an efficient and smooth transportation system is achieved as a result. As
estimated by UNICEF, traffic congestion wastes more than USD 100 billion annually,
whether borne by private or state organizations (www.UNICEF.org, accessed on 12 September
2023). Consequently, there are chances of individuals being badly injured in a crash, which
raises problems such as harm to individuals' mental and physical health or a heavier
financial burden.
Recently, the objective of obtaining an efficient ITS is closer to reality due to the ad-
vancements in data transmission by fifth Generation (5G) wireless technology [8,9]. This
technology transmits information at speeds of between 15 and 20 Gbps (gigabits per second),
with a latency 10 times better than that of 4G [10]. In addition, advancements in cloud
computing systems and Graphic Processing Units (GPUs) attract researchers’ attention and
enable them to monitor urban dynamics in real time [11]. Therefore, these developments
in hardware and data transmission provide an opportunity to enhance ITS structures and
implement state-of-the-art methodologies such as deep learning neural networks for vehicle
detection [12]. Vehicle detection is a prominent stage of ITS construction [1,13]. Deep
learning algorithms have shown very impressive performances in object detection and
classification using a great variety of resources, such as radiometric images [14,15], Light
Detection and Ranging (LiDAR) point clouds [16,17], and one-dimensional signals [18].
Face recognition, self-driving vehicles, and language translation are three popular applica-
tions of deep learning algorithms [19]. On the other hand, these algorithms are supervised
and need huge training datasets to detect objects.
Despite the challenges they face, deep learning algorithms offer a considerable benefit
for vehicle detection, the most important parameter of ITS [20]. Knowing vehicles'
locations can enable us to measure any relevant component of a smart city or traffic
(i.e., density, traffic flow, speed). Consequently, various deep learning structures have
been proposed for vehicle detection, based on the four popular ITS methodologies of
cameras [21], LiDAR point clouds [22], wireless magnetics [23] and radar detectors [13].
In ITS, cameras have shown the most efficient and accurate performance due to their
ability to record contextual information, cover a larger area and be affordable, and, notably,
they are the only tool that can achieve license plate recognition when traffic violations occur [2,24].
Most importantly, deep learning structures work on two-dimensional camera images, and
do not require any transformation space between image sequences and deep learning
layers [25]. This adaptability between camera images and deep learning layers results
in the development of numerous algorithms for vehicle detection. Although several
morphological procedures, such as opening, have been suggested for vehicle detection,
their high sensitivity to illumination changes and weather conditions has made them
obsolete [24].
Generally, the deep learning algorithms that have been used most widely in vehicle
detection are (i) Single Shot MultiBox Detector (SSD) [26], (ii) Region-Based Convolutional
Neural Network (RCNN) [27], and lastly (iii), You Only Look Once (YOLO) [28]. Each of
these algorithms has its own benefits and drawbacks in object detection and localization.
For example, Faster RCNN, which is a branch of RCNN, shows a better performance in
the detection of small-scale objects, while it is not suitable for real-time object detection,
unlike SSD and YOLO (www.towardsdatascience.com, accessed on 12 September 2023).
Developers have released eight versions of YOLO (e.g., YOLOv1), four versions of RCNN
(i.e., RCNN, Mask, Fast, and Faster RCNN) and two SSD versions (i.e., SSD 512 and SSD
300) since 2015 [24,29,30]. For training these methodologies, various free-access benchmark
datasets such as COCO, ImageNet Large Scale Visual Recognition Challenge (ILSVRC), and
PASCAL VOC, comprising more than 300,000 images, were used. A great variety of objects,
ranging from vehicles to animals, have been covered in recognition applications. Since
these algorithms have shown their excellent performance in object detection, researchers
have used the acquired weights of these deep learning algorithms in other fields, such as
the detection of sidewalk cracks [31], fish [31], and weeds [32]. This has been referred to as
transfer learning.
Kim, Sung [30] made a comparison between SSD, Faster RCNN, and YOLOv4 in
the context of detecting vehicles on road surfaces. Private cars, mini-vans, big vans, mini-
trucks, trucks and compact cars were set as the vehicle classes. The weights of deep learning
algorithms were adjusted by their training data. They concluded that the SSD had the
fastest performance, at 105 FPS (frame per second), while YOLOv4 reached the highest
classification accuracy, at around 98%. In a creative way, Li, Zhang [33] used SSD, RCNN
and YOLOv3 for transfer learning in the field of agricultural greenhouse detection. They
used high-resolution satellite images provided by Gaofen-2 with a spatial resolution of 2 m.
In this case, YOLOv3 achieved the highest performance in terms of both computational
time and acquired accuracy. Azimjonov and Özmen [34] suggested that if YOLO can be
combined with machine learning classifiers such as Support Vector Machine (SVM), the
final accuracy of YOLO on highway video cameras would increase sharply from about 57% to
around 95%.
Similarly, Han, Chang [35] increased YOLO's accuracy by adding low- and high-level fea-
tures to the YOLO network. The new network was called O-YOLOv2 and was evaluated
via application to a KITTI dataset, achieving around 94% accuracy. The studies of [7,36,37]
used SSD and RCNN in multi-object recognition, including vehicles.
In summary, previous studies have tried to assess the abovementioned deep learning
algorithms in various fields, particularly vehicle recognition, but there are still significant
gaps in their application for transportation system purposes. This means that these deep
learning structures have rarely been applied to highway cameras in challenging situations
such as nighttime. Indeed, this was the main motivation of this study, because vehicles can
hardly be seen on nighttime images. Previous works did not consider occlusion situations,
wherein parts of vehicles were not recorded by cameras. As occlusion may frequently
occur on busy roads, the algorithms should be robust in this context. Also, huge amounts
of training data and cloud computing systems were used, which is neither time-efficient
nor affordable. We have shown that there is no need to use and collect training data in
order to improve the efficiency of the mentioned deep learning algorithms. In addition,
illumination changes are a main challenge that has been mostly ignored by previous studies.
Since ITS should be robust and usable 24 h a day, the algorithms should show acceptable
performance at any time of the day and night. Finally, previous studies have rarely tested
their methodologies, including YOLO, SSD and RCNN, on diverse weather conditions
such as snowy and rainy days.
This study evaluates the state-of-the-art YOLO, RCNN, and SSD methodologies by
application to image sequences captured by highway cameras to detect and classify vehicles.
The algorithms must be able to detect any vehicle’s state, whether this be shape, color,
or even size. Noticeably, video cameras also record information during both night and
daytime. Consequently, the key contribution of this study is in making a clear comparison
between SSD, RCNN, and YOLO when applied to highway cameras in the following ways:
• Providing the most challenging highway videos. To make an acceptable assessment,
they must cover various states of vehicles, such as occlusion, weather conditions (i.e.,
rainy), low- to high-quality video frames, and different resolutions and illuminations
(images collected during the day and at night). Also, the videos must be recorded
from diverse viewing angles with cameras installed on top of road infrastructures, in
order to determine the best locations. Section 2 covers this first contribution;
• Making a comprehensive comparison between the deep learning algorithms in terms
of accuracy in both vehicle detection and classification. The vehicles are
categorized into the three classes of car, truck, and bus. The computation time of the
algorithms is also assessed to determine which one presents a better potential usability
in real-time situations. Section 3 covers this second contribution.

2. Traffic Video Data


The organization “Ministère des Transports et de la Mobilité durable du Québec”,
located in the Province of Quebec, Canada, has established numerous online highway
cameras. Figure 1 shows nine samples of images acquired with those cameras. The cameras
work 24 h a day, covering multiple road lanes under all illumination and weather conditions.
Therefore, these cameras are fairly suitable for use in our assessment of the deep learning
structures relevant to vehicle detection and localization on highways. The vehicles appear
relatively small on the images, which offer little contextual information due to the distance
of the cameras from the road surface and the low resolution of the camera. The dimension
of each frame is 352 × 240, and the images were recorded on 14 January 2023.

Figure 1. Samples of highway cameras located in Quebec, Canada.

In order to perform a comprehensive assessment of the methodologies, the processed
highway videos should cover as many road challenges as possible. Therefore, the next
high-quality video datasets were downloaded from the YouTube and KAGGLE
(www.kaggle.com, accessed on 12 September 2023) platforms (Figure 2). As can be seen in
Figure 2, datasets IV and V include numerous vehicles, ranging from cars to buses, shown
at night time. The main attribute of these two datasets is that the vehicles' headlights are
on, and the vehicles are moving fast in both directions. In other datasets, the cameras are
relatively close to the cars on the roads, which results in the collection of more contextual
information. Also, the angle of view of some of the cameras (i.e., dataset VII) is not
perpendicular to the road infrastructure. This means that vehicles are recorded from a side
view, and this increases the complexity of the environments considered. Shawon [38]
released a video on KAGGLE, which is an online community platform for data scientists,
in order to monitor traffic flow (dataset II). These cameras produce images with a
resolution of 1364 × 768 pixels at a frequency of 25 FPS. This high-quality video includes
various British vehicles ranging from commercial trucks to private cars. Another positive
side of this video is that the vehicles therein move on two separate roads in opposite
directions, meaning that the algorithms evaluate front and rear vehicle views. This video
also provides more of the contextual information of on-road vehicles, which may enable
better vehicle detection and classification. Another British highway dataset was released
by Shah [39], including frequent challenging traffic flow types (Dataset III). Table 1
provides a summary of the information of the selected datasets in terms of pixel
resolution, time of recording, etc. In Section 5.1, we provide precise and in-depth
explanations about why these datasets have been chosen.

Figure 2. Views of the selected video datasets used for the evaluation of the deep learning structures.

Table 1. Specification attributes of the selected datasets.

Dataset | Day/Night | Frames | FPS | Width | Height | Rear/Front | Quality | Angle of View (AoV) | Link (accessed on 12 September 2023)
Dataset I | Both | 2250 | 15 | 352 | 240 | Both | Low | Vertical-Low-High | https://www.quebec511.info/fr/Carte/Default.aspx
Dataset II | Day | 1525 | 25 | 1364 | 768 | Both | Medium | Low | https://www.kaggle.com/datasets/shawon10/road-traffic-video-monitoring
Dataset III | Day | 250 | 10 | 320 | 240 | Rear | Low | Vertical | https://www.kaggle.com/datasets/aryashah2k/highway-traffic-videos-dataset
Dataset IV | Night | 61,840 | 30 | 1280 | 720 | Both | Very High | Vertical | https://www.youtube.com/watch?v=xEtM1I1Afhc
Dataset V | Night | 178,125 | 25 | 1280 | 720 | Both | Low | Vertical | https://www.youtube.com/watch?v=iA0Tgng9v9U
Dataset VI | Day | 62,727 | 30 | 854 | 480 | Rear | Medium | Vertical | https://youtu.be/QuUxHIVUoaY
Dataset VII | Day | 9180 | 30 | 1920 | 1080 | Front | Very High | Low | https://www.youtube.com/watch?v=MNn9qKG2UFI&t=7s
Dataset VIII | Day | 107,922 | 30 | 1280 | 720 | Front | High | High | https://youtu.be/TW3EH4cnFZo
Dataset IX | Day | 1525 | 25 | 1280 | 720 | Both | High | Vertical | https://www.youtube.com/watch?v=wqctLW0Hb_0&t=10s
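For reproducibility, the frame counts, FPS values, and frame dimensions listed in Table 1 can be read back from a downloaded copy of any of these videos with OpenCV. The sketch below is illustrative only; the file name is a hypothetical local copy of one of the datasets.

import cv2  # pip install opencv-python

# Hypothetical local copy of one of the downloaded videos (e.g., dataset IV).
cap = cv2.VideoCapture("dataset_iv.mp4")

fps = cap.get(cv2.CAP_PROP_FPS)                     # frames per second
n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))   # total number of frames
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))      # frame width in pixels
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))    # frame height in pixels

print(f"{n_frames} frames at {fps:.1f} FPS, resolution {width} x {height}")
cap.release()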
3. Deep Learning Methodologies Applied to Vehicle Detection

The deep neural network structures generally comprise training data, region proposals,
feature extraction, layer selection, and classifiers. Figure 3 presents an overview of a deep
learning structure used for detecting and identifying a pattern or object in an image.

Figure 3. An overview of a deep learning structure [40].
Training Data

Deep learning algorithms are mostly supervised and require pre-defined data, meaning
that the images of vehicles, as the research focus of this study, should first be provided
to the deep learning network. This process is undertaken by drawing boxes around each
known object, such as in Figure 4a. As long as the training data are comprehensive and
cover any vehicle state, ranging from color to type, the algorithm will detect vehicles
precisely. In relation to this, various inclusive free-access datasets have been released;
COCO (Common Objects in Context) and ILSVRC (ImageNet Large Scale Visual Recognition
Challenge) are two samples thereof. These are widely used as benchmark datasets in
computer vision. They contain labeled images featuring diverse object categories, object
annotations, and segmentation masks for object detection and recognition tasks. Figure 4b
shows 400 pre-defined images from COCO, which include humans, animals, houses, and
vehicles. In order to improve the amount of data and the model's prediction accuracy,
augmentation is suggested, which applies varying angular and rotation steps to the original
data [41]. This also results in reducing the cost of labeling data, and generates variability
and flexibility.

Figure 4. Training data in order to feed them into deep learning structures [42]; (a) bounding boxes
around vehicles; (b) samples from the COCO dataset used as training data (www.coco.org, accessed
on 12 September 2023).

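As a short illustration of the rotation-based augmentation mentioned above, the following sketch rotates a labeled training image with OpenCV; the angles and file name are assumptions for demonstration, not settings used in the cited works. Note that the corresponding bounding boxes would need to be transformed with the same matrix.

import cv2
import numpy as np

def rotate(image: np.ndarray, angle_deg: float) -> np.ndarray:
    """Rotate an image around its centre, keeping the original size."""
    h, w = image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    return cv2.warpAffine(image, matrix, (w, h))

# Hypothetical example: generate rotated copies of one labeled training image.
image = cv2.imread("vehicle_sample.jpg")
augmented = [rotate(image, angle) for angle in (5, 10, 15, -5, -10, -15)]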
Region Proposals

Traditional algorithms have sought to assess individual pixels from the inputted images
for the sake of object detection and localization. This process was shown to be time-consuming
due to the required analysis of thousands of pixels by deep layer networks. The
state-of-the-art methodologies offer several solutions for finding candidate pixels instead
of evaluating pixel-by-pixel. For instance, the SSD and YOLO methods divide the input
images into grids of the same length. This process reduces the computation time sharply,
as it analyzes only a few cells instead of thousands of pixels. Assume image I has dimensions
of 300 × 300, and the grid of SSD is 8 × 8. The computation process will be decreased from
approximately 9 × 10^4 (300 × 300) to 64 cells.

Feature Extraction

Convolutional layers (Conv) represent a popular feature extraction process because
they can be used for calculating features without any human supervision. In addition, the
convolutional layers help prevent overfitting, which is a noticeable problem in machine
learning algorithms, as they introduce flexibility in feature learning. A sample feature
extraction model, VGG16, is shown in Figure 5a. Via trial and error, researchers have found
the optimal convolution size, for example, 3 × 3 in SSD. Since these layers may produce
negative values, an activation function is considered. Various activation functions
(Equations (1)–(3)) have been proposed, and ReLU (Equation (1)) is an example thereof.
The ReLU converts the negative values of the convolutional layers into zero. To reduce the
huge volume of the convolution layers, the two steps of max pooling and stride are
required, as shown in Figure 5b.

ReLU: f(x) = x if x ≥ 0; 0 if x < 0    (1)

Sigmoid: σ(x) = 1 / (1 + e^(−x))    (2)

Hyperbolic Tangent: Tanh(x) = sinh(x) / cosh(x) = (e^x − e^(−x)) / (e^x + e^(−x))    (3)

where x is the value of the convolution layer's outputs.
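The activation functions of Equations (1)–(3) translate directly into a few NumPy lines; this is only a sketch of the formulas above, not code taken from any of the reviewed frameworks.

import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    # Equation (1): negative convolution outputs are set to zero.
    return np.where(x >= 0, x, 0.0)

def sigmoid(x: np.ndarray) -> np.ndarray:
    # Equation (2)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x: np.ndarray) -> np.ndarray:
    # Equation (3): sinh(x) / cosh(x)
    return np.tanh(x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), sigmoid(x), tanh(x))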

Figure 5. Feature extraction: (a) the VGG16 model used for feature extraction
(www.towardsdatascience.com, accessed on 12 September 2023); (b) applying pooling and stride to a
4 × 4 image (www.geeksforgeeks.org, accessed on 12 September 2023).

Layer Selection and Classifier

After feature extraction and pooling, the remaining features are flattened and fed into
the deep-layer neurons. For each neuron, a feature is assigned. Afterward, a fully connected
(FC) neural network, which links a neuron to all the neurons in the adjacent layer, is
considered. Next, a classifier, which can operate by machine learning, such as a Support
Vector Machine (SVM) or a probabilistic model, is required to determine the object type.
This classifier assigns a value of {0, 1} for each object of interest, whereby the image class
with the highest score is selected. To achieve the best performance in object detection and
localization, a Back Propagation (BP) step is needed, which adjusts the weights between
neurons by minimizing a loss function. The most popular loss functions are maximum
likelihood, cross entropy, and Mean Squared Error (MSE) [43]. The loss function determines
the difference between the predicted value of an image and its actual class. When the loss
function reaches the minimum rate of difference, it is said that the deep learning algorithm
works properly; in any other case, the algorithm's structure should be changed and
adjusted to ensure higher accuracy.
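As a minimal sketch of two of the loss functions named above (cross entropy and MSE), assuming one-hot ground truth vectors and predicted class probabilities:

import numpy as np

def cross_entropy(y_true: np.ndarray, y_prob: np.ndarray, eps: float = 1e-12) -> float:
    # Mean negative log-likelihood of the true class.
    return float(-np.mean(np.sum(y_true * np.log(y_prob + eps), axis=1)))

def mse(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    # Mean Squared Error between predicted probabilities and one-hot labels.
    return float(np.mean((y_true - y_prob) ** 2))

# Two samples, three classes (car, truck, bus).
y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
y_prob = np.array([[0.8, 0.1, 0.1], [0.3, 0.6, 0.1]])
print(cross_entropy(y_true, y_prob), mse(y_true, y_prob))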
After setting out how a deep structure works, Figure 6 illustrates the flowcharts of
SSD [26], RCNN [44], and YOLO [45] as the most popular vehicle detection methodologies.
Each of these procedures features certain stages in the vehicle localization process, described
as follows.
Figure 6. Deep learning structures of (a) SSD [26], (b) RCNN [44], and (c) YOLO [45].

3.1. Single Shot Multi-Box Detector (SSD)

This algorithm was trained and evaluated on the two large free-access datasets Pascal
VOC (Pattern Analysis, Statistical Modeling, and Computational Learning—Visual Object
Classes) and COCO, and gained an mAP (mean average precision) score of more than
0.74. SSD initially converts the inputted images, whether they comprise training or test
data, into a feature map (grid) with a size of m × n (generally, the grid has dimensions
of 8 × 8). Then, multiple boxes of different sizes are placed around each cell. The sizes
and directions of these boxes are known. This is why it is called a multibox detector
algorithm. Afterwards, features are measured with the help of VGG16 as the base network,
due to its exceptional performance in classification and possession of several auxiliary
convolutional layers. These features help in measuring multiple boxes’ scores between
grids, and collecting ground truth data for each SSD class (i.e., vehicle, pedestrians).
The following equations (Equations (4)–(6)) show the process of score calculation
for both ground truth boxes (d) and estimated ones (l). Here, l refers to the predicted
boxes around each cell. The ground truth boxes that best correspond to l are found by
a matching strategy. The parameter c is the class (i.e., vehicle, dog, cat, etc.), N is the
number of boxes matched with l, and α is considered equal to one by cross-validation.
Each box, whether ground truth or estimated, has four parameters: {center of x (cx),
center of y (cy), width (w) and height (h)}.

L(x, c, l, g) = (1/N) [ L_conf(x, c) + α L_loc(x, l, g) ]    (4)

L_loc(x, l, g) = ∑_{i ∈ Pos}^{N} ∑_{m ∈ {cx, cy, w, h}} x_{ij}^{k} smooth_L1( l_i^m − ĝ_j^m ),
with ĝ_j^{cx} = (g_j^{cx} − d_i^{cx}) / d_i^{w},  ĝ_j^{cy} = (g_j^{cy} − d_i^{cy}) / d_i^{h},
ĝ_j^{w} = log( g_j^{w} / d_i^{w} ),  ĝ_j^{h} = log( g_j^{h} / d_i^{h} )    (5)

L_conf(x, c) = − ∑_{i ∈ Pos}^{N} x_{ij}^{p} log( ĉ_i^{p} ),  where ĉ_i^{p} = exp(c_i^{p}) / ∑_p exp(c_i^{p})    (6)

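A minimal sketch of two building blocks of Equations (4)–(6), the smooth L1 function and the box encoding ĝ, is given below; the variable layout follows the equations, and the sample boxes are hypothetical.

import numpy as np

def smooth_l1(x: np.ndarray) -> np.ndarray:
    # smooth_L1 as used in the SSD localization loss.
    return np.where(np.abs(x) < 1.0, 0.5 * x ** 2, np.abs(x) - 0.5)

def encode_box(g: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Encode a ground truth box g relative to a default (anchor) box d.
    Both boxes are given as (cx, cy, w, h), as in Equation (5)."""
    g_cx = (g[0] - d[0]) / d[2]
    g_cy = (g[1] - d[1]) / d[3]
    g_w = np.log(g[2] / d[2])
    g_h = np.log(g[3] / d[3])
    return np.array([g_cx, g_cy, g_w, g_h])

ground_truth = np.array([120.0, 80.0, 60.0, 40.0])   # hypothetical vehicle box
default_box = np.array([110.0, 85.0, 50.0, 50.0])    # hypothetical anchor
print(encode_box(ground_truth, default_box))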
3.2. You Only Look Once (YOLO)


Like SSD, YOLO first converts the inputted data into S × S grids (7 × 7) of the same
length and measures several bounding boxes around each grid cell (two boxes). Then,
the five parameters of cx, cy, w, h and a confidence score are measured. The first four
parameters represent the bounding box localization, but the confidence score refers to the
maximum percentage of overlap between the YOLO bounding boxes and the ground truth
boxes. This overlap is assessed by the Intersection Over Union (IOU) methodology, and its
output is a probability value for each cell. If we consider the number of classes that YOLO
can detect to be equal to 20, then 7 × 7 × (2 × 5 + 20) = 1470 values can be measured for
each image. Afterwards, 26 convolutional layers are selected, of which the last two are fully
connected. A 1 × 1 convolutional layer, like GoogleNet, is used for reducing the feature
space. Notably, a tiny YOLO has also been released with nine convolutional layers, which
is faster than YOLO. The final output of YOLO is a class probability, according to which a
value between zero and one is assigned to each class. Therefore, the class with the highest
value will be considered the class of the bounding box. As an image may contain no objects,
boxes with a confidence score of near zero are eliminated, and are not considered in the
next stages (this reduces the computation time). YOLO and tiny YOLO have achieved
around 63% and 52% mAP accuracy, respectively, on the VOC 2007 dataset, but a 70%
mAP accuracy when applied on VOC 2012. The main positive of this method is its ability
to extract objects in 45 FPS (Frame Per Second), which means it is suitable for real-time
object detection.
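The output size quoted above (7 × 7 × (2 × 5 + 20) = 1470 values) and the IoU overlap behind the confidence score can be reproduced with a few lines of Python; this is a sketch of the arithmetic, not the original YOLO implementation.

S, boxes_per_cell, n_classes = 7, 2, 20
output_values = S * S * (boxes_per_cell * 5 + n_classes)
print(output_values)  # 1470 values per image

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((10, 10, 50, 40), (20, 15, 60, 45)))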
To date, eight versions of YOLO have been released for use in object extraction.
YOLOv2 includes fine-grained features to improve the accuracy of detecting small ob-
jects [46]. The limited capacity to detect small and multiscale objects was the main drawback of
YOLOv1 [45]. Yolov3 uses logistic classifiers to assign a score to each class, while the
previous ones used a SoftMax procedure [47]. YOLOv3 also employs Darknet-53 as the
feature extraction step, with 53 convolutional layers, which is a deep neural network archi-
tecture commonly used for object detection and classification tasks. Improving computation
speed and object detection accuracy was the main aim of YOLOv4 [48]. It verified the
negative effects of SOTA’s Bag-of-Freebies and Bag-of-Specials by use of COCO as the
training dataset. YOLOv5 has a lower volume, around 27 MB, in comparison with YOLOv4
(277 MB), both of which were released in 2020 [49]. YOLOv6 showed that if an anchor-free
procedure with Varifocal Loss (VFL) is used throughout the training steps, the algorithm
can run 51% faster than other anchor-based methods [50]. YOLOv7 mainly focused on
generating accurate bounding boxes for detecting objects more precisely [51]. Recognizing
objects quickly was the foremost goal of YOLOv8, which employed a cutting-edge SOTA
model (www.ultralytics.com, accessed on 12 September 2023). This study will evaluate the
four last versions of YOLO, i.e., 5, 6, 7, and 8, for use in vehicle detection from highway
videos, because these versions are robust in detecting small objects and have optimized
computation times.
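For reference, the recent YOLO versions evaluated in this study can be applied to a single frame with very little code. The sketch below uses the ultralytics package with a pretrained YOLOv8 nano checkpoint as one possible example; the image path is hypothetical, and the snippet only illustrates the general inference pattern rather than the exact evaluation code used here.

from ultralytics import YOLO  # pip install ultralytics

model = YOLO("yolov8n.pt")  # pretrained on COCO, whose classes include car, truck and bus

# Run inference on one highway frame (hypothetical path).
results = model("highway_frame.jpg")
for box in results[0].boxes:
    class_name = model.names[int(box.cls)]
    if class_name in ("car", "truck", "bus"):
        print(class_name, float(box.conf), box.xyxy.tolist())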

3.3. Region-Based Convolutional Neural Network (RCNN)


The semantic segmentation (region proposals) of images is the first step of RCNN
in the context of decreasing computation time [44]. The four similarities of texture, size,
fill, and color are key to initial image segmentation. Afterward, Convolutional Neural
Networks (CNN) are applied to each selected segment to extract its features. In this case of
feature extraction, pre-trained CNN structures, such as ResNet, VGG19, and EfficientNet,
can be used. Finally, multiple SVM procedures are trained to classify the extracted features
as a specific object, like a vehicle. Fast-RCNN simultaneously applies a CNN structure to
the whole of the inputted image and merges it with the region proposal [52]. This results
in the extraction of more features from the region’s proposal segments. Faster RCNN
uses a region proposal neural network structure instead of similarity conditions [53]. Replacing
the selective search with a learned region proposal network directly improves the generation of
high-quality region proposals. As the Faster RCNN does not use similarity conditions and
is an end-to-end algorithm, the two versions of RCNN and Fast-RCNN are not addressed
in this study.

4. Experimental Results
4.1. Accuracy Evaluation
This step provides numerical information on how many vehicles have been correctly
detected. The Precision, Recall, and F1 Score accuracies are the most common aspects of
algorithm evaluation [54]. The three parameters of True Positive (TP), False Positive (FP),
and False Negative (FN) are required to measure accuracy. TP indicates the number of
vehicles detected correctly by the algorithms, while FP shows the number of non-vehicles
detected falsely as a vehicle. FN specifies the number of vehicles that have not been detected.
The Precision accuracy, based on the equations below (Equations (7)–(9)), refers to what
percentage of the algorithm's output consists of correctly detected vehicles, while Recall
refers to what percentage of the vehicles present in the datasets were detected. F1 Score is a
performance metric that balances Precision and Recall.
In this stage, TP, FP, and FN are calculated for each individual frame, regardless
of whether a vehicle appears in various adjacent frames. Noticeably, as images often
include remote areas of a road wherein vehicles are rarely detected, a Region of Interest
(RoI) selection stage is needed. The vehicles in dataset II yield acceptable contextual
information, but the vehicles located in remote areas of dataset IV, for example, feature less
appropriate information. Therefore, ROI is a suitable tool that can be used to assess the
real performances of the deep learning algorithms in the context of vehicle detection and
classification; thus, the sections of the videos wherein vehicles represent less than 5% of the
frame size are not considered in the accuracy calculation because they offer little contextual
information. As shown in Table 2, which summarizes the acquired results, YOLOv7 showed
the best overall performance on nine datasets in terms of vehicle detection, with around
98% accuracy. The SSD and RCNN have not shown acceptable performances in the context
of vehicle detection, with about 58% and less than 2%, respectively.

Precision = TP / (TP + FP) × 100    (7)

Recall = TP / (TP + FN) × 100    (8)

F1-score = 2 × (Precision × Recall) / (Precision + Recall) × 100    (9)

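Equations (7)–(9) translate directly into Python; in this sketch Precision and Recall are computed as ratios and scaled to percentages at the end, and the TP/FP/FN counts are hypothetical.

def detection_scores(tp: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp)                           # Equation (7), as a ratio
    recall = tp / (tp + fn)                              # Equation (8), as a ratio
    f1 = 2 * precision * recall / (precision + recall)   # Equation (9)
    return {"precision": precision * 100, "recall": recall * 100, "f1": f1 * 100}

# Hypothetical counts for one dataset.
print(detection_scores(tp=950, fp=20, fn=35))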
Table 2. Results acquired by the deep learning structures applied to nine datasets.

Yolov8 Yolov7 Yolov6 Yolov5 Faster RCNN SSD


Dataset I Precision 43.03 96.33 48.96 54.98 2.00< 2.00<
Recall 55.41 100.00 78.39 65.40 2.00< 2.00<
F1-score 48.44 98.13 60.27 59.74 2.00< 2.00<
Dataset II Precision 99.38 100.00 100.00 100.00 92.49 2.00<
Recall 100.00 100.00 96.78 99.11 100.00 2.00<
F1-score 99.69 100.00 98.36 99.55 96.10 2.00<
Dataset III Precision 87.84 97.36 98.25 96.74 2.00< 2.00<
Recall 83.56 100.00 99.69 100.00 2.00< 2.00<
F1-score 85.65 98.66 98.96 98.34 2.00< 2.00<
Dataset IV Precision 98.42 100.00 100.00 100.00 37.24 2.00<
Recall 99.68 99.47 96.54 96.55 98.44 2.00<
F1-score 99.05 99.73 98.24 98.24 54.04 2.00<
Dataset V Precision 96.33 97.77 95.87 94.38 2.00< 2.00<
Recall 97.96 98.69 93.14 96.73 2.00< 2.00<
F1-score 97.14 98.23 94.49 95.54 2.00< 2.00<
Dataset VI Precision 100.00 100.00 99.18 100.00 88.92 2.00<
Recall 96.57 99.98 100.00 99.23 100.00 2.00<
F1-score 98.26 99.99 99.59 99.61 94.14 2.00<
Dataset VII Precision 99.82 99.85 98.67 100.00 97.57 2.00<
Recall 78.36 86.14 80.22 85.64 98.61 2.00<
F1-score 87.80 92.49 88.49 92.26 98.09 2.00<
Dataset VIII Precision 96.28 99.43 97.65 99.44 96.73 2.00<
Recall 56.47 100.00 80.25 97.82 99.37 2.00<
F1-score 71.19 99.71 88.10 98.62 98.03 2.00<
Dataset IX Precision 100.00 98.23 98.00 99.11 73.29 2.00<
Recall 93.66 98.37 84.36 99.86 85.37 2.00<
F1-score 96.73 98.30 91.52 99.48 78.87 2.00<
Average Precision 91.23 98.77 92.95 93.85 54.69 2.00<
Recall 84.63 98.07 89.93 93.37 65.31 2.00<
F1-score 87.10 98.42 91.42 93.61 58.36 2.00<

4.2. Localization Accuracy


The localization accuracy refers to how precisely the algorithms can estimate the posi-
tions of vehicles. The Root Mean Square Error (RMSE), as the most common method used
in position evaluation,
 shows the amount of difference between the bounding
 box predicted
with parameters of Pcx , Pcy , Pw , Ph and the ground truth bounding box Gcx , Gcy , Gw , Gh .
As the ground truth datasets were unavailable, the bounding boxes around vehicles were
drawn by an expert. A Python bounding-box labeling library [55], a useful application for
drawing bounding boxes around objects, was used because it is fast and
user-friendly. In total, 200 vehicles were randomly selected, and we observed that the
YOLO versions achieved the best performance when used for localization estimation, with
RMSE values lower than 30 pixels. This value was more than 500 pixels for Faster RCNN.
Between the YOLO versions, YOLOv8 showed a weaker performance in localization esti-
mation. In the “Section 5.3”, we clearly compare the acquired and estimated localization
accuracies between the methods.
RMSE = ∑_{i=1}^{n} √( (Gcx − Pcx)²_i + (Gcy − Pcy)²_i + (Gw − Pw)²_i + (Gh − Ph)²_i )    (10)

where n is the number of bounding boxes considered in calculating the different localization
accuracies between the estimations of the deep learning algorithms (P) and the ground
truth (G). Gcx and Gcy are, respectively, the centers of the bounding box on the x-axis and
y-axis, and Gw , and Gh are the width and height of the ground truth bounding boxes
(G). This is also true for the bounding boxes estimated by the algorithms. Pcx and Pcy are,
respectively, the centers of the bounding boxes on the x-axis and y-axis, and Pw and Ph are
the width and height of the estimated bounding boxes (P).
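Equation (10), as written above, can be evaluated for a set of matched ground truth and predicted boxes, each given as (cx, cy, w, h); the sample boxes below are hypothetical.

import numpy as np

def localization_rmse(ground_truth: np.ndarray, predicted: np.ndarray) -> float:
    """Per-box position error of Equation (10), summed over n matched boxes.
    Both arrays have shape (n, 4) with columns (cx, cy, w, h), in pixels."""
    diff = ground_truth - predicted
    return float(np.sum(np.sqrt(np.sum(diff ** 2, axis=1))))

gt = np.array([[120.0, 80.0, 60.0, 40.0], [300.0, 150.0, 80.0, 55.0]])
pred = np.array([[123.0, 78.0, 58.0, 43.0], [310.0, 148.0, 76.0, 60.0]])
print(localization_rmse(gt, pred))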

4.3. Running Time


In terms of computational time, it is necessary to run the deep learning codes in
similar environments to assess which algorithm achieves faster in vehicle detection. Here,
we used a personal laptop computer with the following specifications: Windows
10, 16 GB of DDR3 RAM, and an Intel(R) Core(TM) i7-4700HQ CPU @ 2.4 GHz.
Python was used as the programming environment. During the processing of the deep
learning code, other non-relevant apps, such as web browsers, that required RAM or CPU
resources were turned off to avoid a slowdown of the processing time. Only the CPU
was considered in the analysis of the processing time. Other processors, the RAM, and
the Graphics Processing Unit (GPU) were neglected. It is worth mentioning that using
GPU can sharply reduce the computation time and render the algorithms suitable for the
real-time monitoring of roadway infrastructures. Assuming a camera records 30 frames per
second (FPS = 30), an algorithm can be used in real-time monitoring if it extracts vehicles
at less than 1/30 s = 0.033 s or 33 milliseconds (ms) from each frame. Figure 7 shows the
computation times of YOLO versions and Faster RCNN on 1000 frames. As can be seen,
YOLOv5 and YOLOv8 achieved the best performance, at around 100 ms, while Faster
RCNN took about 2.5 s per frame. YOLOv7, which showed the best overall performance
out of the nine datasets in vehicle detection, required around 800 ms per frame.

Figure 7. The computation times of YOLO versions and Faster RCNN over 1000 frames.
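The per-frame computation times reported in Figure 7 can be measured with a timing loop of the following form; the model object and frame list are placeholders for whichever detector and video are being profiled.

import time

def mean_inference_time_ms(model, frames) -> float:
    """Average per-frame inference time in milliseconds on the CPU."""
    start = time.perf_counter()
    for frame in frames:
        model(frame)  # run detection on one frame
    elapsed = time.perf_counter() - start
    return elapsed / len(frames) * 1000.0

# A detector is real-time for a 30 FPS camera if this value stays below about 33 ms.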

4.4. Vehicle Classification


The capacity to determine the class of each detected vehicle (i.e., private car, truck,
bus) is a strength of the deep learning structures. This is because of the availability of
free-access datasets such as COCO [56], which provide not only localization information
using bounding boxes, but also the types of each of the objects. Section 4.1 shows how the
deep learning algorithms work frame by frame, regardless of whether a vehicle appears in
multiple consecutive frames [57].
In order to evaluate errors in the classification procedure, a confusion matrix is used.
The parameters of the proposed confusion matrix can be seen in Table 3. This matrix
has two axes corresponding to an object’s actual and predicted values. For example, the
column of the car has three parameters of { PCC , PTC , PBC }, the sum of which equals the real
number of vehicles in the region. But the row of cars { PCC , PCT , PCB } displays how many
objects were detected as cars by the algorithms. The diagonal cells of { PCC , PTT , PBB } that
have been highlighted in green are true positive values, representing correctly classified
vehicles. The non-diagonal cells highlighted in orange are falsely classified vehicles. The
following three parameters, Commission Error (CE), Overall Accuracy (OA), and Omission
Error (OE), estimate the accuracy of the classification. Equations (11)–(17) show how the
OA, CE, and OE parameters are measured to complete the confusion matrix. This algorithm
performs best when the OA is near 100%, and the CE and OE are close to 0%. Table 4
displays the confusion matrix evaluated for each dataset, wherein the OA of each region
is highlighted in grey. Similar to the vehicle detection results obtained in Section 4.1, the
YOLOv7 algorithm again showed the best performance in object classification. In the
Section 5.3, the acquired confusion matrixes are assessed in greater depth in order to better
understand the algorithm’s performance.

Overall Accuracy = (PCC + PTT + PBB) / (PCC + PCT + PCB + PTC + PTT + PTB + PBC + PBT + PBB) × 100    (11)

CEC = (PCT + PCB) / (PCC + PCT + PCB) × 100    (12)

CET = (PTC + PTB) / (PTC + PTT + PTB) × 100    (13)

CEB = (PBC + PBT) / (PBC + PBT + PBB) × 100    (14)

OEC = (PTC + PBC) / (PCC + PTC + PBC) × 100    (15)

OET = (PCT + PBT) / (PCT + PTT + PBT) × 100    (16)

OEB = (PCB + PTB) / (PCB + PTB + PBB) × 100    (17)

Table 3. Parameters of a confusion matrix used for classification accuracy evaluation.

                 | Actual: Car | Actual: Truck | Actual: Bus | Commission Error
Predicted: Car   | PCC         | PCT           | PCB         | CEC
Predicted: Truck | PTC         | PTT           | PTB         | CET
Predicted: Bus   | PBC         | PBT           | PBB         | CEB
Omission Error   | OEC         | OET           | OEB         | Overall Accuracy
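The measures of Equations (11)–(17) can be computed from a 3 × 3 matrix whose rows are predicted classes and whose columns are actual classes (car, truck, bus), as laid out in Table 3; the example matrix below is hypothetical.

import numpy as np

def confusion_metrics(m: np.ndarray) -> dict:
    """m[i, j] = number of objects predicted as class i whose actual class is j."""
    overall_accuracy = np.trace(m) / m.sum() * 100                     # Equation (11)
    commission = (m.sum(axis=1) - np.diag(m)) / m.sum(axis=1) * 100    # Equations (12)-(14)
    omission = (m.sum(axis=0) - np.diag(m)) / m.sum(axis=0) * 100      # Equations (15)-(17)
    return {"OA": overall_accuracy, "CE": commission, "OE": omission}

matrix = np.array([[90, 5, 1],
                   [4, 40, 2],
                   [0, 3, 10]], dtype=float)
print(confusion_metrics(matrix))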
Table 4. Confusion matrix of YOLO versions used for the evaluation of vehicle classification. A grey
color is used to highlight the OA.

Yolov8 Yolov7 Yolov6 Yolov5


Car Truck Bus Car Truck Bus Car Truck Bus Car Truck Bus
Car 115 12 0 9.45 255 3 0 1.16 131 6 0 4.38 144 6 0 4.00
Dataset
Truck 3 14 0 17.65 7 29 0 19.44 3 22 0 12.00 12 17 0 41.38
I
Bus 0 0 0 N/A 0 0 0 N/A 0 0 0 N/A 0 3 0 N/A
2.54 46.15 N/A 88.97 2.67 9.38 N/A 96.60 2.24 21.43 N/A 94.44 7.69 34.62 N/A 89.94
Car 63 0 0 0.00 63 0 0 0.00 62 3 0 4.62 60 0 0 0.00
Dataset
Truck 0 10 0 0.00 0 11 0 0.00 1 8 0 11.11 3 11 0 21.43
II
Bus 0 1 0 N/A 0 0 0 N/A 0 0 0 N/A 0 0 0 N/A
d 0.00 9.09 N/A 98.65 0.00 0.00 N/A 100.00 1.59 27.27 N/A 94.59 4.76 0.00 N/A 95.95
Car 50 4 0 7.41 59 1 0 1.67 51 4 0 7.27 49 6 0 10.91
Dataset
Truck 2 10 0 16.67 2 11 0 15.38 3 11 0 21.43 1 12 0 7.69
III
Bus 2 3 0 N/A 0 0 0 N/A 0 1 0 N/A 3 4 0 N/A
7.41 41.18 N/A 84.51 3.28 8.33 N/A 95.89 5.56 26.67 N/A 88.57 7.55 45.45 N/A 81.33
Car 1405 66 17 5.58 1402 55 7 4.23 1410 62 14 5.11 1408 60 6 4.48
Dataset
Truck 9 264 16 8.65 12 284 11 7.49 6 276 15 7.07 8 272 18 8.72
IV
Bus 0 18 27 40.00 0 9 42 17.65 0 10 31 24.39 0 16 36 30.77
0.64 23.26 55.00 93.08 0.85 18.39 30.00 94.84 0.42 20.69 48.33 94.13 0.56 21.84 40.00 94.08
Car 2758 2 1 0.11 2758 2 0 0.07 2758 3 0 0.11 1750 2 0 0.11
Dataset
Truck 0 6 1 14.29 0 6 1 14.29 0 5 2 28.57 6 6 1 14.29
V
Bus 0 0 3 0.00 0 0 4 0.00 0 0 3 0.00 0 0 4 0.00
0.00 25.00 40.00 99.86 0.00 20.00 20.00 99.89 0.00 37.50 40.00 99.82 0.34 25.00 20.00 99.49
Car 494 13 0 2.56 503 11 0 2.14 496 13 0 2.55 481 16 0 3.22
Dataset
Truck 44 72 0 37.93 35 76 0 31.53 39 69 0 36.11 57 71 0 44.53
VI
Bus 0 9 5 64.29 0 7 5 58.33 0 10 5 66.67 0 7 5 58.33
8.18 23.40 0.00 89.64 6.51 19.15 0.00 91.68 7.29 25.00 0.00 90.19 10.59 24.47 0.00 87.44
Car 183 13 0 6.63 282 3 0 1.05 237 6 0 2.47 245 6 0 2.39
Dataset
Truck 17 29 0 36.96 5 61 0 7.58 11 50 0 18.03 13 48 0 21.31
VII
Bus 87 23 4 3.51 0 1 4 20.00 39 9 4 7.69 29 11 4 90.91
36.24 55.38 0.00 60.67 1.74 4.69 0.00 97.47 17.42 23.08 0.00 91.80 14.63 26.15 0.00 90.83
Car 438 20 0 4.37 438 0 0 0.00 438 0 0 0.00 438 0 0 0.00
Dataset
Truck 0 58 0 0.00 0 268 0 0.00 0 263 0 0.00 0 266 0 0.00
VIII
Bus 0 214 0 N/A 0 0 0 N/A 0 5 0 N/A 0 2 0 N/A
0.00 79.86 N/A 67.95 0.00 0.00 N/A 100.00 0.00 1.87 N/A 99.29 0.00 0.75 N/A 99.72
Car 149 2 0 1.32 152 0 0 0.00 151 7 0 4.43 150 8 0 5.06
Dataset
Truck 3 14 0 17.65 0 19 0 0.00 1 12 0 7.69 2 9 0 18.18
IX
Bus 0 3 0 N/A 0 0 0 N/A 0 0 0 N/A 0 2 0 N/A
1.97 26.32 N/A 95.32 0.00 0.00 N/A 100.00 0.66 36.84 N/A 95.32 1.32 52.63 N/A 92.98

5. Discussion
5.1. Datasets Challenges and Advantages
This section addresses the challenges the selected video datasets met in covering an
important portion of the possible scenarios that could arise in traffic flow monitoring. These
various and representative datasets can be used to evaluate cutting-edge deep learning
vehicle detection algorithms more completely. First, illumination changes and shadow,
as the most important challenge met by radiometric cameras due to their sensitivity to
brightness, appear in datasets IV and V (Figure 8a,b). Secondly, a large variety of vehicles,
ranging from private cars with diverse sizes and colors to large heavy vehicles such as
buses, can be found in dataset VIII (Figure 8c) and dataset IV (Figure 8a). These vehicles
can be found in several countries, such as Canada and England. Notably, a range of fields
of view, set by different relations between the cameras and road surfaces (i.e., vertical,
low oblique, high oblique), were considered as they provide different types of contextual
information. For example, more vehicle bodies can be recorded by high oblique cameras
(dataset VIII), while the top parts of vehicles can only be recorded by cameras with a
vertical view (dataset VII) (Figure 8c,d).

Figure 8. Datasets’
Figure 8. Datasets’ challenges;
challenges; (a,b)
(a,b) illumination
illumination and
and shadow;
shadow; (c,d)
(c,d) high
high oblique
oblique and
and low oblique
low oblique
angles of view; (e,f) scale variation of vehicles; (g,h) weather conditions such as foggy and rainy;
angles of view; (e,f) scale variation of vehicles; (g,h) weather conditions such as foggy and rainy; (i)
detection of vehicles that were occluded.
(i) detection of vehicles that were occluded.

5.2. Parameters Sensitivity


Furthermore, the variation in scale of each individual vehicle is another challenge
for the algorithms, which must
This section discusses be robust
various modelsinofthis
eachsituation (Figure in
YOLO version 8e,f). This
terms ofvariation
their com-is
caused bytime
putation the perspectives
and acquiredofclassification
cameras, meaning that The
accuracy. an object
authorsnearofthe
thecamera
YOLOwill show a
structures,
larger scale
unlike SSD orthan more distant
RCNN, releasedobjects.
the fiveThe of {𝑛,
selected
models 𝑠, 𝑚, 𝑙, also
datasets 𝑥}, which
cover different
have beenweather
sorted
based on obtained accuracy and volume of parameters [28]. These models are generated
via trial-and-error. Table 5 displays the model given by YOLOv8, with input values of
image size—640 × 640, mean average precision (mAP), and parameter volume (million).
The parameter volume includes network architecture, activation functions, learning rate,
batch size, regularization techniques, and optimization algorithms. As can be seen, the
Smart Cities 2023, 6 2998

conditions such as rainy and foggy to assess whether the algorithms can still detect vehicles
(Figure 8g,h). Finally, vehicle occlusion, shown in Figure 8i, is another parameter that
was covered by the datasets. The YOLOv7 algorithm seemed to work properly in such
situations of unclear or inaccurate information.

Smart Cities 2023, 6, FOR PEER REVIEW 5.2. Parameters Sensitivity 19


This section discusses the various models of each YOLO version in terms of their computation time and acquired classification accuracy. The authors of the YOLO structures, unlike those of SSD or RCNN, released five models, {n, s, m, l, x}, which have been sorted based on obtained accuracy and volume of parameters [28]. These models are generated via trial-and-error. Table 5 displays the models released for YOLOv8, with an input image size of 640 × 640, their mean average precision (mAP), and their parameter volume (millions). The parameter volume includes the network architecture, activation functions, learning rate, batch size, regularization techniques, and optimization algorithms. As can be seen, YOLOv8n showed the fastest performance on the COCO dataset, but the lowest accuracy in object classification. Although YOLOv8x showed the greatest computation time and the best classification accuracy, we also tested this model on our dataset. We observed that it failed to improve vehicle detection accuracy in comparison to YOLOv8n, and it also has the potential to extract false objects (Figure 9a). Since the real-time monitoring of vehicles is the main purpose of ITS, the {n} model has been used across all YOLO versions. Figure 9b compares the YOLO versions, and we can see that YOLOv8 exhibited superior performance in COCO classification. However, our study demonstrates that YOLOv8 had a lower classification accuracy than YOLOv7 when applied to the highway datasets. To gain a deeper understanding of the parameters mentioned in Table 5 and the methods of calculating them, it is recommended to refer to the paper [58] for further insights.

Figure 9. Parameter sensitivity of YOLO models: (a) outputs of two models of YOLOv8; (b) comparison of YOLO models in terms of parameters' volume and acquired accuracies.


Table 5. Summary of YOLOv8 models.

Model Size (Pixels) mAPval 50–95 Speed CPU ONNX (ms) Speed A100 TensorRT (ms) Params (M) FLOPs (B)
YOLOv8n 640 37.3 80.4 0.99 3.2 8.7
YOLOv8s 640 44.9 128.4 1.20 11.2 28.6
YOLOv8m 640 50.2 234.7 1.83 25.9 78.9
YOLOv8l 640 52.9 375.2 2.39 43.7 165.2
YOLOv8x 640 53.9 479.1 3.53 68.2 257.8
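To make the speed/accuracy trade-off in Table 5 easier to reproduce, the short sketch below times each YOLOv8 model size on a single frame using the Ultralytics Python package. It is a minimal illustration under our own assumptions (the local frame path and the warm-up/timing procedure are ours), not the benchmarking code of this study:

```python
# Minimal sketch (not the study's benchmarking code): time the five YOLOv8
# model sizes of Table 5 on one highway frame with CPU inference.
# Assumes `pip install ultralytics` and a local sample image.
import time

from ultralytics import YOLO

FRAME = "highway_frame.jpg"  # hypothetical frame grabbed from a traffic camera

for size in ("n", "s", "m", "l", "x"):
    model = YOLO(f"yolov8{size}.pt")        # pretrained COCO weights
    model(FRAME, verbose=False)             # warm-up run (loads weights)
    start = time.perf_counter()
    results = model(FRAME, verbose=False)   # timed inference
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"yolov8{size}: {len(results[0].boxes)} detections in {elapsed_ms:.1f} ms")
```

On a CPU, the {n} model should be the fastest of the five, which is consistent with its selection for the real-time comparison above.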

5.3. Algorithms Comparison


This section compares the algorithms in terms of vehicle detection, computation time, localization, and classification. Table 4 displays the accuracy
acquired by the algorithms when applied to the nine selected datasets. As can be seen,
YOLOv7 achieved the best vehicle detection accuracy, with a performance of 98.77%,
while Faster RCNN and SSD showed the weakest performance, at about 50%. Also, the
localization accuracy of Faster RCNN (Figure 10) and its computation time (Figure 7) were
lower than those of the YOLO versions. According to our experimentation, Faster RCNN
and SSD algorithms are unsuitable for use in highway vehicle detection. Except for dataset
I, where YOLOv7 clearly showed the best detection performance, all YOLO versions have
shown an acceptable vehicle detection performance above 90%. This means all YOLO versions work properly in both daytime and nighttime, as shown in dataset IV, reaching an accuracy above 98%. Also, all YOLO algorithms, especially YOLOv7, work properly in the diverse weather conditions presented in datasets I and III, with an accuracy of around 90%. In addition, our series of tests has demonstrated that the camera resolutions, angle
of view (vertical, oblique, high oblique), diversity of vehicles, and even vehicle rear/front
view have not negatively impacted the YOLO results. It is logical to conclude that the
version YOLOv7 is the best vehicle detection and localization model. The recall accuracy of
YOLOv7 was also the highest compared to the other models, while it was slower in terms
of computation time.
One noteworthy observation is the remarkable performance of all YOLO versions,
especially YOLOv7, when applied to nighttime datasets (dataset IV and dataset V). The
results underscore YOLOv7’s exceptional capacity for accurately detecting vehicles dur-
ing low light conditions, with an impressive accuracy exceeding 99%. Following closely,
YOLOv8 also demonstrated a commendable performance, achieving an accuracy rate of ap-
proximately 98%. This outcome showcases the robustness and adaptability of these YOLO
versions, shedding light on their potential applications in scenarios wherein darkness chal-
lenges visibility. These findings validate the efficacy of these models, and emphasize their
relevance to real-world applications wherein nighttime surveillance and object detection
are essential.
Following vehicle detection, a critical parameter for algorithm evaluation is the ac-
curacy of classifying vehicles into car, truck, and bus categories. In this case, the OA
(Equation (11)) parameter in Table 4 again shows that YOLOv7 was the best classifier,
achieving a value of 97.37%, followed by YOLOv6 at 94.24%. Furthermore, the sums of
errors of CE (Equations (12)–(14)) and OE (Equations (15)–(17)) were obtained per class in
Table 4. As can be seen, private cars were more accurately detected, with the lowest errors
of around 10%, while trucks and buses represented the greatest challenge in classification.
The confusion matrix of dataset VIII presented in Table 4 shows the lowest rates of OE
and CE errors. This is because oblique-view cameras capture more contextual information,
which aids the algorithms in achieving superior classification.
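For completeness, the following minimal sketch shows how OA, CE, and OE can be derived from a per-class confusion matrix, assuming the standard definitions (rows as reference labels, columns as predictions); the matrix values are placeholders, not the figures reported in Table 4:

```python
# Sketch of the per-class metrics discussed above, under standard definitions:
# OA = correct / total, CE = false positives / predicted, OE = false negatives / reference.
# The matrix below is a placeholder, not the confusion matrix of Table 4.
import numpy as np

classes = ["car", "truck", "bus"]
cm = np.array([[950,  30,  20],   # rows: reference (ground-truth) class
               [ 40, 180,  15],   # columns: predicted class
               [ 10,  25,  90]])

oa = np.trace(cm) / cm.sum()              # overall accuracy
ce = 1 - np.diag(cm) / cm.sum(axis=0)     # commission error per predicted class
oe = 1 - np.diag(cm) / cm.sum(axis=1)     # omission error per reference class

print(f"OA = {oa:.2%}")
for k, name in enumerate(classes):
    print(f"{name}: CE = {ce[k]:.2%}, OE = {oe[k]:.2%}")
```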

Figure 10. Localization accuracy of (a) YOLO and (b) Faster RCNN algorithms.
5.4. Comparison with Previous Studies
We here introduce a novel method for comparing state-of-the-art vehicle detection algorithms. This comparison process makes use of challenging highway video datasets with various angles of view. These challenging datasets have not been addressed in previous studies, such as the one by Kim and Sung [30], which conducted a similar comparison between RCNN, SSD, and YOLO. They did not evaluate the algorithms in different illumination contexts, such as nighttime, or under different weather conditions. Also, they customized the weights of each algorithm on their training data, which is time-consuming. Indeed, needing no additional training data is one benefit of our work. We suggest that future researchers use the primary model of each released algorithm for vehicle detection, without using any training data. Likewise, Song and Liang [59] used thousands of training samples to customize the weights of YOLO.

Similarly, Zhang and Hu [37] tried to enhance the performance of SSD for vehicle detection at night. Despite improving the SSD algorithm and achieving better detection and classification accuracy, its acquired accuracies (around 89%) are still lower than those of the
YOLO versions (at about 98%). Neupane and Horanont [60] used the models produced by
the YOLO versions as the base for transfer learning when enhancing training data, similar to the previous works. In this case, they did not consider various cameras with different resolutions, or even illumination changes. Also, the enhanced YOLOs were not assessed in both nighttime and daytime. A couple of studies on vehicle on-board cameras have been published in the context of the evaluation of deep learning in vehicle detection [35,36]. Since these cameras have completely different structures and fields of view from highway ones, this
is not an effective way to compare the algorithms.
In conclusion, there is no need for additional training data to enhance the performance
of YOLO versions in vehicle detection. The released versions of YOLO work effectively
in vehicle detection and classification, without any considerable errors in localization.
Heavy trucks are detected more accurately when the camera’s angle of view is oblique,
while private cars are detectable with precision from any direction of view. Notably, the algorithms can run in real time if a GPU processor is used.
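As a concrete illustration of this "no additional training" workflow, the sketch below runs an off-the-shelf COCO-pretrained YOLO model and keeps only the three vehicle classes compared in this study; the Ultralytics call and the COCO indices (car = 2, bus = 5, truck = 7) are our assumptions for the example, not the exact pipeline used here:

```python
# Sketch: vehicle detection with a pretrained COCO model and no re-training.
# COCO class indices 2, 5 and 7 correspond to car, bus and truck.
# File names are hypothetical; this is not the study's exact pipeline.
from ultralytics import YOLO

VEHICLE_CLASSES = [2, 5, 7]            # car, bus, truck
model = YOLO("yolov8n.pt")             # lightweight {n} model, COCO weights

results = model.predict(
    source="highway_clip.mp4",         # hypothetical highway camera clip
    classes=VEHICLE_CLASSES,           # discard all non-vehicle COCO classes
    conf=0.25,                         # confidence threshold
    stream=True,                       # iterate frame by frame to save memory
    verbose=False,
)

for frame_idx, r in enumerate(results):
    labels = [model.names[int(c)] for c in r.boxes.cls]
    print(f"frame {frame_idx}: {len(labels)} vehicles -> {labels}")
```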

6. Conclusions and Future Works


In this paper, we have compared state-of-the-art deep learning algorithms, such as
SSD, RCNN, and different versions of YOLO, for vehicle detection. These deep learning
structures have been trained and tested on thousands of images acquired from highway
cameras and contained in the COCO library. Nine video cameras facing potential challenges
were selected for a fair and general comparison, covering a large spectrum of vehicle
positions and shapes. These challenging datasets cover numerous angles of view between
the camera and road, with different qualities of video (from both day and night) and
variations in the scale of vehicles. The YOLO versions, particularly YOLOv7, achieved
the best detection and localization accuracy, and the most accurate vehicle classification
results for cars, trucks, and buses. In addition, the computation time of the YOLOs was on the order of a few hundred milliseconds per frame when using a CPU processor. This means the running time will be near real-time if a GPU and sufficient RAM are used in addition to a CPU processor. With an
accuracy in vehicle detection of about 98%, the YOLO versions can generally be used for
ITS purposes such as real-time traffic flow monitoring.
In future research, it is strongly recommended to further evaluate deep learning archi-
tectures by application to Unmanned Aerial Vehicle (UAV) videos, which can encompass
larger road areas. This will address the significant lack of accuracy in the classification of
heavy vehicles, such as buses, which should be a priority. Furthermore, investigating the
potential for the accurate localization and tracking of detected vehicles is crucial. These
enhancements are anticipated to yield more realistic data for road traffic simulators.

Author Contributions: Conceptualization, D.S. and C.L.; methodology, D.S., C.L. and S.H.; software,
D.S., C.L. and S.H.; validation, C.L. and S.H.; formal analysis, C.L. and S.H.; investigation, D.S.;
resources, D.S.; data curation, D.S.; writing—original draft preparation, D.S.; writing—review and
editing, C.L. and S.H.; visualization, D.S.; supervision, C.L. and S.H.; project administration, C.L.;
funding acquisition, C.L. and S.H. All authors have read and agreed to the published version of
the manuscript.
Funding: This research was funded by Mitacs grant number IT30935 and Semaphor.ai.
Data Availability Statement: Data sharing is not applicable to this paper.
Acknowledgments: The authors would like to thank all the individuals and organizations who
made these datasets and algorithms available. In particular, we want to express our sincere appreci-
ation and gratitude to the Semaphor.ai team and Mitacs for their funding support in making this
project possible.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Lv, Z.; Shang, W. Impacts of intelligent transportation systems on energy conservation and emission reduction of transport
systems: A comprehensive review. Green Technol. Sustain. 2023, 1, 100002. [CrossRef]
2. Pompigna, A.; Mauro, R. Smart roads: A state of the art of highways innovations in the Smart Age. Eng. Sci. Technol. Int. J. 2022,
25, 100986. [CrossRef]
3. Regragui, Y.; Moussa, N. A real-time path planning for reducing vehicles traveling time in cooperative-intelligent transportation
systems. Simul. Model. Pract. Theory 2023, 123, 102710. [CrossRef]
4. Wu, Y.; Wu, L.; Cai, H. A deep learning approach to secure vehicle to road side unit communications in intelligent transportation
system. Comput. Electr. Eng. 2023, 105, 108542. [CrossRef]
5. Zuo, J.; Dong, L.; Yang, F.; Guo, Z.; Wang, T.; Zuo, L. Energy harvesting solutions for railway transportation: A comprehensive
review. Renew. Energy 2023, 202, 56–87. [CrossRef]
6. Yang, Z.; Peng, J.; Wu, L.; Ma, C.; Zou, C.; Wei, N.; Zhang, Y.; Liu, Y.; Andre, M.; Li, D.; et al. Speed-guided intelligent
transportation system helps achieve low-carbon and green traffic: Evidence from real-world measurements. J. Clean. Prod. 2020,
268, 122230. [CrossRef]
7. Chen, Z.; Guo, H.; Yang, J.; Jiao, H.; Feng, Z.; Chen, L.; Gao, T. Fast vehicle detection algorithm in traffic scene based on improved
SSD. Measurement 2022, 201, 111655. [CrossRef]
8. Ribeiro, D.A.; Melgarejo, D.C.; Saadi, M.; Rosa, R.L.; Rodríguez, D.Z. A novel deep deterministic policy gradient model applied to
intelligent transportation system security problems in 5G and 6G network scenarios. Phys. Commun. 2023, 56, 101938. [CrossRef]
9. Sirohi, D.; Kumar, N.; Rana, P.S. Convolutional neural networks for 5G-enabled Intelligent Transportation System: A systematic
review. Comput. Commun. 2020, 153, 459–498. [CrossRef]
10. Lackner, T.; Hermann, J.; Dietrich, F.; Kuhn, C.; Angos, M.; Jooste, J.L.; Palm, D. Measurement and comparison of data rate
and time delay of end-devices in licensed sub-6 GHz 5G standalone non-public networks. Procedia CIRP 2022, 107, 1132–1137.
[CrossRef]
11. Wang, Y.; Cao, G.; Pan, L. Multiple-GPU accelerated high-order gas-kinetic scheme for direct numerical simulation of compressible
turbulence. J. Comput. Phys. 2023, 476, 111899. [CrossRef]
12. Sharma, H.; Kumar, N. Deep learning based physical layer security for terrestrial communications in 5G and beyond networks:
A survey. Phys. Commun. 2023, 57, 102002. [CrossRef]
13. Ounoughi, C.; Ben Yahia, S. Data fusion for ITS: A systematic literature review. Inf. Fusion 2023, 89, 267–291. [CrossRef]
14. Afat, S.; Herrmann, J.; Almansour, H.; Benkert, T.; Weiland, E.; Hölldobler, T.; Nikolaou, K.; Gassenmaier, S. Acquisition time
reduction of diffusion-weighted liver imaging using deep learning image reconstruction. Diagn. Interv. Imaging 2023, 104, 178–184.
[CrossRef] [PubMed]
15. Xu, M.; Yoon, S.; Fuentes, A.; Park, D.S. A Comprehensive Survey of Image Augmentation Techniques for Deep Learning. Pattern
Recognit. 2023, 137, 109347. [CrossRef]
16. Zhou, Y.; Ji, A.; Zhang, L.; Xue, X. Sampling-attention deep learning network with transfer learning for large-scale urban point
cloud semantic segmentation. Eng. Appl. Artif. Intell. 2023, 117, 105554. [CrossRef]
17. Yu, C.; Zhang, Z.; Li, H.; Sun, J.; Xu, Z. Meta-learning-based adversarial training for deep 3D face recognition on point clouds.
Pattern Recognit. 2023, 134, 109065. [CrossRef]
18. Kim, C.; Ahn, S.; Chae, K.; Hooker, J.; Rogachev, G. Noise signal identification in time projection chamber data using deep
learning model. Nucl. Instrum. Methods Phys. Res. Sect. A Accel. Spectrometers Detect. Assoc. Equip. 2023, 1048, 168025. [CrossRef]
19. Zhang, X.; Zhai, D.; Li, T.; Zhou, Y.; Lin, Y. Image inpainting based on deep learning: A review. Inf. Fusion 2023, 90, 74–94.
[CrossRef]
20. Mo, W.; Zhang, W.; Wei, H.; Cao, R.; Ke, Y.; Luo, Y. PVDet: Towards pedestrian and vehicle detection on gigapixel-level images.
Eng. Appl. Artif. Intell. 2023, 118, 105705. [CrossRef]
21. Bie, M.; Liu, Y.; Li, G.; Hong, J.; Li, J. Real-time vehicle detection algorithm based on a lightweight You-Only-Look-Once
(YOLOv5n-L) approach. Expert Syst. Appl. 2023, 213, 119108. [CrossRef]
22. Liang, Z.; Huang, Y.; Liu, Z. Efficient graph attentional network for 3D object detection from Frustum-based LiDAR point clouds.
J. Vis. Commun. Image Represent. 2022, 89, 103667. [CrossRef]
23. Tian, Y.; Guan, W.; Li, G.; Mehran, K.; Tian, J.; Xiang, L. A review on foreign object detection for magnetic coupling-based electric
vehicle wireless charging. Green Energy Intell. Transp. 2022, 1, 100007. [CrossRef]
24. Yang, Z.; Pun-Cheng, L.S. Vehicle detection in intelligent transportation systems and its applications under varying environments:
A review. Image Vis. Comput. 2018, 69, 143–154. [CrossRef]
25. Wang, Z.; Ma, Y.; Zhang, Y. Review of pixel-level remote sensing image fusion based on deep learning. Inf. Fusion 2023, 90, 36–58.
[CrossRef]
26. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision
and Pattern Recogniti; Springer International Publishing: Cham, Switzerland, 2016.
27. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-Based Convolutional Networks for Accurate Object Detection and
Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 142–158. [CrossRef]
28. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo Algorithm Developments. Procedia Comput. Sci. 2022, 199, 1066–1073.
[CrossRef]

29. Ramachandran, A.; Sangaiah, A.K. A review on object detection in unmanned aerial vehicle surveillance. Int. J. Cogn. Comput.
Eng. 2021, 2, 215–228. [CrossRef]
30. Kim, J.A.; Sung, J.Y.; Park, S.H. Comparison of Faster-RCNN, YOLO, and SSD for Real-Time Vehicle Type Recognition. In
Proceedings of the 2020 IEEE International Conference on Consumer Electronics—Asia (ICCE-Asia), Seoul, Republic of Korea,
1–3 November 2020.
31. Qiu, Q.; Lau, D. Real-time detection of cracks in tiled sidewalks using YOLO-based method applied to unmanned aerial vehicle
(UAV) images. Autom. Constr. 2023, 147, 104745. [CrossRef]
32. Dang, F.; Chen, D.; Lu, Y.; Li, Z. YOLOWeeds: A novel benchmark of YOLO object detectors for multi-class weed detection in
cotton production systems. Comput. Electron. Agric. 2023, 205, 107655. [CrossRef]
33. Li, M.; Zhang, Z.; Lei, L.; Wang, X.; Guo, X. Agricultural Greenhouses Detection in High-Resolution Satellite Images Based on
Convolutional Neural Networks: Comparison of Faster R-CNN, YOLO v3 and SSD. Sensors 2020, 20, 4938. [CrossRef] [PubMed]
34. Azimjonov, J.; Özmen, A. A real-time vehicle detection and a novel vehicle tracking systems for estimating and monitoring traffic
flow on highways. Adv. Eng. Inform. 2021, 50, 101393. [CrossRef]
35. Han, X.; Chang, J.; Wang, K. Real-time object detection based on YOLO-v2 for tiny vehicle object. Procedia Comput. Sci. 2021,
183, 61–72. [CrossRef]
36. Tao, C.; He, H.; Xu, F.; Cao, J. Stereo priori RCNN based car detection on point level for autonomous driving. Knowl. -Based Syst.
2021, 229, 107346. [CrossRef]
37. Zhang, Q.; Hu, X.; Yue, Y.; Gu, Y.; Sun, Y. Multi-object detection at night for traffic investigations based on improved SSD
framework. Heliyon 2022, 8, e11570. [CrossRef]
38. Shawon, A. Road Traffic Video Monitoring. 2020. Available online: https://www.kaggle.com/datasets/shawon10/road-traffic-video-monitoring?select=traffic_detection.mp4 (accessed on 1 January 2021).
39. Shah, A. Highway Traffic Videos Dataset. 2020. Available online: https://www.kaggle.com/datasets/aryashah2k/highway-traffic-videos-dataset (accessed on 1 March 2020).
40. Saha, S. A Comprehensive Guide to Convolutional Neural Networks—The ELI5 Way. 2018. Available online: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53 (accessed on 15 December 2018).
41. Ding, J.; Li, X.; Kang, X.; Gudivada, V.N. Augmentation and evaluation of training data for deep learning. In Proceedings of the
2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 11–14 December 2017.
42. Phuong, T.M.; Diep, N.N. Speeding Up Convolutional Object Detection for Traffic Surveillance Videos. In Proceedings of the 2018
10th International Conference on Knowledge and Systems Engineering (KSE), Ho Chi Minh City, Vietnam, 1–3 November 2018.
43. Tian, Y.; Su, D.; Lauria, S.; Liu, X. Recent advances on loss functions in deep learning for computer vision. Neurocomputing 2022,
497, 129–158. [CrossRef]
44. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
45. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
46. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
47. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
48. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
49. Jocher, G. Yolov5. Code Repository. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 1 July 2020).
50. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection
framework for industrial applications. arXiv 2022, arXiv:2209.02976.
51. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object
detectors. arXiv 2022, arXiv:2207.02696.
52. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13
December 2015.
53. Chen, X.; Gupta, A. An implementation of faster rcnn with study for region sampling. arXiv 2017, arXiv:1702.02138.
54. Powers, D.M. Evaluation: From precision, Recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2020,
arXiv:2010.16061.
55. Vostrikov, A.; Chernyshev, S. Training sample generation software. In Intelligent Decision Technologies 2019, Proceedings of the
11th KES International Conference on Intelligent Decision Technologies (KES-IDT 2019), St. Julians, Malta, 17–19 June 2019; Springer:
Berlin/Heidelberg, Germany, 2019; Volume 2.
56. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in
context. In Computer Vision–ECCV 2014, Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014, Part V
13; Springer: Berlin/Heidelberg, Germany, 2014.
57. Bathija, A.; Sharma, G. Visual object detection and tracking using Yolo and sort. Int. J. Eng. Res. Technol. 2019, 8, 705–708.
58. Terven, J.; Cordova-Esparza, D. A comprehensive review of YOLO: From YOLOv1 to YOLOv8 and beyond. arXiv 2023,
arXiv:2304.00501.

59. Song, H.; Liang, H.; Li, H.; Dai, Z.; Yun, X. Vision-based vehicle detection and counting system using deep learning in highway
scenes. Eur. Transp. Res. Rev. 2019, 11, 51. [CrossRef]
60. Neupane, B.; Horanont, T.; Aryal, J. Real-Time Vehicle Classification and Tracking Using a Transfer Learning-Improved Deep
Learning Network. Sensors 2022, 22, 3813. [CrossRef] [PubMed]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
