
Received 22 August 2023, accepted 12 September 2023, date of publication 18 September 2023, date of current version 28 September 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3316618

Optimization Algorithm to Reduce Training Time for Deep Learning Computer Vision Algorithms Using Large Image Datasets With Tiny Objects

SERGIO BEMPOSTA ROSENDE 1, JAVIER FERNÁNDEZ-ANDRÉS 2, AND JAVIER SÁNCHEZ-SORIANO 3

1 Department of Science, Computing and Technology, Universidad Europea de Madrid, Villaviciosa de Odón, 28670 Madrid, Spain
2 Department of Industrial and Aerospace Engineering, Universidad Europea de Madrid, Villaviciosa de Odón, 28670 Madrid, Spain
3 Escuela Politécnica Superior, Universidad Francisco de Vitoria, Pozuelo de Alarcón, 28223 Madrid, Spain
Corresponding author: Javier Sánchez-Soriano (javier.sanchez@ufv.es)
This work was supported in part by the I+D+i Projects funded by Ministerio de Ciencia e Innovación, Agencia Estatal de Investigación 10.13039/501100011033, under Grant PID2019-104793RB-C32, Grant PIDC2021-121517-C33, and Grant PDC2022-133684-C33.
The associate editor coordinating the review of this manuscript and approving it for publication was Gustavo Olague.

ABSTRACT The optimization of convolutional neural networks (CNN) generally refers to the improvement
of the inference process, making it as fast and precise as possible. While inference time is an essential factor
in using these networks in real time, the training of CNNs using very large datasets can be costly in terms
of time and computing power. This study proposes a technique to reduce the training time by an average of
75% without altering the results of CNN training with an algorithm which partitions the dataset and discards
superfluous objects (targets). This algorithm is a tool that pre-processes the original dataset, generating a
smaller and more condensed dataset to be used for network training. The effectiveness of this tool depends on
the type of dataset used for training the CNN and is particularly effective with sequential images (video), large
images and images with tiny targets generally from drones or traffic surveillance cameras (but applicable
to any other type of image which meets the requirements). The tool can be parameterized to meet the
characteristics of the initial dataset.

INDEX TERMS Computer vision, dataset, deep learning, training optimization, OpenCV, YOLO.

I. INTRODUCTION
Cameras and video technology are continuously improving, and it is increasingly common to find images in FullHD, 2K, 4K or even 8K used as input for training convolutional neural networks (CNN) [1]. Computing capacity has also increased significantly [2], and a great deal of effort is being made to develop hardware with the capacity to run neural networks in real time [3]. This hardware is becoming increasingly compact, efficient, and affordable, enabling embedded or distributed training systems for the construction of distributed object detection and surveillance systems [4], [5], [6].
Limited progress has been made, however, in CNN training [7]. While neural networks are, in theory, trained only once and then later depend on inference, the fact is that neural networks are continuously being retrained, either with new datasets or with modifications in the parameters of the training algorithms.
Given the current size of images [8], and the need for increasingly exact or precise detection of objects within these images, training times are growing [9] as classic methods of optimizing training become less effective [10].
There are two commonly used methods to reduce training times for deep neural networks:
1. Image size reduction [11]. This is an effective method if the objects to be detected or classified occupy a sufficiently large part of the total image so that, even when the image is reduced, these objects still provide sufficient information for the training algorithm [9], [13].
2. Partition of the original image into a mosaic of images [7], [14], [15]. This method reduces the size of the image, dividing it into several parts of a predefined size (usually 3×3 or 4×4) with equal dimensions (length and width) to maintain the same proportions as the original image.
Both methods reduce the size of the images, which can then be processed using more modest hardware, particularly when memory is the principal limitation to processing large images. Both methods, however, have certain drawbacks:
• Image size reduction [16]. If objects are small, the loss of resolution may mean these objects become undetectable.
• Partition of the original image into a mosaic of images. The image being processed may be smaller, but there are more images to process. Additionally, objects may be cut between two images. Superimposing the regions is a way to minimize this, although it does not solve the problem, as the area of superimposition must be very large, resulting in an even greater reduction of the object and reducing the effectiveness of this solution (a minimal tiling sketch follows this list).
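As a point of reference for the second method, a minimal tiling sketch (our illustration, not part of the paper's tool; the overlap strategy discussed in the bullet above is omitted for brevity):

```python
def partition_mosaic(image, k=3):
    """Split an image (H x W x C numpy array) into a k x k mosaic of equally
    sized tiles, the classic partitioning approach described above. Pixels in
    the remainder rows/columns (when H or W is not divisible by k) are dropped."""
    img_h, img_w = image.shape[:2]
    tile_h, tile_w = img_h // k, img_w // k
    return [image[r * tile_h:(r + 1) * tile_h, c * tile_w:(c + 1) * tile_w]
            for r in range(k) for c in range(k)]
```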
In this study we propose a method to optimize training times without the losses indicated above. This method was validated in a case study using traffic images captured by drone. This involved a handicap because the objects of interest were very small compared with the total size of the image. Thus, a solution based on reducing the original image was ruled out. For example, the size of a car or pedestrian in an image taken by a drone at a height of 50 meters may be approximately 20 × 20 pixels. If we reduce the image to a size that can be processed by a PyTorch or TensorFlow type network, that is, up to 640 × 640 pixels, we are reducing the image to one-fifth, and the objects will be too small to be accurately detected by the neural networks. Although YOLO can theoretically be trained using targets as small as 2 × 2 pixels [17], our tests with targets smaller than 16 × 16 pixels had a very low degree of precision.
In this study we describe the method used to significantly reduce processing times without diminishing the effectiveness of the trained network.

II. TRAINING OPTIMIZATION ALGORITHM
This algorithm is designed to pre-process the labelled images of a dataset prior to their use in the habitual training process for a deep neural network. The dataset must be labelled using the format of a YOLO type network [18]. Thus, the input of this algorithm is one dataset, and the output is another dataset constructed from the original images but optimized for training (also in YOLO format). For datasets other than those of the YOLO type, the labels can be translated for use in other formats. For this reason, the method is replicable and extendible to other dataset formats.
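For reference, the YOLO labelling convention assumed throughout stores one text file per image, with one line per object: a class index followed by the bounding box center and size, normalized to the image dimensions. A minimal Python sketch of reading and writing this format (the helper names are ours, not part of the published tool):

```python
from pathlib import Path

def read_yolo_labels(label_path):
    """Parse a YOLO label file: one 'cls cx cy w h' line per object, with the
    center coordinates and box size normalized to the image dimensions."""
    objects = []
    for line in Path(label_path).read_text().splitlines():
        if not line.strip():
            continue
        cls, cx, cy, w, h = line.split()
        objects.append((int(cls), float(cx), float(cy), float(w), float(h)))
    return objects

def write_yolo_labels(label_path, objects):
    """Write objects back in the same normalized YOLO format."""
    lines = [f"{c} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}" for c, cx, cy, w, h in objects]
    Path(label_path).write_text("\n".join(lines) + "\n")
```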
A. TERMS USED IN ALGORITHM DEFINITION
• Target (Object) or BoundingBox: A labelled element in the image that the neural network should detect. This may be any of the types of object that the future neural network will detect by inference.
• Selected object: An object labelled in the image that has been selected as input for the neural network. This object was chosen to be part of the set of objects used to train the neural network.
• Discarded object: An object labelled to be discarded as input to the neural network. This object may be duplicated, cut, etc., and is discarded for training purposes.
• Cropped region: A portion of the image surrounding a ''selected object''. The size of this region is a configurable parameter of the algorithm. The region is the piece of the image inputted into the neural network for training, in which there is at least one selected object.
• Key image: An image to which the object discarding process is not applied. This is established every ''N'' images; this ''N'' parameter is configurable in the algorithm.
The difference between an ''object'' and a ''selected object'' is that not all marked objects in the image to be recognized are part of the input for training the neural network. Of all the labelled objects, only a subset per image will be part of the input of the neural network, the rest being discarded.

B. ALGORITHM
This algorithm, as opposed to the methods described in the bibliography, consists of two phases:
1. Discarding of objects and reduction of the training set.
2. Cropping of the training regions and relabelling of objects.
To clarify, we will use a training dataset taken from high-definition videos or consecutive images taken at short time intervals. In either case, these images are taken from a great distance, and each image contains various marked targets for training that are very small considering the total size of the image. Each of these images is inputted into the algorithm in the same order they were taken by the camera (see the flow chart of all the steps of the algorithm in Figure 1).
In the first phase, that of discarding, all targets are checked against the objects in the previous image. The first image of the dataset is considered a ''key'' image, so no target is discarded and this phase is omitted. If these targets show relatively little movement compared to the previous image, they will not be selectable objects and will be discarded. This parameter, the ''relatively small distance'', is configurable. The values which give the best results are 1% to 3% of the total image; in 2K or 4K resolution images this is approximately 5 to 15 pixels. The principal factors affecting the selection of this parameter are:
• Type of recorded scene. From very static scenes to scenes with lots of movement. The more the objects move, the greater the discrimination distance.

104594 VOLUME 11, 2023


S. B. Rosende et al.: Optimization Algorithm to Reduce Training Time

• Number of Frames Per Second (FPS) at input. When the region does not extend beyond the limits of the image,
the sequence of images is very close in time objects have maintaining the same proportions and size.
a smaller displacement between frames. The higher the Each of the cropped regions is checked for other objects,
FPS the less the discrimination distance. including those discarded in the first phase. For each of
• Rotation of objects. If the target objects in the image the objects identified within the region one of the following
move in rotation, that is, around a central axis within the options is applied:
BoundingBox rather than moving across the image, this The Object Is Entirely Within the Cropped Region: This
may cause a loss of the object for training. In this case, object is labelled to be part of the training. If the object is
pre-processing is simply not recommended. selectable, that is, not discarded in the first phase, it will now
be marked as ‘‘not selectable’’ as it is now part of a training
region.

FIGURE 2. Example of the pre-processor in operation. Using the original


image, two regions or sub-images are generated (green squares),
centered on the red target objects. The blue targets are ROI included for
training which do not generate their own sub-image.

The Object Is Partially Within the Cropped Region:


• If more than 50% of the object is within the region (this
value is configurable and set at 50% for the training pro-
cess), it is labelled but not marked as ‘‘not selectable’’
FIGURE 1. Flow chart diagram of the functioning of the algorithm. and will continue to open to creating its own training
region.
In the case studies, most of the recorded images are scenes • If less than 50% of the object is within the region,
from highways with an average of 2 FPS, although there the object is not labelled and is deleted (for example,
are some images of agglomerations of people in pedestrian by blurring the image through Gaussian elimination) and
streets or at sports events with a resolution of 1920 × 1080. it is not marked as ‘‘not selectable’’. In order not to
In these cases, the parameter was configured at 15 pixels of pollute the training process, these images are blurred
displacement. rather than eliminated (painted a background color) to
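As an illustration of this first phase, the sketch below renders the discard rule in Python (the published tool itself is a C++/OpenCV program; the function and parameter names here are ours). Targets are compared against the previous image and dropped when their displacement stays under the configured threshold, except on key images:

```python
import math

def discard_static_targets(frames, max_shift=0.02, key_every=7):
    """Phase 1 sketch. frames is a list of per-image label lists, each label a
    (cls, cx, cy, w, h) tuple in normalized coordinates, in camera order.
    Targets that moved less than max_shift (a fraction of the image, here 2%,
    within the 1%-3% band reported above) since the previous image are
    discarded, except on key images, which keep every target."""
    selected_per_frame = []
    previous = []
    for index, labels in enumerate(frames):
        if index % key_every == 0:
            # Key image: the discarding process is not applied.
            selected = list(labels)
        else:
            selected = []
            for cls, cx, cy, w, h in labels:
                # Displacement to the nearest same-class target in the previous image.
                nearest = min(
                    (math.hypot(cx - pcx, cy - pcy)
                     for pcls, pcx, pcy, _pw, _ph in previous if pcls == cls),
                    default=float("inf"),
                )
                if nearest >= max_shift:
                    selected.append((cls, cx, cy, w, h))  # moved enough: selectable
        selected_per_frame.append(selected)
        previous = labels
    return selected_per_frame
```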
The second phase uses the set of objects that have not been discarded and so are selectable objects. Each of the selectable objects is delimited by a cropped region that is labelled in the image for training purposes. This region is configurable in terms of size and position, but all regions are the same size, with the same ratio or proportion as the original image. The size of the region will depend on the grouping of the objects used for training as well as the size of the image. Larger regions encompass more space within the image, thus reducing the total number of regions but also increasing the computational cost of training. Furthermore, it is important that the region is sufficiently large for the selected object to be contained entirely within it. The cropped region must have the same length-width proportions as the images used for training and later for inference; this is a critical factor in the effectiveness of convolutional neural networks. It is also important that the region does not extend beyond the limits of the image, maintaining the same proportions and size.
Each of the cropped regions is checked for other objects, including those discarded in the first phase. For each of the objects identified within the region, one of the following options is applied:
The Object Is Entirely Within the Cropped Region: This object is labelled to be part of the training. If the object is selectable, that is, not discarded in the first phase, it will now be marked as ''not selectable'' as it is now part of a training region.
The Object Is Partially Within the Cropped Region:
• If more than 50% of the object is within the region (this value is configurable and was set at 50% for the training process), it is labelled but not marked as ''not selectable'', and will remain open to creating its own training region.
• If less than 50% of the object is within the region, the object is not labelled and is deleted (for example, by blurring the image with a Gaussian filter), and it is not marked as ''not selectable''. In order not to pollute the training process, these objects are blurred rather than eliminated (painted a background color), to prevent the network from learning that a specific color (the background color) has any specific utility and incorporating it into its training criteria.
• If an object has been deleted, the labelled and selectable objects will be rechecked to verify that the deletion of a specific object has not led to the elimination of any complete objects whose areas of interest (BoundingBoxes) overlap. If this is the case, the image is restored in this area to ensure the selectable object to be used in the training is complete.
FIGURE 2. Example of the pre-processor in operation. Using the original image, two regions or sub-images are generated (green squares), centered on the red target objects. The blue targets are ROI included for training which do not generate their own sub-image.
Figure 2 shows how, from the original image, two regions or sub-images are generated (green boxes) with the training targets marked in red. These targets generate regions or sub-images; the blue targets are ROI included for training which do not generate their own regions or sub-images. Figure 3 shows how, from an original, full-sized image, four regions or sub-images are generated (green boxes) containing the training targets. In this example we see how one of the labelled targets is partially blurred, marked in white, because it was marked as discardable in one of the regions or sub-images, as there is less than a 50% overlap.
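The following sketch illustrates this second phase under the same assumptions (an illustrative Python/OpenCV rendering, not the authors' implementation): a region is centered on a selected object and clamped to the image borders, labels are recomputed relative to the crop, and objects with less than half their area inside are blurred out:

```python
import cv2

def crop_training_region(image, labels, center, region_w=640, region_h=360,
                         min_inside=0.5):
    """Phase 2 sketch. image is an H x W x C array; labels are (cls, cx, cy, w, h)
    tuples in normalized image coordinates; center is the (cx, cy) of a selected
    object. Assumes the region is smaller than the image. Returns the cropped
    region and its relabelled objects in normalized region coordinates."""
    img_h, img_w = image.shape[:2]
    # Clamp the region so it never extends beyond the limits of the image.
    x0 = min(max(int(center[0] * img_w - region_w / 2), 0), img_w - region_w)
    y0 = min(max(int(center[1] * img_h - region_h / 2), 0), img_h - region_h)
    region = image[y0:y0 + region_h, x0:x0 + region_w].copy()
    new_labels = []
    for cls, cx, cy, w, h in labels:
        # Object box in pixels and its intersection with the region.
        bx0, by0 = (cx - w / 2) * img_w, (cy - h / 2) * img_h
        bx1, by1 = (cx + w / 2) * img_w, (cy + h / 2) * img_h
        ix0, iy0 = max(bx0, x0), max(by0, y0)
        ix1, iy1 = min(bx1, x0 + region_w), min(by1, y0 + region_h)
        inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
        area = (bx1 - bx0) * (by1 - by0)
        if area > 0 and inter / area >= min_inside:
            # Mostly inside: keep, relabelled relative to the crop.
            new_labels.append((cls,
                               ((ix0 + ix1) / 2 - x0) / region_w,
                               ((iy0 + iy1) / 2 - y0) / region_h,
                               (ix1 - ix0) / region_w,
                               (iy1 - iy0) / region_h))
        elif inter > 0:
            # Mostly outside: blur the visible remnant instead of painting it,
            # so the network cannot latch onto a flat background color.
            rx0, ry0 = int(ix0 - x0), int(iy0 - y0)
            rx1, ry1 = int(ix1 - x0), int(iy1 - y0)
            if rx1 > rx0 and ry1 > ry0:
                region[ry0:ry1, rx0:rx1] = cv2.GaussianBlur(
                    region[ry0:ry1, rx0:rx1], (31, 31), 0)
    return region, new_labels
```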

FIGURE 3. Application with blurred targets (white box).
By creating a region centered on a selected object, the network will always be trained with a central labelled object. This may cause the network to learn to always expect to detect objects in the center of the region or image. This will not be a problem for this project, given that we have used YOLO as a CNN, which initially divides the image into sections (by default, into 7 × 7 sections) [16], and each section has the same probability of containing an object regardless of its position [19], [20].
It is important to note that this method is not ideal for all situations or for all datasets. For this article, we conducted tests using three different datasets, all of them public and verifiable, which allowed us to determine which factors are most beneficial for this algorithm. From these case studies it was determined that there is no optimum configuration for all the configurable parameters of the application. The complete set of images of the dataset determines the configuration and effectiveness of the algorithm. The key factors which determine the effectiveness of the algorithm (as shown in Figure 4 and Figure 5) are:
• Large images, such as FullHD, 2K, 4K or even larger, with small objects or ''targets'' to detect considering the size of the image. Examples may be images taken from a certain distance where the elements to detect are distant.
• Images taken at short intervals, that is, video images. It is not necessary that the time interval between images be very short (images at 1 FPS are optimum), but they must be sequential and taken from a relatively static camera.
• Few objects within the image, or objects that are not evenly distributed within the image; that is, objects should be grouped in zones. These images will have large areas with no objects to detect, and these areas can thus be eliminated from the training process.
• Static objects of interest in the image. It is not necessary that static objects predominate in the scene, only that there be static objects of interest. This is a factor that reduces the set of images for training, thus reducing the size of the dataset and making the process faster.
FIGURE 4. Frame 1 of a sequence of two video frames with static (blue) and moving (red) objects of interest. There is a clear non-uniformity in the density of objects in the image.
FIGURE 5. Frame 2 of a sequence of two video frames with static (blue) and moving (red) objects of interest. There is a clear non-uniformity in the density of objects in the image.
In summary, the types of datasets which best match these factors are those consisting of video images from drones or high-resolution static cameras. In this type of dataset, the images are chronological and usually have high resolution. Examples are drone videos observing beaches, roadways, parks, large agglomerations of people or animals, etc. Other examples include security or surveillance cameras on highways, streets, buildings, etc., where there are many objects of interest distributed in specific areas of the image, such as cars on a highway, or doorways and entrances for surveillance cameras. These images also generally contain static objects of interest, such as people lying on the beach or parked cars on a city street.
In this case, we used the YOLO neural network, which has a series of limitations that make it ideal for the pre-processing algorithm. YOLO divides the image into regions for analysis, and each region is assigned a maximum number of objects [19], [20]. Thus, YOLO is limited to a specific number of objects per region. By dividing the image around groups of objects, these are distributed within the new image, permitting a greater number of detections given that there are objects within each of the regions created by YOLO.
It is important to note that this limitation is not critical, but it is a factor to be considered, since the number of objects per region is limited in the YOLO network [16], [17], [18]. However, the higher the number of objects, the slower the process becomes, with greater memory consumption. This is a generic factor for all the regions into which YOLO divides the original image, and so the number of objects must be adjusted for the region with the most objects rather than the average number.
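One way to pick that limit is to histogram the labelled objects over the S × S grid that YOLO overlays on each input and size the limit for the densest cell. A small illustrative helper (our own sketch, reusing the normalized label tuples from the earlier snippets):

```python
from collections import Counter

def max_objects_per_cell(labels, grid=7):
    """Count labelled objects per cell of the S x S grid YOLO overlays on the
    input (7 x 7 by default) and return the densest cell's count, i.e. the
    value the per-region object limit has to cover."""
    cells = Counter()
    for _cls, cx, cy, _w, _h in labels:
        col = min(int(cx * grid), grid - 1)
        row = min(int(cy * grid), grid - 1)
        cells[(row, col)] += 1
    return max(cells.values(), default=0)
```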
III. EVALUATED DATASETS
To determine the effectiveness of the dataset pre-processing algorithm, we experimented with three different datasets, all publicly accessible: Drone [21], Roundabout [22] and VisDrone [24]. The following section provides a description of the principal characteristics of these datasets.

A. ''DRONE'' DATASET
This dataset consists of images of road traffic in Spain [21], with 12 video sequences recorded by a UAV (Unmanned Aerial Vehicle), or drone, and from static cameras. These are principally images of critical traffic points such as intersections and roundabouts. The videos are recorded at 1 frame per second in 4K resolution. The total dataset consists of 17,570 images of marked objects (types) such as ''cars'' and ''motorcycles''. In total there are over 155,000 labelled objects in the dataset: 137,000 cars (88.6%) and 18,000 motorcycles (11.4%). Three frames extracted from the dataset are presented in Figure 6.
FIGURE 6. Frames extracted from the dataset, corresponding to a section of interurban roadway and a split roundabout.

B. ''ROUNDABOUT'' DATASET
This dataset consists of aerial images of roundabouts in Spain taken with a drone [22], along with their respective annotations in XML (PASCAL VOC) files indicating the position of the vehicles. In total, the dataset consists of 54 sequences of drone video with a central view of roundabouts. There are a total of over 65,000 images with a resolution of 1920 × 1080, with 245,000 labelled objects (types): 236,000 cars (96.4%), 4,900 motorcycles (2.0%), 2,000 trucks (0.9%) and 1,700 buses (0.7%). Three frames extracted from the dataset are presented in Figure 7.
FIGURE 7. Frames extracted from the dataset, corresponding to three different roundabouts with light traffic, heavy traffic and very light traffic.

C. ''VISDRONE'' DATASET
This dataset is a large-scale benchmark with carefully annotated data for computer vision with drone images. The VisDrone 2019 dataset was compiled by the AISKYEYE team at the Machine Learning and Data Mining Laboratory, Tianjin University, China [24]. The complete dataset consists of 288 video clips with a total of 261,908 frames and 10,209 static images captured by various drone-mounted cameras, with a wide range of different characteristics such as location (14 different cities), setting (urban and rural), objects (pedestrians, vehicles, bicycles, etc.) and density (dispersed or very congested scenes).
It should be noted that the dataset was compiled using several different drones in various scenarios and under diverse weather and lighting conditions. The frames were manually annotated with specific objects of interest such as pedestrians, cars, bicycles, and tricycles. Other important attributes are also provided, such as the visibility of the scene, the type of object and its occlusion, for better use of the data. Three sample frames from this dataset are provided in Figure 8.
FIGURE 8. Frames extracted from the dataset, corresponding to a parking lot, an intersection, and a roundabout with different intensities of traffic.
For our study, we used only 79 sequences of video consisting of 33,600 frames. There are a total of over 1.5 million labelled items in the dataset, distributed as shown in Table 1.
TABLE 1. Types and their occurrence (number and percentage) in the VisDrone dataset.

IV. PRE-PROCESSING OF THE DATASETS
The three datasets were pre-processed using the algorithm discussed in this study, using the following equipment: a ninth-generation Intel i7 processor with 64 GB of RAM, an SSD hard drive, and an RTX 2060 graphics card with 8 GB of RAM.


S. B. Rosende et al.: Optimization Algorithm to Reduce Training Time

For software, the study used Microsoft Visual C++ and the the mAP metric adjusted to the value 0.5. The training results
OpenCV v4.5 library for their facility in generating compila- in the different epochs are shown in Figure 11.
tion files for both Windows and Linux. For this dataset, consisting of 17K images in 2K quality, the
training time using the YOLO algorithm and ‘‘Yolov5m’’ net-
A. PROCESSING THE ‘‘DRONE’’ DATASET work for 20 epochs, was 14 hours and 46 minutes, while the
The dataset was processed as follows: training time using the same computer for the pre-processed
• Initial 640 × 360 image to maintain the same proportion dataset was 1 hour and 35 minutes. If we reanalyze the
as the images in the original dataset. graph of the mAP_0.5 metric but considering training time
• Objects of interest were discarded with their position rather than epochs (Figure 12.), we see a time reduction of
does not vary in 10px of the image. some 89.3%.
• Deletion of counted objects when their area is less
than 50%.
• Key image every 7 frames.
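These four settings are the main knobs of the pre-processor. As an illustration only (the actual tool is a C++ program whose parameter names are not published), they could be grouped as follows:

```python
from dataclasses import dataclass

@dataclass
class PreprocessorConfig:
    region_size: tuple = (640, 360)  # crop size, same aspect ratio as the source
    min_shift_px: int = 10           # discard targets that move less than this
    min_inside: float = 0.5          # delete cut objects below this area share
    key_every: int = 7               # every Nth image keeps all targets

drone_config = PreprocessorConfig()  # the values listed above for ''Drone''
```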
After pre-processing the original dataset, the set is reduced to some 15,000 images, with 43,000 labelled objects, of which 36,000 are cars (82.4%) and 7,600 are motorcycles (17.6%). A comparison of the original and pre-processed datasets is provided in Figure 9 and Figure 10.
FIGURE 9. Evolution of the number of images and labels after pre-processing of the ''Drone'' dataset. There is a slight decrease in images and a significant decrease in labels.
FIGURE 10. Evolution of the number of labels assigned to each type after pre-processing of the ''Drone'' dataset.
Both datasets, the original and the pre-processed, were used to train a ''medium-sized'' YoloV5 neural network, with the mAP metric adjusted to the value 0.5. The training results over the epochs are shown in Figure 11.
FIGURE 11. Training results for mAP_0.5 of the original and pre-processed images of the ''Drone'' dataset.
For this dataset, consisting of 17K images in 2K quality, the training time using the YOLO algorithm and the ''Yolov5m'' network for 20 epochs was 14 hours and 46 minutes, while the training time using the same computer for the pre-processed dataset was 1 hour and 35 minutes. If we reanalyze the graph of the mAP_0.5 metric considering training time rather than epochs (Figure 12), we see a time reduction of some 89.3%.
FIGURE 12. mAP_0.5 graph of the time differences in training. The hours of training are indicated on the horizontal axis.
There was a significant reduction in training time. The additional time used for pre-processing, 14 minutes for this dataset, is largely insignificant compared to the total training time. For pre-processing, as opposed to the training process for the network, what matters most is not only the graphics card but also storage throughput, since the algorithm loads a great number of images. In our case, we used an SSD hard drive with a read/write speed of 600 MB/s.
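For reference, the training runs timed in this section correspond to the standard ultralytics/yolov5 workflow; a hedged sketch of such an invocation is shown below. The dataset YAML file name is hypothetical, and flag names may vary between yolov5 releases:

```python
import subprocess

# Launch a ''medium-sized'' YOLOv5 training run like the ones timed above.
# Assumes a checkout of the public ultralytics/yolov5 repository.
subprocess.run(
    ["python", "train.py",
     "--data", "drone.yaml",     # dataset images/labels description (hypothetical)
     "--weights", "yolov5m.pt",  # the ''medium'' model variant used in the paper
     "--img", "640",             # network input size
     "--epochs", "20"],
    check=True,
)
```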

B. PROCESSING THE ''ROUNDABOUT'' DATASET
The dataset was processed as follows:
• Initial region of 640 × 360 pixels, to maintain the same proportions as the images in the original dataset.
• Objects of interest were discarded when their position varies by less than 10 px between images.
• Deletion of cut objects when less than 50% of their area lies within the region.
• Key image every 7 frames.
After pre-processing the original dataset, the total number of images was increased to some 188,000, with 756,000 labelled objects, of which 727,000 were cars (96.2%), 10,000 were motorcycles (1.4%), 9,900 were trucks (1.3%) and 7,800 were buses (1.0%). Figure 13 and Figure 14 provide a comparison of the original and pre-processed datasets. In this case, the number of images increased at a rate of 1 to 3.04.
Both datasets, the original and the pre-processed, were used to train a ''medium-sized'' YoloV5 neural network. The training results over the epochs are shown in Figure 15.
If we reanalyze the graph of the mAP_0.5 metric considering training time rather than epochs on the horizontal axis (Figure 16), we see a time reduction of some 43.0%. To this time must be added an additional 30 minutes of pre-processing time for this dataset.

FIGURE 13. Evolution of the number of images and labels after pre-processing of the ''Roundabout'' dataset. There are significantly more images and labels.
FIGURE 14. Evolution of the number of labels assigned to each type after pre-processing of the ''Roundabout'' dataset. Cars are the most affected type, with a significant increase in the number of labels.
FIGURE 15. Training results for mAP_0.5 of the original and pre-processed images of the ''Roundabout'' dataset.
FIGURE 16. mAP_0.5 graph of the time differences in training. The hours of training for the ''Roundabout'' dataset are indicated on the horizontal axis.

C. PROCESSING THE ''VISDRONE'' DATASET
The dataset was processed as follows:
• Initial region of 640 × 360 pixels, to maintain the same proportions as the images in the original dataset.
• Objects of interest were discarded when their position varies by less than 10 px between images.
• Deletion of cut objects when less than 50% of their area lies within the region.
• Key image every 7 frames.
After pre-processing the original dataset, the set of images is increased to 51.5K images, with 600K labelled objects (see Table 2). A comparison between the original and pre-processed dataset is provided in Figure 17.
TABLE 2. Types and their occurrence (number and percentage) in the VisDrone dataset after processing for the optimization of the training.
FIGURE 17. Evolution of the number of images and labels after pre-processing of the ''Visdrone'' dataset. There is a slight increase in images and a significant decrease in labels.
In this case, the number of images has increased at a rate of 1 to 1.543 (154.3%), while the number of labelled objects falls to 38.8%. Both datasets, the original and the pre-processed, were used to train a ''medium-sized'' YoloV5 neural network. Training results over the epochs are shown in Figure 18.
If we reanalyze the graph of the mAP_0.5 metric considering training time rather than epochs on the horizontal axis (Figure 19), we see a time reduction of some 75.0%. To this time must be added an additional 25 minutes of pre-processing time for this dataset.

V. RESULTS
The results were validated using two networks with different training procedures: firstly, a network trained using the original images without being reduced or cropped and, secondly, a network trained using images pre-processed with the algorithm discussed in this study.
The validation was conducted not to determine the quality of the model, since it was validated against the same dataset

with which it was trained. We note that the purpose of this article is not to determine the success of the training itself, but rather whether the algorithm succeeds in reducing training times without any loss of effectiveness. The results in themselves are not significant; what matters is the difference between the results when the network is trained using the pre-processed dataset and when it is trained using the original. Thus, both training results were validated in order to compare them. The terms used in this comparison are:
• Network A: The network resulting from training on the original dataset.
• Network B: The network resulting from training on the pre-processed dataset generated using the algorithm discussed in this study.

A. ''DRONE'' CASE
Both networks underwent a validation process against the original images, generating the confusion matrices shown in Figure 20.
FIGURE 20. Confusion matrices of the original (above) and pre-processed images (below).
These matrices show, in the validation of Network B, that is, the network generated from pre-processed images, a slight increase in the number of ''false positives'', especially for the type ''car''. But a closer analysis shows that this reading is not correct; in fact, the network has a higher success rate than the labelled original. In the original images, small and distant objects of interest are not labelled, to avoid adding noise to the training process. In the training with the original images these objects are categorized correctly as true negatives, while with the cropped images these objects simply are not included in the training process (neither as true positives nor as true negatives).
But in validating against the original images, these ''true negatives'' are detected as ''true positives'' by the network trained with the pre-processed dataset. That is, Network B has a greater sensitivity to small objects that are unlabelled, but positive, in the original images.
Figure 21 shows an original frame from the video without any labelled objects, as these are very far from the camera. This image was analyzed by both neural networks (Network A and Network B). In the case of Network A, the objects were correctly learned as true negatives and were not marked (Figure 22). But in the case of Network B, these distant objects were never inputted into the network; that is, they were never marked as ''selectable objects'' and so were never marked as objects to be discarded as ''true negatives''. Thus, in processing this image, Network B will detect these objects as targets if the resolution of the image permits.
Advantages Obtained During Training: In line with the above, we found that both datasets produce a very similar trained network, even for this dataset. It may be said that the network generated using the pre-processed dataset is slightly better, detecting smaller objects of interest and with fewer false negatives.
Thus far, we have demonstrated that the training results are similar; the two networks are equivalent. But this is not the principal advantage of the algorithm, which lies in the training process itself, where better results are obtained.

FIGURE 21. Complete original image with no labelled objects, as these are too far away.
FIGURE 22. Amplification of the upper-right corner of Figure 21, where objects undetected by Network A (trained with the original dataset) appear.
FIGURE 23. Amplification of the upper-right corner of Figure 21, where objects detected by Network B (trained with a pre-processed dataset) appear.

B. ''ROUNDABOUT'' CASE
Both networks were used in a validation process against the original images, generating the confusion matrices shown in Figure 24.
FIGURE 24. Confusion matrices of the original images (above) and the pre-processed images (below).
Advantages Obtained During Training: For this dataset, consisting of 65K images in 2K quality, the training time using the YOLO algorithm and the ''Yolov5m'' network for 30 epochs was 3 days, 4 hours, and 3 minutes, while the training time using the same computer for the pre-processed dataset was 1 day, 8 hours and 46 minutes.
This is a perfect example of network training where the results are virtually the same, with very little difference between them. The greatest difference, although minimal, is in the case of the label ''car'', where there was slight confusion with ''truck''.

C. ''VISDRONE'' CASE
Both networks were validated using the original images, generating the confusion matrices shown in Figure 25.
Advantages Obtained During Training: For this dataset, consisting of 33.6K images in FullHD quality, the training time using the YOLO algorithm and the ''Yolov5m'' network for 30 epochs was 14 hours and 26 minutes, while the training time using the same computer for the pre-processed dataset was 3 hours and 36 minutes.
Here it is important to note that this training exercise presented the largest differences, although these are not significant if we consider that the network was not trained effectively. The results of the training process in both cases, for the original dataset and the pre-processed dataset, were approximately 0.3 in the mAP_0.5 metric, a very poor result.
We will explain the reasons for this poor performance, although it is important to note that these results also validate the algorithm, which is designed exclusively to reduce training times rather than to improve the training process itself.
The reason for this poor training result is that the network was trained using the values downloaded from the repository, without any prior cleaning of the dataset. For this dataset, the original labelling (not in YOLO format) includes special types and attributes. Thus, there is a 'type 0' to indicate ''regions to ignore'' (see Figure 30), and attributes that indicate
whether the labelled object is hidden, as shown in Figure 27 and Figure 28, truncated, or even the confidence (score) of the labelled objects.
FIGURE 25. Confusion matrices of the original (above) and pre-processed images (below).
FIGURE 26. Sample frames from an uncleaned dataset.
FIGURE 27. Sample frame from the labelled dataset.
FIGURE 28. Amplification of Figure 27 showing totally hidden but labelled targets (cars).
FIGURE 29. Amplification of Figure 27 showing targets (cars and motorcycles) that are perfectly identifiable and not labelled in the dataset.
To improve the training results, it is essential that the dataset first be cleaned and filtered of hidden objects, highly distorted or cut objects, and dubious labels, relabelling objects which are unlabelled but perfectly recognizable in the images (see Figure 27, Figure 28, Figure 29 and Figure 30). This was not done here, firstly, because the purpose of this article is not to evaluate the quality of the training process of neural networks using known datasets but rather to evaluate the time reductions in training provided by the algorithm; and secondly, because a clean dataset with fewer labels can optimize the training process, so this is further evidence of the effectiveness of our pre-processing system. Regardless, the algorithm reduced the training time to one quarter of the original training time.
This improvement in training times is particularly important given that the dataset in this case is not ideal for pre-processing. Figure 26 shows how the images do not meet some of the conditions for optimum effectiveness of the algorithm, such as the concentration of objects in a specific zone of the image; as can be seen, the labelled objects are distributed throughout the frame. In contrast, it does meet other parameters that allow the algorithm to be effective, such as the limited movement of objects between frames and many objects remaining immobile over many frames.
By contrast, these images demonstrate the reasons for the poor training results which, while not a problem for pre-processing, should be taken into consideration. Certain objects are labelled but totally hidden (cars under trees, for example), mislabelled or unlabelled (motorcycles, for example). Oddly, these same motorcycles are labelled in other frames of the video. There are also zones of the image which are perfectly recognizable but marked as to be ignored.
FIGURE 30. Sample frame from the labelled dataset. Here we see that the upper part of the image is marked as not labelled (red box), although many objects can be perfectly recognized.
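As an illustration of such a cleaning pass, the sketch below assumes the published VisDrone 2019 annotation layout (comma-separated lines of bbox_left, bbox_top, bbox_width, bbox_height, score, object_category, truncation, occlusion, with category 0 marking ''ignored regions''); the thresholds are ours and purely illustrative:

```python
def clean_visdrone_annotations(lines, max_truncation=1, max_occlusion=1):
    """Cleaning-pass sketch for VisDrone-style annotation lines of the form
    bbox_left,bbox_top,bbox_width,bbox_height,score,object_category,truncation,occlusion.
    Drops 'ignored region' entries (category 0) and heavily occluded or
    truncated targets; thresholds are illustrative."""
    kept = []
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 8:
            continue  # malformed line
        category = int(fields[5])
        truncation = int(fields[6])
        occlusion = int(fields[7])
        if category == 0:
            continue  # region to ignore, not a real target
        if truncation > max_truncation or occlusion > max_occlusion:
            continue  # heavily cut or (almost) totally hidden object
        kept.append(line.strip())
    return kept
```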
VI. DISCUSSION
How is it possible that partitioning an image into smaller images produces results which are not inferior to the original? In other words, how is it possible that the new dataset, in addition to being composed of smaller images, generates a smaller set of images?
The explanation is found in the first criterion, that for discarding: the elimination of images which only include objects of interest that do not move, for example parked cars. In many frames the only cars appearing are parked, with no other vehicles circulating. These cars are only labelled once in the ''key'' frame, which the configuration established every 7 images (a 7-to-1 reduction).
FIGURE 31. Example of parked cars (in blue) and circulating cars (in red).
The result is that the pre-processed dataset is not only smaller but also more balanced. A parked car will appear in all the frames of the video, giving it greater weight in the training process, while a car moving in front of the camera only appears in the sequence of images for a few seconds. Thus, a false positive of an object appearing in all the images will be more highly penalized than a false positive of an object which is only labelled in 5 or 10 frames. This means the network can ''overlearn'' some objects to the detriment of others.
It is important to note that the static objects of interest (parked cars, for example) are not only labelled once in the pre-processed dataset, as shown in Figure 31, with all their other appearances discarded because the object does not move; they are also labelled in every ''key'' image. By adjusting the configuration of this value in the algorithm, the repercussion of static objects, which are very abundant in the dataset compared with objects that appear only in a limited number of frames, can be compensated.
These two key points that the algorithm addresses primarily achieve the following (a numeric check follows the list):
• Reducing the dataset size in terms of storage space to around 20% of the original. As mentioned earlier, the original and processed image sets are not vastly different in number; in our tests, in the worst case, the number of images does not even double. However, these images are much smaller, going from around 1.5 MB (in jpg format) per original image to about 100 KB per processed image. This translates to a significant reduction. It is worth noting that the labelling file size is negligible in these calculations, as it accounts for less than 0.01% of the total dataset size.
• With smaller images, a larger number of images can be loaded in parallel into the memory of the graphics cards. In our case, we were able to go from loading 4 images in parallel to loading 42 images. This makes the training process more efficient.
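These two effects can be sanity-checked with the figures reported above; the quick estimate below uses the Drone dataset image count and treats the per-image sizes as rough averages:

```python
# Back-of-envelope check of the storage claim, using the figures reported in
# the text (per-image sizes are approximate averages, not measurements).
original_images = 17_570                  # ''Drone'' dataset frame count
original_size_mb = 1.5                    # average jpg size of a full frame
processed_images = 2 * original_images    # worst case observed: under 2x images
processed_size_mb = 0.1                   # ~100 KB per cropped region

ratio = (processed_images * processed_size_mb) / (original_images * original_size_mb)
print(f"pre-processed dataset occupies {ratio:.0%} of the original storage")
# -> about 13%, the same order as the ~20% figure quoted above
```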
The consequence of these two points is training times of around 20% of the original time, with insignificant variation in the quality of the trained network. Sometimes it even performs better than the original by avoiding overfitting in datasets with imbalanced and low-quality targets.

VII. CONCLUSION
An analysis of the results shows that the image pre-processing algorithm is efficient in terms of time and computation, and can be executed using standard equipment without any outstanding characteristics. Additionally, very significant improvements were seen in training times, with reductions from at least 50% to, depending on the dataset, 80%. If, for example, we focus on a success score of 0.95 in the mAP_0.5 metric, very significant time reductions were achieved, as shown in Figure 32:
• Drone Dataset. Training without improvement: 3 hours and 36 minutes; with pre-processing: 30 minutes. A reduction in training time of 87%.
• Roundabout Dataset. Training without improvement: 21 hours and 11 minutes; with pre-processing: 6 hours and 34 minutes. A reduction in training time of 72%.
• Visdrone Dataset. A success score of 0.95 for the metric was never achieved for this dataset. The highest success score was in epoch 9, after 4 hours and 5 minutes for the original dataset and 1 hour for the pre-processed dataset. A reduction in training time of 76%.
As shown in Figure 33, similar results can be obtained if the aim is simply a specific number of epochs.

FIGURE 32. Comparison of time in achieving a score of 0.95 in the mAP_0.5 metric.
FIGURE 33. Comparison of time in the training of 30 epochs.
Additionally, it was found that pre-processing does not alter the quality of the training. If the dataset is clean and well formatted, the training is successful in both cases, as seen in the Drone and Roundabout datasets; if the dataset is not well labelled, the network trains with the same failures as with the original.
To conclude, it is important to note the added benefit that a network trained with a pre-processed dataset tends to be more precise in detecting distant, unlabelled objects, as can be seen in Figure 5, Figure 21 and Figure 22. In the complete images these objects are trained as true negatives, while in the pre-processed network these objects are not part of the training. Thus, these objects are detected in the image during the training process, but in the validation they are counted as false positives, since they are not marked in the original dataset.

ACKNOWLEDGMENT
(Sergio Bemposta Rosende and Javier Sánchez-Soriano contributed equally to this work.) The authors would like to thank Universidad Francisco de Vitoria and the European University of Madrid for their support. They are especially grateful to the translation service of Universidad Francisco de Vitoria for their help in translating and revising the manuscript.

REFERENCES
[1] V. Kovalev, A. Kalinovsky, and V. Liauchuk, ''Deep learning in big image data: Histology image classification for breast cancer diagnosis,'' in Proc. Int. Conf. BIG DATA Adv. Anal. (BSUIR), Jun. 2016, pp. 44–53.
[2] E. Strubell, A. Ganesh, and A. McCallum, ''Energy and policy considerations for deep learning in NLP,'' in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 3645–3650.
[3] H. Mao, S. Yao, T. Tang, B. Li, J. Yao, and Y. Wang, ''Towards real-time object detection on embedded systems,'' IEEE Trans. Emerg. Topics Comput., vol. 6, no. 3, pp. 417–431, Jul. 2018, doi: 10.1109/TETC.2016.2593643.
[4] J. A. Carballo, J. Bonilla, M. Berenguel, J. Fernández-Reche, and G. García, ''New approach for solar tracking systems based on computer vision, low cost hardware and deep learning,'' Renew. Energy, vol. 133, pp. 1158–1166, Apr. 2019.
[5] B. Moons, D. Bankman, and M. Verhelst, ''Embedded deep learning,'' in Algorithms, Architectures and Circuits for Always-on Neural Network Processing. Cham, Switzerland: Springer, 2019, doi: 10.1007/978-3-319-99223-5.
[6] K. Rungsuptaweekoon, V. Visoottiviseth, and R. Takano, ''Evaluating the power efficiency of deep learning inference on embedded GPU systems,'' in Proc. 2nd Int. Conf. Inf. Technol. (INCIT), Nov. 2017, pp. 1–5, doi: 10.1109/INCIT.2017.8257866.
[7] G. Plastiras, C. Kyrkou, and T. Theocharides, ''Efficient ConvNet-based object detection for unmanned aerial vehicles by selective tile processing,'' in Proc. 12th Int. Conf. Distrib. Smart Cameras, Sep. 2018, pp. 1–6.
[8] O. Rukundo, ''Effects of image size on deep learning,'' 2021, arXiv:2101.11508.
[9] C. F. Sabottke and B. M. Spieler, ''The effect of image resolution on deep learning in radiography,'' Radiol., Artif. Intell., vol. 2, no. 1, Jan. 2020, Art. no. e190015.
[10] S. Wu, M. Zhang, G. Chen, and K. Chen, ''A new approach to compute CNNs for extremely large images,'' in Proc. ACM Conf. Inf. Knowl. Manage., Nov. 2017, pp. 39–48, doi: 10.1145/3132847.3132872.
[11] A. Ramalingam. (2021). How to Pick the Optimal Image Size for Training Convolution Neural Network. [Online]. Available: https://medium.com/analytics-vidhya/how-to-pick-the-optimal-image-size-for-training-convolution-neural-network-65702b880f05
[12] P. Lakhani, ''The importance of image resolution in building deep learning models for medical imaging,'' Radiol., Artif. Intell., vol. 2, no. 1, Jan. 2020, Art. no. e190177.
[13] G. A. Reina, R. Panchumarthy, S. P. Thakur, A. Bastidas, and S. Bakas, ''Systematic evaluation of image tiling adverse effects on deep learning semantic segmentation,'' Frontiers Neurosci., vol. 14, p. 65, Feb. 2020.
[14] A. L. S. Lee, C. C. K. To, A. L. H. Lee, J. J. X. Li, and R. C. K. Chan, ''Model architecture and tile size selection for convolutional neural network training for non-small cell lung cancer detection on whole slide images,'' Informat. Med. Unlocked, vol. 28, Jan. 2022, Art. no. 100850.
[15] K. Tong and Y. Wu, ''Deep learning-based detection from the perspective of small or tiny objects: A survey,'' Image Vis. Comput., vol. 123, Jul. 2022, Art. no. 104471, doi: 10.1016/j.imavis.2022.104471.
[16] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, ''You only look once: Unified, real-time object detection,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[17] L. F. Cordeiro, ''Development of customized dataset for training YOLO as a real-time object detection system, for Robot Arm environment,'' M.S. thesis, Polytechnical Univ. Valencia, Valencia, Spain, 2019.
[18] X. Zhao, Y. Ni, and H. Jia, ''Modified object detection method based on YOLO,'' in Proc. CCF Chin. Conf. Comput. Vis., vol. 773. Singapore: Springer, 2017, doi: 10.1007/978-981-10-7305-2_21.
[19] J. Redmon and A. Farhadi, ''YOLOv3: An incremental improvement,'' 2018, arXiv:1804.02767.
[21] S. B. Rosende, S. Ghisler, J. Fernández-Andrés, and J. Sánchez-Soriano, ''Dataset: Traffic images captured from UAVs for use in training machine vision algorithms for traffic management,'' Data, vol. 7, no. 5, p. 53, Apr. 2022.
[22] E. Puertas, G. De-Las-Heras, J. Fernández-Andrés, and J. Sánchez-Soriano, ''Dataset: Roundabout aerial images for vehicle detection,'' Data, vol. 7, no. 4, p. 47, Apr. 2022, doi: 10.3390/data7040047.
[24] P. Zhu, L. Wen, D. Du, X. Bian, H. Fan, Q. Hu, and H. Ling, ''Detection and tracking meet drones challenge,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 11, pp. 7380–7399, Nov. 2022.


SERGIO BEMPOSTA ROSENDE received the degree in computer engineering and the master's degree in big data analytics from the European University of Madrid, in 2002 and 2018, respectively. In 2004, he joined the Computer Systems Department, European University of Madrid, where he is currently an Associate Professor. His research interests include robotics, drones, machine learning, computer vision, and intelligent transportation systems.

JAVIER FERNÁNDEZ-ANDRÉS received the degree in industrial engineering and the Ph.D. degree in robotics and computer vision from the Polytechnic University of Madrid, Spain, in 1992 and 1998, respectively. In 1998, he joined the Computer Systems Department, European University of Madrid, as an Associate Professor, and was a Professor there until 2004. From 2004 to 2012, he was the Chairperson of the Department of Computer Systems and Automation. Since 2012, he has been a Full Professor with the Department of Engineering, European University of Madrid. His research interests include computer vision, intelligent transportation systems, pattern recognition, and machine learning.

JAVIER SÁNCHEZ-SORIANO received the degree in computer engineering, the master's degree in information technologies, and the Ph.D. degree in artificial intelligence from the Polytechnic University of Madrid, Spain, in 2009, 2010, and 2016, respectively. In 2012, he joined the Computer Systems Department, European University of Madrid, as an Associate Professor, and was a Professor there until 2022. Since 2022, he has been an Associate Professor with the Polytechnic School, Universidad Francisco de Vitoria. His research interests include machine learning, autonomous driving, computer vision, and intelligent transportation systems.
