Optimization Algorithm To Reduce Training Time For Deep Learning Computer Vision Algorithms Using Large Image Datasets With Tiny Objects
ABSTRACT The optimization of convolutional neural networks (CNN) generally refers to the improvement
of the inference process, making it as fast and precise as possible. While inference time is an essential factor
in using these networks in real time, the training of CNNs using very large datasets can be costly in terms
of time and computing power. This study proposes a technique to reduce training time by an average of 75%, without altering the results of CNN training, using an algorithm that partitions the dataset and discards
superfluous objects (targets). This algorithm is a tool that pre-processes the original dataset, generating a
smaller and more condensed dataset to be used for network training. The effectiveness of this tool depends on
the type of dataset used for training the CNN and is particularly effective with sequential images (video), large
images and images with tiny targets generally from drones or traffic surveillance cameras (but applicable
to any other type of image which meets the requirements). The tool can be parameterized to meet the
characteristics of the initial dataset.
INDEX TERMS Computer vision, dataset, deep learning, training optimization, OpenCV, YOLO.
[…] still provide sufficient information for the training algorithm [9], [13].

2. Partition of the original image into a mosaic of images [7], [14], [15]. This method reduces the size of the image, dividing it into several parts with a predefined grid (usually 3×3 or 4×4) of equal dimensions (length and width) to maintain the same proportions as the original image.

Both methods reduce the size of the images so that they can be processed using more modest hardware, particularly when memory is the principal limitation to processing large images. Both methods, however, have certain drawbacks:

• Image size reduction [16]. If objects are small, the loss of resolution may mean these objects become undetectable.

• Partition of the original image into a mosaic of images. The image being processed may be smaller, but there are more images to process. Additionally, objects may be cut between two images. Superimposing the regions is a way to minimize this, although it does not solve the problem, as the area of superimposition must be very large, resulting in an even greater reduction of the object and reducing the effectiveness of this solution.

In this study we propose a method to optimize training times without the losses indicated above. This method was validated in a case study using traffic images captured by drone. This involved a handicap because the objects of interest were very small compared with the total size of the image, so a solution based on reducing the original image was ruled out. For example, the size of a car or pedestrian in an image taken by a drone at a height of 50 meters may be approximately 20 × 20 pixels. If we reduce the image to a size that can be processed by a PyTorch or TensorFlow type network, that is, up to 640 × 640 pixels, we are reducing the image to one-fifth, and the objects will be too small to be accurately detected by the neural networks. Although YOLO can theoretically be trained using targets as small as 2 × 2 pixels [17], our tests with targets smaller than 16 × 16 pixels had a very low degree of precision.

In this study we will describe the method used to significantly reduce processing times without diminishing the effectiveness of the trained network.

II. TRAINING OPTIMIZATION ALGORITHM
This algorithm is designed to pre-process the labelled images of a dataset prior to their use in the habitual training process for a deep neural network. The dataset must be labelled using the format of a YOLO type network [18]. Thus, the input of this algorithm is one dataset, and the output is another dataset constructed from the original images but optimized for training (also in YOLO format). For datasets other than those of the YOLO type, the labels can be translated for use in other formats. For this reason, the method is replicable and extendible to other dataset formats.

A. TERMS USED IN ALGORITHM DEFINITION
• Target (Object) or BoundingBox: A labelled element in the image that the neural network should detect. This may be any of the types of object that the future neural network will detect by inference.
• Selected object: An object labelled in the image that has been selected as input for the neural network. This object was chosen to be part of the set of objects used to train the neural network.
• Discarded object: An object labelled to be discarded as input to the neural network. This object may be duplicated, cut, etc., and is discarded for training purposes.
• Cropped region: Portion of the image surrounding a ‘‘selected object’’. The size of this region is a configurable parameter of the algorithm. The region is the piece of the image inputted into the neural network for training, in which there is at least one selected object.
• Key image: An image on which the object discarding process is not applied. This is established every ‘‘N’’ images. This ‘‘N’’ parameter is configurable in the algorithm.

The difference between an ‘‘object’’ and a ‘‘selected object’’ is that not all marked objects in the image to be recognized are part of the input for training the neural network. Of all the labelled objects, only a subset per image will be part of the input of the neural network, the rest being discarded.

B. ALGORITHM
This algorithm, as opposed to the methods described in the bibliography, consists of two phases:
1. Discarding of objects and reduction of the training set.
2. Cropping of the training regions and relabelling of objects.

To clarify, we will use a training dataset from high-definition videos or consecutive images taken at short time intervals. In either case, these images are taken from a great distance, and each image contains various targets marked for training that are very small relative to the total size of the image. Each of these images is inputted into the algorithm in the same order they were taken by the camera (see the flow chart of all the steps of the algorithm in Figure 1).

In the first phase, that of discarding, all targets are checked against the objects in the previous image. The first image of the dataset is considered a ‘‘key’’ image, so no target is discarded and this phase is omitted. If a target shows relatively little movement compared to the previous image, it will not be a selectable object and is discarded. This parameter, the ‘‘relatively small distance’’, is configurable; the values that give the best results are 1% to 3% of the total image size, which in 2K or 4K resolution images is approximately 5 to 15 pixels. A minimal sketch of this phase is given after the list below. The principal factors affecting the selection of this parameter are:

• Type of recorded scene. From very static scenes to scenes with lots of movement. The more the objects move, the greater the discrimination distance.
• Number of Frames Per Second (FPS) at input. When the sequence of images is very close in time, objects have a smaller displacement between frames. The higher the FPS, the smaller the discrimination distance.
• Rotation of objects. If the target objects in the image move in rotation, that is, around a central axis within the BoundingBox rather than moving across the image, this may cause a loss of the object for training. In this case, pre-processing is simply not recommended.
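As an illustration, the discard phase can be sketched in a few lines of C++. The names and data structure below are our own, not the exact implementation; the study only specifies that C++ and OpenCV were used, and that labels follow the YOLO format (normalized coordinates).

```cpp
#include <cmath>
#include <vector>

// Illustrative sketch of the discard phase, assuming YOLO-style labels
// (normalized coordinates). Names and structure are illustrative only.
struct Target {
    int cls;              // object class (e.g., car, motorcycle)
    double cx, cy;        // bounding-box centre, normalized to [0, 1]
    double w, h;          // bounding-box size, normalized to [0, 1]
    bool selectable = true;
};

// Key images are exempt from discarding; n is the configurable period.
bool isKeyImage(int frameIndex, int n) { return frameIndex % n == 0; }

// Mark as non-selectable every target that moved less than the
// configurable threshold (1% to 3% of the image gave the best results)
// relative to a same-class target in the previous image.
void discardStaticTargets(std::vector<Target>& current,
                          const std::vector<Target>& previous,
                          double minDisplacement) {
    for (Target& t : current) {
        for (const Target& p : previous) {
            if (p.cls != t.cls) continue;
            if (std::hypot(t.cx - p.cx, t.cy - p.cy) < minDisplacement) {
                t.selectable = false;  // barely moved: discard for training
                break;
            }
        }
    }
}
```

For a 4K image, a threshold of 2% corresponds to calling discardStaticTargets with minDisplacement = 0.02, roughly the 5 to 15 pixel range indicated above.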
[…] the region does not extend beyond the limits of the image, maintaining the same proportions and size.

Each of the cropped regions is checked for other objects, including those discarded in the first phase. For each of the objects identified within the region, one of the following options is applied:

The Object Is Entirely Within the Cropped Region: This object is labelled to be part of the training. If the object is selectable, that is, not discarded in the first phase, it will now be marked as ‘‘not selectable’’ as it is now part of a training region.
[…] sub-images are generated (green boxes) containing the training targets. In this example we see how one of the labelled targets is partially blurred, marked in white, because it was marked as discardable in one of the regions or sub-images, as there is less than a 50% overlap.

[…] but only that there be static objects of interest. This is a factor that reduces the set of images for training, thus reducing the size of the dataset and making the process faster.
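The 50% overlap rule and the relabelling of objects relative to their cropped region can be sketched with OpenCV rectangles as follows. This is an illustrative sketch; the function names are ours, not those of the actual implementation.

```cpp
#include <array>
#include <opencv2/core.hpp>

// An object cut by the border of a cropped region is kept for training
// only if at least half of its area lies inside the region.
bool keepObjectInRegion(const cv::Rect2d& object, const cv::Rect2d& region) {
    const cv::Rect2d overlap = object & region;    // rectangle intersection
    return object.area() > 0.0 &&
           overlap.area() / object.area() >= 0.5;  // at least 50% inside
}

// Relabel a kept object relative to its cropped region, in YOLO format:
// normalized centre x/y, width and height within the crop.
std::array<double, 4> toYoloLabel(const cv::Rect2d& object,
                                  const cv::Rect2d& region) {
    const cv::Rect2d r = object & region;          // clip to the region
    return { (r.x + r.width / 2.0 - region.x) / region.width,
             (r.y + r.height / 2.0 - region.y) / region.height,
             r.width / region.width,
             r.height / region.height };
}
```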
A. ‘‘DRONE’’ DATASET
This dataset consists of images of road traffic in Spain [21], with 12 video sequences recorded by a UAV (Unmanned Aerial Vehicle), or drone, and from static cameras. These are principally images of critical traffic points such as intersections and roundabouts. The videos are recorded at 1 frame per second in 4K resolution. The total dataset consists of 17,570 images of marked objects (types) such as ‘‘cars’’ and ‘‘motorcycles’’. In total there are over 155,000 labelled objects in the dataset: 137,000 cars (88.6%) and 18,000 motorcycles (11.4%). Three frames extracted from the dataset are presented in Figure 6.
B. ‘‘ROUNDABOUT’’ DATASET
This dataset consists of aerial images of roundabouts in Spain taken with a drone [22], along with their respective annotations in XML (PASCAL VOC) files indicating the position of the vehicles. In total, the dataset consists of 54 sequences of drone video with a central view of roundabouts. There are a total of over 65,000 images with a resolution of 1920 × 1080 and 245,000 labelled objects (types): 236,000 cars (96.4%), 4,900 motorcycles (2.0%), 2,000 trucks (0.9%) and 1,700 buses (0.7%). Three frames extracted from the dataset are presented in Figure 7.
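Since the algorithm expects YOLO-format labels (Section II), annotations such as these PASCAL VOC files must first be translated. The following is a minimal illustration of that conversion; vocToYolo is a hypothetical helper of our own, not part of the study's tooling.

```cpp
#include <cstdio>

// Hypothetical translator from a PASCAL VOC box (absolute pixel corners,
// as stored in this dataset's XML files) to a YOLO label line (class id
// plus normalized centre and size), the format the algorithm expects.
void vocToYolo(int classId, double xmin, double ymin,
               double xmax, double ymax, double imgW, double imgH) {
    std::printf("%d %.6f %.6f %.6f %.6f\n", classId,
                (xmin + xmax) / (2.0 * imgW),  // normalized centre x
                (ymin + ymax) / (2.0 * imgH),  // normalized centre y
                (xmax - xmin) / imgW,          // normalized width
                (ymax - ymin) / imgH);         // normalized height
}
```

For example, a 40 × 20 px car centred at (960, 540) in a 1920 × 1080 frame, vocToYolo(0, 940, 530, 980, 550, 1920, 1080), prints ‘‘0 0.500000 0.500000 0.020833 0.018519’’.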
C. ‘‘VISDRONE’’ DATASET
This dataset is a large-scale benchmark with carefully annotated data for computer vision research on drone images. It should be noted that the dataset was compiled using several different drones in various scenarios and under diverse weather and lighting conditions. The frames were manually annotated with specific objects of interest such as pedestrians, cars, bicycles, and tricycles. Other important attributes are also provided, such as visibility of the scene, type of object and occlusion, for a better use of the data. Three sample frames from this dataset are provided in Figure 8.

FIGURE 8. Frames extracted from the dataset, corresponding to a parking lot, an intersection, and a roundabout with different intensities of traffic.

For our study, we used only 79 sequences of video consisting of 33,600 frames. There are a total of over 1.5 million labelled items in the dataset, distributed as shown in Table 1.

IV. PRE-PROCESSING OF THE DATASETS
The three datasets were pre-processed using the algorithm discussed in this study, using the following equipment: a ninth-generation Intel i7 processor with 64 GB of RAM, an SSD hard drive and an RTX 2060 graphics card with 8 GB of RAM.
For software, the study used Microsoft Visual C++ and the OpenCV v4.5 library, for their facility in generating compilation files for both Windows and Linux.

A. PROCESSING THE ‘‘DRONE’’ DATASET
The dataset was processed with the following parameters (collected in the configuration sketch after this list):
• Initial cropped regions of 640 × 360 pixels, to maintain the same proportions as the images in the original dataset.
• Objects of interest were discarded when their position does not vary by more than 10 px between images.
• Deletion of cut objects when less than 50% of their area remains within the region.
• Key image every 7 frames.
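Gathered into a single configuration structure, these parameters might look as follows. The field names are illustrative assumptions; the values mirror this run.

```cpp
// Configuration used for the "Drone" dataset run (illustrative names).
struct PreprocessConfig {
    int    cropWidth     = 640;  // region width in pixels
    int    cropHeight    = 360;  // 16:9, same proportions as the 4K source
    int    minMovementPx = 10;   // discard objects that move 10 px or less
    double minAreaInside = 0.5;  // drop cut objects below 50% area overlap
    int    keyImageEvery = 7;    // every 7th frame exempt from discarding
};
```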
After pre-processing the original dataset, the set is reduced to some 15,000 images with 43,000 labelled objects, of which 36,000 are cars (82.4%) and 7,600 motorcycles (17.6%). A comparison of the original and pre-processed datasets is provided in Figure 9 and Figure 10.

[…] the mAP metric adjusted to the value 0.5. The training results over the different epochs are shown in Figure 11.

For this dataset, consisting of 17K images in 2K quality, the training time using the YOLO algorithm and the ‘‘Yolov5m’’ network for 20 epochs was 14 hours and 46 minutes, while the training time using the same computer for the pre-processed dataset was 1 hour and 35 minutes. If we reanalyze the graph of the mAP_0.5 metric considering training time rather than epochs (Figure 12), we see a time reduction of some 89.3%.
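This figure can be verified directly from the two training times: 14 hours and 46 minutes is 886 minutes, and 1 hour and 35 minutes is 95 minutes, so the reduction is (886 − 95) / 886 ≈ 0.893, that is, approximately 89.3%.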
FIGURE 12. mAP_0.5 graph of the time differences in training. The hours
of training are indicated on the horizontal axis.
FIGURE 14. Evolution of the number of labels assigned to each type after
pre-processing of the ‘‘Roundabout’’ dataset. Cars are the most affected
type with a significant increase in the number of labels.
FIGURE 19. mAP_0.5 graph of the time differences in training. The hours
of training for the ‘‘Visdrone’’ dataset are indicated on the horizontal axis.
FIGURE 21. Complete original image with unlabelled objects, as these are too far away.
FIGURE 22. Magnification of the upper right corner of Figure 21, showing objects undetected by network A (trained with the original dataset).
FIGURE 24. Confusion matrices of the original images (above) and the
pre-processed images (below).
B. ‘‘ROUNDABOUT’’ CASE
Both networks were used in a validation process against the original images, generating the confusion matrices in Figure 24.

Advantages Obtained During Training: For this dataset, consisting of 65K images in 2K quality, the training time using the YOLO algorithm and the ‘‘Yolov5m’’ network for 30 epochs was 3 days, 4 hours, and 3 minutes, while the training time using the same computer for the pre-processed dataset was 1 day, 8 hours and 46 minutes.

This is a perfect example of network training where the results are virtually the same, with very little difference between them. The greatest difference, although minimal, is in the case of the label ‘‘car’’, where there was slight confusion with ‘‘truck’’.

FIGURE 23. Magnification of the upper right corner of Figure 21, showing objects detected by network B (trained with a pre-processed dataset).

C. ‘‘VISDRONE’’ CASE
Both networks were validated using the original images, generating the confusion matrices shown in Figure 25.

Advantages Obtained During Training: For this dataset, consisting of 33.6K images in FullHD quality, the training time using the YOLO algorithm and the ‘‘Yolov5m’’ network for 30 epochs was 14 hours and 26 minutes, while the training time using the same computer for the pre-processed dataset was 3 hours and 36 minutes.

Here it is important to note that this training exercise presented the largest differences, although these are not significant if we consider that the network was not trained effectively. The results of the training process in both cases, for the original dataset and the pre-processed dataset, were approximately 0.3 in the mAP_0.5 metric, a very poor result.

We will explain the reasons for this poor performance, although it is important to note that these results also validate the algorithm, which is designed exclusively to reduce training times rather than to improve the training process itself.

The reason for this poor training result is that the network was trained using values downloaded from the repository without any prior cleaning of the dataset. For this dataset, the original labelling (not in YOLO format) includes special types and attributes. Thus, we have a ‘‘type 0’’ to indicate ‘‘regions to ignore’’ (see Figure 30), or attributes that indicate […]