Object Detection Using Yolo Algorithm-1
Introduction
The primary objective of Object Detection is to accurately locate and identify
all objects present within an image. This task hinges on training a system to
autonomously learn the process of object detection from a dataset. However,
achieving real-time object recognition poses a formidable challenge due to the
intricacies involved in searching for and recognizing objects swiftly. Despite
ongoing research, existing methods often suffer from poor efficiency, lengthy
training durations, impracticality for real-time applications, and limited
scalability across diverse object classes.
Literature Survey:
YOLOv4 detects objects with higher accuracy than earlier algorithms such as
CNN and R-CNN. YOLOv4 is efficient because it combines a convolutional neural
network (CNN) with a grid-based, single-pass prediction scheme, achieving
65.7% average precision (AP50) on the Microsoft COCO dataset. R-CNN was the
first model to produce roughly a 30% improvement in object detection. Fast
R-CNN takes a very similar approach to R-CNN; selective search was used in
Fast R-CNN to propose object regions. After Fast R-CNN, Faster R-CNN was
introduced for object detection.
Though selective search was serviceable, it took a lot of time to detect
objects. In 2015, SSD (Single Shot MultiBox Detector) was introduced to detect
multiple objects in a single shot, which also increased the detection rate.
Anchors are used in SSD to define the default regions. As the name suggests,
it takes a single shot to detect multiple objects. The YOLO family of
architectures was constructed in the same vein as the SSD architectures, but
YOLO is more advantageous than other object detection methods because of its
high detection accuracy.
Existing System:
There are various real-time object detection models with voice output that use
different algorithms such as CNN, R-CNN, Faster R-CNN, and YOLOv3. The problem
with these algorithms is that their accuracy is lower and their real-time
detection speed is limited.
Yolo v4 Architecture:
In real-time object detection using YOLOv4, the image passes through several
convolutional layers to form a feature map. The image is divided into grid
cells, and each grid cell in the YOLO algorithm generates two anchor boxes.
Every object detection model follows the same basic steps: data augmentation
is applied to the input images, so that different orientations of the same
image are included in training to improve accuracy; normalization is then
applied to improve image quality; and regularization keeps the output within
range. Loss functions measure the error, which is reduced using
backpropagation. Classification and bounding-box regression are performed on
the images captured by the camera, and detection proceeds by dividing the task
into two parts: localization and classification. We use the Darknet framework
to implement YOLOv4.
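The augmentation and normalization steps above can be sketched as follows. This is an illustrative example, not the Darknet implementation; the function names and the toy grayscale image are our own.

```python
def horizontal_flip(image):
    """Augmentation: mirror each row of a 2D image (list of pixel rows)."""
    return [row[::-1] for row in image]

def normalize(image, max_value=255.0):
    """Normalization: scale pixel intensities into the range [0, 1]."""
    return [[p / max_value for p in row] for row in image]

image = [[0, 128, 255],
         [64, 32, 16]]

flipped = horizontal_flip(image)  # a different orientation of the same image
scaled = normalize(image)         # intensities rescaled to [0, 1]
```

In a real pipeline these operations run on tensors and include many more transforms (rotation, scaling, color jitter), but the principle is the same.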
YOLOv4 has three important parts:
• Backbone
• Neck
• Head
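How these three stages connect can be sketched with placeholder functions standing in for the backbone (feature extraction), the neck (feature aggregation), and the detection head (prediction). The real YOLOv4 layers (CSPDarknet53, SPP/PAN, YOLO head) are far more complex; this only illustrates the data flow.

```python
def backbone(image):
    """Extract features from the raw image (placeholder: per-row sums)."""
    return [sum(row) for row in image]

def neck(features):
    """Aggregate features across scales (placeholder: scale by the peak)."""
    peak = max(features) or 1
    return [f / peak for f in features]

def head(features, threshold=0.5):
    """Produce detections from aggregated features (placeholder: threshold)."""
    return [i for i, f in enumerate(features) if f >= threshold]

image = [[1, 2, 3], [4, 5, 6], [0, 0, 1]]
detections = head(neck(backbone(image)))
```

The point is the composition: each stage consumes the previous stage's output, so the image is processed in a single forward pass.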
Related Work:
Presently, automated systems employ object detection techniques utilizing
older CNN algorithms. However, we employ the Yolo algorithm, a newer
approach, for quicker and more accurate object detection, enhancing detection
speed and response times. Over recent years, Yolo has rapidly evolved and
achieved significant success in computer vision research, owing to its efficient
algorithms and effective adaptation of object detection techniques. Object
detection has made remarkable strides in computer vision, with current
research elucidating the underlying workings of these techniques.
In object detection, the system initially identifies the location and scale of
objects within an image. The primary objective of the object detector is to
identify any number of objects belonging to a particular class, regardless of
their type, location, or size in the input image. Object detection typically serves
as the initial step in computer systems, enabling them to gather additional
information such as recognizing specific instances (e.g., human faces), tracking
objects across image sequences (e.g., action tracking), and obtaining detailed
object information.
Our proposal aims to achieve swift and accurate image detection by leveraging
the YOLO (You Only Look Once) Algorithm. YOLO stands out from previous
object detection algorithms by considering the entire image rather than
specific regions, thereby enhancing efficiency. Unlike region-based methods,
YOLO employs a single convolutional network to predict bounding boxes and
class probabilities for objects.
Image processing in YOLO involves dividing the image into an SxS grid and
generating m bounding boxes within each grid, each with associated class
probability and offset values. The algorithm's processing speed is notable,
capable of handling 45 frames per second. However, it struggles with small
objects that appear in groups, such as a flock of birds, because of spatial
constraints in the grid.
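The S×S grid division can be sketched as follows: each object's center pixel determines which grid cell is responsible for detecting it. This helper is illustrative (the image size and grid size are example values, not YOLO's code):

```python
def grid_cell(cx, cy, image_size, S):
    """Map an object's center (cx, cy) in pixels to its (row, col) grid cell."""
    cell_size = image_size / S
    return int(cy // cell_size), int(cx // cell_size)

# On a 416x416 image with a 13x13 grid, each cell covers 32x32 pixels,
# so a center at (100, 300) falls into row 9, column 3.
row, col = grid_cell(100, 300, 416, 13)
```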
Distinct from other detection algorithms, YOLO takes a holistic view of the
image, forming bounding boxes around entire objects in a single pass. Its
efficiency lies in processing 45 frames per second and representing each
prediction in vector form: Y = (pc, bx, by, bh, bw, c1, c2, c3), where pc
indicates the probability that an object is present, bx, by, bh, bw denote the
bounding box's center coordinates, height, and width, and c1, c2, c3
represent the object classes.
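Decoding this prediction vector can be sketched as follows. The field layout matches the vector Y described above; the class names and threshold are placeholders of our own choosing.

```python
def decode_prediction(y, class_names, pc_threshold=0.5):
    """Split Y = (pc, bx, by, bh, bw, c1, c2, c3) into its components."""
    pc, bx, by, bh, bw = y[:5]
    class_scores = y[5:]
    if pc < pc_threshold:
        return None  # no object present in this cell
    # Pick the class with the highest score.
    best = max(range(len(class_scores)), key=lambda i: class_scores[i])
    return {"box": (bx, by, bh, bw), "class": class_names[best], "confidence": pc}

y = [0.9, 0.5, 0.5, 0.3, 0.2, 0.1, 0.8, 0.1]
result = decode_prediction(y, ["car", "person", "dog"])
```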
CNN comprises artificial neural network layers housing artificial neurons, which
mimic biological neurons by computing weighted sums of inputs to produce
activation values. Each layer in CNN detects specific features, progressively
identifying complex patterns from edges to intricate objects like faces and
birds.
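The weighted-sum computation of a single artificial neuron described above can be sketched as follows (using a sigmoid activation as one common choice):

```python
import math

def neuron(inputs, weights, bias):
    """Compute the weighted sum of inputs plus bias, then apply a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # activation value in (0, 1)

activation = neuron([1.0, 2.0], [0.5, -0.25], 0.0)
```

A convolutional layer applies many such neurons, each sliding the same shared weights over local patches of the image, which is what lets successive layers pick up edges, then textures, then whole objects.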
Implementation:
The YOLO (You Only Look Once) Algorithm operates by taking an image as
input, partitioning it into an SxS grid, with each grid containing m bounding
boxes. Within these boxes, both the class probability and offset value are
stored. Detection occurs for bounding boxes with class probabilities surpassing
a predefined threshold. Renowned for its exceptional speed, YOLO processes
45 frames per second, making it one of the fastest algorithms in computer
vision. However, it faces challenges in accurately detecting small objects, like
flocks of birds, due to spatial limitations.
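The thresholding step above, keeping only boxes whose class probability exceeds a predefined threshold, can be sketched as follows. The detection tuples and the threshold value are illustrative:

```python
def filter_detections(detections, threshold=0.5):
    """Keep only detections whose class probability exceeds the threshold."""
    return [d for d in detections if d[0] > threshold]

# Each detection: (class_probability, x, y, h, w)
candidates = [(0.92, 10, 20, 50, 40),
              (0.30, 15, 25, 45, 35),
              (0.75, 200, 180, 60, 60)]
kept = filter_detections(candidates)
```

In a full pipeline this filtering is followed by non-maximum suppression to remove overlapping boxes for the same object.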
Distinguished from other object detection methods, YOLO processes the entire
image in a single pass, which accounts for its remarkable speed of 45 frames
per second. It forms bounding boxes around objects, assigns class
probabilities, and generalizes well to new representations of objects.
For training on the dataset, we employ both forward and backward propagation.
During testing, we feed the image into the system and run forward propagation
until the output is obtained. In real-world applications, grid sizes are
typically large, often reaching dimensions like 20x20, depending on the input
provided.
When an image is passed as input, it is divided into a grid structure
mirroring the training dataset's grid layout. Each grid cell yields an output
of 3x3x16 values, consistent with the prediction model: each of the two
anchor boxes contributes eight values. The first eight values describe the
first anchor box, beginning with the probability that an object of a specific
class is present, followed by the bounding box coordinates and class
information. Likewise, the following set of eight values corresponds to the
second anchor box, in the same format.
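Splitting a grid cell's output into its two anchor-box predictions can be sketched as follows, assuming the eight-value layout Y = (pc, bx, by, bh, bw, c1, c2, c3) described earlier; the sample values are our own:

```python
def split_cell_output(cell):
    """Split a 16-value grid-cell output into two 8-value anchor predictions.

    Each anchor's vector is laid out as (pc, bx, by, bh, bw, c1, c2, c3).
    """
    assert len(cell) == 16
    return cell[:8], cell[8:]

cell = [0.9, 0.5, 0.5, 0.3, 0.2, 1, 0, 0,   # anchor 1: object of class c1
        0.1, 0.0, 0.0, 0.0, 0.0, 0, 0, 0]   # anchor 2: no object
anchor1, anchor2 = split_cell_output(cell)
```

Applying this split to every cell of the grid, then thresholding and decoding each anchor vector as shown earlier, yields the final set of detections.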