
Object Detection Using YOLO Algorithm

Introduction
The primary objective of Object Detection is to accurately locate and identify
all objects present within an image. This task hinges on training a system to
autonomously learn the process of object detection from a dataset. However,
achieving real-time object recognition poses a formidable challenge due to the
intricacies involved in searching for and recognizing objects swiftly. Despite
ongoing research endeavors, existing methods often suffer from poor efficiency,
lengthy training durations, impracticality for real-time applications, and
limited scalability across diverse object classes.

While identifying a single specific object is relatively straightforward,
distinguishing between multiple objects, even those of the same category,
presents a significant challenge, particularly for machines lacking
comprehensive knowledge of potential object variations. Object detection finds
wide-ranging applications in domains such as healthcare, traffic management,
and autonomous vehicles, where precise object identification is crucial for
optimal system performance.

Developing a real-time object detection system requires navigating through
various techniques, including image classification, where images are assigned
class labels, and object localization, which involves outlining bounding boxes
around detected objects. Object detection integrates both tasks, necessitating
simultaneous object identification via bounding boxes and the assignment of
class labels to each identified object within an image.

Literature Survey:

YOLOv4 detects objects with high accuracy compared to other algorithms such as
CNN, RCNN, etc. YOLOv4 is more efficient because it combines Convolutional
Neural Networks (CNNs) with a sliding-window-style search, achieving 65.7%
average precision (AP50) on the Microsoft COCO dataset. The first model that
produced a 30% improvement in object detection was RCNN. A very similar
approach to RCNN is Fast RCNN; selective search was used in Fast RCNN to
detect objects. After Fast RCNN, Faster RCNN was introduced for object
detection.
Though selective search was serviceable, it took a lot of time to detect
objects. In 2015, SSD (Single Shot MultiBox Detector) arrived to detect
multiple objects in a single shot, and it also increased the detection rate.
Anchors are used in SSD to define the default regions. From the name, we can
clearly see that it takes a single shot to detect multiple objects. The YOLO
family of architectures was constructed in the same vein as the SSD
architectures, but YOLO is more advantageous than any other method for object
detection because of its high accuracy.

Existing System:

There are various real-time object detection models with voice output that use
different algorithms such as CNN, RCNN, Faster RCNN, YOLOv3, etc. The problem
with these algorithms is that their accuracy is lower and their real-time
object detection speed is low.
YOLOv4 Architecture:
In real-time object detection using YOLOv4, the image passes through several
convolutional layers to form a feature map. The image is divided into grid
cells, and each grid cell in the YOLO algorithm generates two anchor boxes. In
every object detection model the following steps take place: data augmentation
is performed on the input images, where different orientations of the same
image are trained to improve accuracy. Then normalization is done to improve
the quality of the images. Regularization is applied to keep the output within
range. Loss functions are used to calculate the losses, and the loss is
reduced via backpropagation. Classification and bounding boxes are produced on
the images captured by the camera, and detection is performed. The work is
done by dividing the task into two categories: one is detection and the other
is classification. We use the Darknet framework for the implementation of
YOLOv4.
YOLOv4 has three important parts:
• Backbone
• Neck
• Head
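
As a minimal sketch of how these three parts compose (an assumed structure for
illustration, not the actual Darknet implementation; the backbone, neck, and
head are passed in as callables):

    class YOLOv4Sketch:
        # Schematic three-stage YOLOv4 pipeline: backbone -> neck -> head.
        def __init__(self, backbone, neck, head):
            self.backbone = backbone  # e.g. CSPDarknet53: extracts feature maps
            self.neck = neck          # e.g. SPP + PAN: fuses multi-scale features
            self.head = head          # dense head: predicts boxes and class scores

        def forward(self, image):
            features = self.backbone(image)  # progressively downsampled features
            fused = self.neck(features)      # mix coarse and fine information
            return self.head(fused)          # per-grid-cell box and class predictions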

CutMix, Mosaic data augmentation, class label smoothing, and DropBlock
regularization are used to increase the classifier training accuracy. The Mish
activation function is also used in addition for classification and training.
Instead of using a single image, Mosaic data augmentation uses four images at
the same time for better processing.
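
A simplified sketch of the idea, assuming images are NumPy arrays; a real
Mosaic implementation also picks a random split point and remaps the
bounding-box labels, both omitted here:

    import numpy as np

    def mosaic(images, out_size=608):
        # Tile four training images into one composite image.
        assert len(images) == 4
        half = out_size // 2
        canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
        for idx, img in enumerate(images):
            # Naive nearest-neighbour resize of each image to one quadrant
            ys = np.linspace(0, img.shape[0] - 1, half).astype(int)
            xs = np.linspace(0, img.shape[1] - 1, half).astype(int)
            patch = img[ys][:, xs]
            row, col = divmod(idx, 2)
            canvas[row * half:(row + 1) * half,
                   col * half:(col + 1) * half] = patch
        return canvas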
CutMix data augmentation: different orientations and random patches are mixed
between the training images. Localization ability is increased on the less
discriminative parts of the object to be classified.
Class label smoothing: label smoothing is a regularization technique that
addresses overfitting and overconfidence during classification.
Mish activation function: it is a self-regularized function, defined in
mathematical terms as:
f(x) = x · tanh(softplus(x))
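
A direct NumPy sketch of this definition:

    import numpy as np

    def softplus(x):
        # Numerically stable softplus: log(1 + exp(x))
        return np.logaddexp(0.0, x)

    def mish(x):
        # Mish as defined above: f(x) = x * tanh(softplus(x))
        return x * np.tanh(softplus(x))

    # Example: mish(np.array([-2.0, 0.0, 2.0])) ≈ [-0.2525, 0.0, 1.9440]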
The detector performance is increased by using SPP, PAN, and SAM.
SPP: Spatial Pyramid Pooling. By pooling simultaneously with multiple kernel
sizes, spatial pyramid pooling acquires both the coarse and fine information
required for further processing.
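
A sketch of this block in PyTorch; the kernel sizes 5, 9, and 13 follow the
YOLOv4 paper, and the helper name is illustrative:

    import torch
    import torch.nn.functional as F

    def spp_block(x, kernel_sizes=(5, 9, 13)):
        # Max-pool the same feature map at several kernel sizes (stride 1,
        # "same" padding), then concatenate the results with the input along
        # the channel dimension, mixing coarse and fine context.
        pooled = [F.max_pool2d(x, kernel_size=k, stride=1, padding=k // 2)
                  for k in kernel_sizes]
        return torch.cat([x] + pooled, dim=1)

    # Example: a (1, 512, 19, 19) feature map becomes (1, 2048, 19, 19).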
PAN: Path Aggregation Network is a technique that preserves the maximum
information from layers close to the input by aggregating features from
different convolutional layers.
SAM: a modified Spatial Attention Module is used to highlight the most
important and miniature features. For further optimization, CSPNet divides the
feature map of the base layer into two segments and then merges them together
using a cross-stage hierarchy. In YOLOv4, CIoU loss is used to reduce the
errors in bounding boxes by also considering the midpoints of the actual and
predicted bounding boxes. The CmBN (Cross mini-Batch Normalization) technique
is used, and the result is that it decreases the cost of training.
The Bag of Freebies (BoF) for the backbone comprises class label smoothing,
DropBlock regularization, and CutMix. The Bag of Specials (BoS) for the
backbone comprises cross-stage partial connections and Mish activation. The
Bag of Freebies for the detector comprises CmBN, DropBlock regularization,
Self-Adversarial Training, grid-sensitivity elimination, a cosine annealing
scheduler, optimal hyperparameters, and random training shapes. The Bag of
Specials for the detector comprises Mish activation, the SPP block, the SAM
block, the PAN path-aggregation block, and DIoU-NMS. After the introduction of
the Bag of Freebies and Bag of Specials, classification and image detection
became much easier, and anyone can use them for training a model.
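
A sketch of the CIoU loss for two boxes in (cx, cy, w, h) format, following
the published formula (illustrative, not Darknet's implementation): 1 - IoU
plus a center-distance penalty and an aspect-ratio term. Positive widths and
heights are assumed.

    import math

    def ciou_loss(box_pred, box_true):
        px, py, pw, ph = box_pred
        tx, ty, tw, th = box_true

        # Corner coordinates of both boxes
        p_x1, p_y1, p_x2, p_y2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
        t_x1, t_y1, t_x2, t_y2 = tx - tw / 2, ty - th / 2, tx + tw / 2, ty + th / 2

        # IoU of the two boxes
        inter_w = max(0.0, min(p_x2, t_x2) - max(p_x1, t_x1))
        inter_h = max(0.0, min(p_y2, t_y2) - max(p_y1, t_y1))
        inter = inter_w * inter_h
        iou = inter / (pw * ph + tw * th - inter)

        # Squared center distance over squared enclosing-box diagonal
        c_w = max(p_x2, t_x2) - min(p_x1, t_x1)
        c_h = max(p_y2, t_y2) - min(p_y1, t_y1)
        d2 = (px - tx) ** 2 + (py - ty) ** 2
        c2 = c_w ** 2 + c_h ** 2

        # Aspect-ratio consistency term
        v = (4 / math.pi ** 2) * (math.atan(tw / th) - math.atan(pw / ph)) ** 2
        alpha = v / (1 - iou + v + 1e-9)

        return 1 - iou + d2 / c2 + alpha * v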

Related Work:
Presently, automated systems employ object detection techniques utilizing
older CNN algorithms. However, we employ the YOLO algorithm, a newer
approach, for quicker and more accurate object detection, enhancing detection
speed and response times. Over recent years, YOLO has rapidly evolved and
achieved significant success in computer vision research, owing to its efficient
algorithms and effective adaptation of object detection techniques. Object
detection has made remarkable strides in computer vision, with current
research elucidating the underlying workings of these techniques.
In object detection, the system initially identifies the location and scale of
objects within an image. The primary objective of the object detector is to
identify any number of objects belonging to a particular class, regardless of
their type, location, or size in the input image. Object detection typically
serves as the initial step in computer vision systems, enabling them to gather
additional
information such as recognizing specific instances (e.g., human faces), tracking
objects across image sequences (e.g., action tracking), and obtaining detailed
object information.

Object detection techniques find applications in various domains, including
human-computer interaction (e.g., Siri or Alexa), robotics, smartphones, data
tracking, and search engines (e.g., Google, Firefox, and targeted
advertisements). Each application of object detection features distinct
characteristics; some focus on facial recognition, while others involve updating
previously searched data automatically. Different systems may handle
single-object detection within a single view or detect single objects across
multiple
views. The output of these systems often depends on the views provided
during training.

The Proposed Framework:

Our proposal aims to achieve swift and accurate image detection by leveraging
the YOLO (You Only Look Once) Algorithm. YOLO stands out from previous
object detection algorithms by considering the entire image rather than
specific regions, thereby enhancing efficiency. Unlike region-based methods,
YOLO employs a single convolutional network to predict bounding boxes and
class probabilities for objects.

Image processing in YOLO involves dividing the image into an SxS grid and
generating m bounding boxes within each grid cell, each with an associated
class
probability and offset values. The algorithm's processing speed is notable,
capable of handling 45 frames per second. However, it struggles with small
objects, such as flocks of birds, due to spatial constraints.

Distinct from other detection algorithms, YOLO takes a holistic view of the
object, forming bounding boxes around entire objects in a single pass. Its
efficiency lies in processing 45 frames per second and representing each
prediction in vector form: Y = (pc, bx, by, bh, bw, c1, c2, c3), where pc
indicates the probability that an object is present, bx and by denote the
bounding box's center coordinates, bh and bw its height and width, and c1,
c2, c3 represent the object classes.
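
A small sketch of reading one such prediction vector (three classes are
assumed purely to match c1..c3 above; the function name is illustrative):

    import numpy as np

    def decode_prediction(y, threshold=0.5):
        # y = (pc, bx, by, bh, bw, c1, c2, c3) as described above
        pc = y[0]                            # probability an object is present
        if pc < threshold:
            return None                      # no confident detection here
        bx, by, bh, bw = y[1:5]              # box center (bx, by), height, width
        class_id = int(np.argmax(y[5:8]))    # most likely of the three classes
        return (bx, by, bh, bw), class_id

    # Example: decode_prediction(np.array([0.9, 0.5, 0.5, 0.2, 0.3, 0.1, 0.7, 0.2]))
    # returns ((0.5, 0.5, 0.2, 0.3), 1)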

For scenarios with multiple bounding boxes, YOLO employs non-max suppression
to select the most accurate bounding box while discarding others.
This suppression is based on the intersection over union (IoU) formula,
comparing the intersection and union areas of bounding boxes.
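
A minimal sketch of that formula for two boxes in corner format
(x1, y1, x2, y2):

    def iou(box_a, box_b):
        # Intersection over union: overlap area divided by combined area.
        ax1, ay1, ax2, ay2 = box_a
        bx1, by1, bx2, by2 = box_b
        inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
        inter = inter_w * inter_h
        area_a = (ax2 - ax1) * (ay2 - ay1)
        area_b = (bx2 - bx1) * (by2 - by1)
        return inter / (area_a + area_b - inter)

    # Example: iou((0, 0, 2, 2), (1, 1, 3, 3)) = 1 / 7 ≈ 0.143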

CNN (Convolutional Neural Network) has revolutionized object detection in
recent decades by effectively handling vast amounts of data. It gained
prominence in 2012 when AlexNet won the ImageNet Computer Vision contest
with 84% accuracy, utilizing CNN for object detection. CNN's role in computer
vision techniques is pivotal, especially demonstrated by its incorporation into
the winning project of the ImageNet competition.

CNN comprises artificial neural network layers housing artificial neurons, which
mimic biological neurons by computing weighted sums of inputs to produce
activation values. Each layer in CNN detects specific features, progressively
identifying complex patterns from edges to intricate objects like faces and
birds.
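
As a concrete sketch of that weighted-sum computation (ReLU is chosen here
only for illustration):

    import numpy as np

    def artificial_neuron(inputs, weights, bias):
        # Weighted sum of the inputs plus a bias, passed through an
        # activation function to produce the neuron's activation value.
        weighted_sum = np.dot(weights, inputs) + bias
        return max(0.0, weighted_sum)  # ReLU activation

    # Example: artificial_neuron(np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.1)
    # returns 0.1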
Implementation:

The YOLO (You Only Look Once) Algorithm operates by taking an image as
input, partitioning it into an SxS grid, with each grid cell containing m
bounding boxes. Within these boxes, both the class probability and offset value are
stored. Detection occurs for bounding boxes with class probabilities surpassing
a predefined threshold. Renowned for its exceptional speed, YOLO processes
45 frames per second, making it one of the fastest algorithms in computer
vision. However, it faces challenges in accurately detecting small objects, like
flocks of birds, due to spatial limitations.

Distinguished from other object detection methods, YOLO processes the entire
object in a single instance, emphasizing its remarkable speed of 45 frames per
second. It forms bounding boxes around objects, assigns class probabilities,
and demonstrates an understanding of object generalization.

For training the dataset, we employ both forward and backward propagation
models. During testing, we feed the image into the system, conducting forward
propagation until the desired output is obtained. In real-world applications,
grid sizes are typically large, often reaching dimensions like 20x20, depending
on the input provided.

Algorithm for Object Detection:

INPUT: Trained image dataset, testing image dataset
OUTPUT: Input image labelled with its class name, along with the average
precision value and a rectangular box around every object.
1: Pass the input image for detection
2: Apply the YOLO algorithm to the image
3: The YOLO algorithm processes the image
4: Based on the trained dataset, it assigns the class names
5: Verify whether the confidence value crosses the 0.5 threshold
6: If the confidence is greater than 0.5, the detection is kept and the mean
average precision value is provided; otherwise it is simply ignored.
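
A sketch of the thresholding in steps 5 and 6, assuming the network yields
(confidence, box, class_name) tuples; these names are illustrative, not the
Darknet API:

    def filter_detections(predictions, threshold=0.5):
        detections = []
        for confidence, box, class_name in predictions:
            if confidence > threshold:   # step 5: confidence crosses 0.5
                detections.append((box, class_name, confidence))
            # otherwise the prediction is simply ignored (step 6)
        return detections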
Results:

When an image is passed as input, it is divided into a grid structure
mirroring the training dataset's grid layout. Each grid cell yields an output
of 16 values (two anchor boxes of eight values each), consistent with the
prediction model. Within each anchor box's values, the initial value signifies
the probability of an object belonging to a specific class. The first eight
values pertain to the first anchor box's characteristics, including bounding
box coordinates and class information. Likewise, the following set of eight
values corresponds to the second anchor box, maintaining the same format.

Subsequently, non-maximum suppression is applied to each bounding box to
consolidate them into singular entities, mitigating redundancy.

The training process for YOLO unfolds as follows:

The input image typically has dimensions (608, 608, 3).

The image is fed into a CNN, resulting in an output tensor of dimensions (19,
19, 5, 85), which is then flattened across the last two dimensions to yield
(19, 19, 425).
In this configuration:
Each grid cell (19x19) generates 425 values.
The product of 85 and 5 yields 425, with 5 representing the number of anchor
boxes per grid cell.
Of the 85 values per anchor box, the first five are the confidence score and
the bounding box's center coordinates (x, y), width, and height, while the
remaining 80 denote the total classes for detection.
Finally, non-maximum suppression is applied to refine bounding boxes,
ensuring the retention of only the most relevant box per detected object while
eliminating overlap.
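
A sketch of these two steps, assuming a NumPy output tensor and boxes already
decoded to corner format; iou() is the helper sketched earlier, and the
function names are illustrative:

    import numpy as np

    def flatten_output(output):
        # (19, 19, 5, 85) network output -> the flattened (19, 19, 425) form
        assert output.shape == (19, 19, 5, 85)
        return output.reshape(19, 19, 425)

    def non_max_suppression(boxes, scores, iou_threshold=0.5):
        # Greedy NMS: keep the most confident box, drop boxes overlapping it
        # too strongly, and repeat with the remainder.
        order = np.argsort(scores)[::-1]      # indices, highest score first
        keep = []
        while len(order) > 0:
            best = order[0]
            keep.append(best)                 # retain the best remaining box
            order = np.array([i for i in order[1:]
                              if iou(boxes[best], boxes[i]) < iou_threshold])
        return keep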

Conclusion and Future Scope:

In light of the increasing popularity of object detection applications across
various domains, we have developed a console-based application. This
application accepts an image as input and outputs the same image with object
names detected, displayed atop bounding boxes drawn around the identified
objects. To train our custom dataset, we utilized Google Colab, employing
supervised learning techniques facilitated by LabelImg for data labeling.
Leveraging the YOLO algorithm, we ensure swift results and accuracy, utilizing
the latest version for optimal performance.
