Object Detection Using AI
BELGAUM- 590018
An Internship Report on
BACHELOR OF ENGINEERING
in
ELECTRONICS AND COMMUNICATION ENGINEERING
Submitted by
NACHITHA Y K (4GH18EC025)
CERTIFICATE
This is to certify that the internship work entitled “OBJECT DETECTION USING AI” is
bonafide work carried out by NACHITHA Y K (4GH18EC025) in partial fulfillment for the award
of the degree of Bachelor of Engineering in Electronics and Communication Engineering at
Government Engineering College, Hassan, Visvesvaraya Technological University,
Belgaum-590014, during the year 2020-21.
It is certified that all the corrections/suggestions indicated for the internal assessment have been
approved, as the work satisfies the academic requirements in respect of the internship work prescribed for
the said degree.
……………………….. ………………………….. .........……………….
Guide HOD Principal
Mrs. Pallavi H V Mr. Neelappa Dr. Prashanth S
External Viva-Voce
1)
2)
ABSTRACT
Data is the new oil of today's technological society. The availability of good data has raised
benchmarks of performance in terms of speed and accuracy. This improvement is visible
because data is now processed by two widely used technologies, Computer Vision
(CV) and Artificial Intelligence (AI). These two technologies have enabled major tasks such as object
detection and tracking for traffic vigilance systems. As the number of features in an image increases, so does the demand for
efficient algorithms that extract hidden features. A Convolutional Neural Network (CNN)
model is designed on an urban vehicle dataset for single object detection, and YOLOv3 is used for multiple
object detection on the KITTI and COCO datasets. Model performance is analyzed, evaluated and
tabulated using performance metrics such as True Positive (TP), True Negative (TN), False Positive
(FP), False Negative (FN), accuracy,
precision, confusion matrix and mean Average Precision (mAP). Objects are tracked across
frames using YOLOv3 and Simple Online Real Time Tracking (SORT) on traffic surveillance video.
This paper highlights state-of-the-art networks such as DarkNet. Efficient
detection and tracking on the urban vehicle dataset is demonstrated. The algorithms give accurate,
precise identifications suitable for real-time traffic applications.
ACKNOWLEDGEMENT
I present, with immense pleasure, this work titled “OBJECT DETECTION USING AI”. I
express my heartfelt thanks to our beloved Principal, Dr. Prashanth S, GEC, Hassan, for his
encouragement throughout my studies.
At the outset, I express my most sincere thanks to Mr. Neelappa, Head of the Department,
Department of E&CE, for his continuous support and advice, not only during the course of my
internship work but also during my stay at GECH.
I express my gratitude towards my internship guide and internship head, Mrs. Pallavi H V,
Associate Professor, Department of E&CE, for her encouragement and support throughout my
work.
Finally, I express my thanks to all the teaching staff of the Dept. of E&CE, my fellow classmates and my
parents for their timely support and suggestions.
I am conscious of the fact that I received co-operation in many ways from the teaching and
non-teaching staff of the Department of Electronics and Communication Engineering, and I am grateful
for their co-operation and guidance in completing my task well in time. I thank one and
all who have helped me in one way or another in completing my internship on time.
NACHITHA Y K (4GH18EC025)
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENT
CHAPTER 3
Preamble
CHAPTER 4
CHAPTER 5
CHAPTER 6
6.1 Applications
Conclusion
Reference
Object Detection Using AI 2021-2022
CHAPTER 1
COMPANY PROFILE
Loginware Softtec Pvt. Ltd. is an emerging startup established in 2016 and
based in Hassan, a tier II city of Karnataka State. Loginware is a knowledge-driven company
that values cutting-edge technology practices and provides comprehensive solutions to help our
customers achieve their goals. Loginware is changing the world by changing the way
knowledge can be shared. Loginware has dedicated young minds striving to connect
individuals with each other and with technology. Loginware Softtec Pvt. Ltd. is a proactive
player covering the full spectrum of software services, from design, development,
implementation and validation to support and corporate training.
1.1 Vision
1.2 Mission
Bringing out the best in everyone we touch; motivating, inspiring and empowering each other
to do things we never thought were possible.
1.3 Services
Loginware is the one-stop partner for all the technology needs of tier II and tier III cities.
In-depth knowledge of various technology areas enables us to provide end-to-end
solutions and services. With our 'Web of Participation', we maximize the benefits of
our depth, diversity and delivery capability.
CHAPTER 2
2. People Proficiency
4. Internship Program
5. Project Guidance
7. Placement Support
CHAPTER 3
PREAMBLE
3.1 INTRODUCTION
Over the past years, domains such as image analysis and video analysis have gained a wide scope
of applications. CV and AI are the two main technologies dominating the field.
These technologies try to mimic human biology. Human vision is the sense through which the
outer 3D world is perceived. Human intelligence is trained over years to
distinguish and process the scenes captured by the eyes. These intuitions act as a crux for budding
new technologies. Rich data resources now help researchers to extract more details
from images. These developments are due to state-of-the-art methods like CNN.
Applications from Google, Facebook, Microsoft and Snapchat are all results of tremendous
improvement in computer vision and deep learning. Over time, vision-based
technology has transformed from just a sensing modality into intelligent computing systems
that can understand the real world. Computer vision applications like vehicle navigation,
surveillance and autonomous robot navigation find object detection and tracking to be
important challenges. For tracking vehicles and other real-world objects, video surveillance
is a dynamic environment. In this paper, an efficient algorithm is designed for object detection
and tracking for video surveillance in complex environments.
Object detection and tracking go hand in hand in computer vision applications. Object
detection is identifying an object, or locating the instance of interest, in a group of suspected frames.
Object tracking is identifying the trajectory, or path, that an object takes across consecutive frames. The images
obtained from the dataset form a collection of frames. The basic block diagram of object detection and
tracking is shown in Fig. 1. The dataset is divided into two parts: 80 % of the images are
used for training and 20 % for testing. Objects in an image are found using
the CNN and YOLOv3 algorithms. A bounding box is formed around an object with Intersection over
Union (IoU) > 0.5. The detected bounding boxes are passed as references to the neural networks, aiding them
in tracking. Each bounding box is tracked across consecutive frames using Multi Object Tracking
(MOT). This research work can be used to estimate traffic density at traffic junctions,
in autonomous vehicles to detect various kinds of objects under varying illumination, and in smart city
development and intelligent transport systems [18]. The paper is organized as follows. Section II
identifies the research gap through an extensive literature survey. Section III covers the fundamental
concepts of object detection and tracking. Section IV describes design, implementation
details and specifications. Section V discusses simulation results and analysis. Section VI
describes conclusions and future scope.
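The IoU criterion mentioned above can be illustrated with a short sketch. It assumes boxes are given as (x1, y1, x2, y2) corner coordinates; the function names `iou` and `is_detection` are hypothetical and are not taken from the project code.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_detection(predicted, ground_truth, threshold=0.5):
    """Accept a predicted box only if its IoU with the ground truth exceeds the threshold."""
    return iou(predicted, ground_truth) > threshold
```

For example, two 10×10 boxes offset by 5 pixels in each direction overlap in a 5×5 square, giving an IoU of 25/175, which is well below the 0.5 threshold.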
3.2 METHODOLOGY
CNN is a widely used neural network architecture for computer vision tasks. The advantage
of a CNN is that it performs feature extraction on images automatically, i.e. the important features
are detected by the network itself.
A CNN is made up of three important components, the convolutional layer, the pooling layer and the
fully connected layer, as shown in Fig. 3.2. A grayscale image of size 32×32
would have 1024 input nodes in a multi-layer perceptron approach. This process of flattening the pixels loses the spatial
positions in the image. A CNN instead retains the spatial relationship between picture elements by learning
internal feature representations using small squares of input data.
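The anchor box is regressed to the ground truth box by gradual optimization as shown in Fig.
4. The coordinate parameters are defined as
$b_x = \sigma(t_x) + c_x$ (1)
$b_y = \sigma(t_y) + c_y$ (2)
$b_w = p_w e^{t_w}$ (3)
$b_h = p_h e^{t_h}$ (4)
where $t_x, t_y, t_w, t_h$ are the predictions made by YOLO, $(c_x, c_y)$ is the top-left corner of the grid cell of
the anchor, $p_w, p_h$ are the width and height of the anchor, $b_x, b_y, b_w, b_h$ describe the predicted bounding box, and
$\sigma(t_o)$ is the box confidence score.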
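As an illustration of how raw network outputs become a box, the sketch below assumes the standard YOLOv3 parameterization: the centre offsets pass through a sigmoid and are added to the grid-cell corner, and the anchor width and height are scaled exponentially. The function name `decode_box` and its argument layout are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, anchor):
    """Turn raw predictions into a box in grid units.

    t      -- (t_x, t_y, t_w, t_h), raw network outputs
    cell   -- (c_x, c_y), top-left corner of the grid cell
    anchor -- (p_w, p_h), width and height of the anchor box
    """
    t_x, t_y, t_w, t_h = t
    c_x, c_y = cell
    p_w, p_h = anchor
    b_x = sigmoid(t_x) + c_x    # centre stays inside its grid cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)   # anchor size scaled by a positive factor
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h
```

With all four raw outputs equal to zero, the decoded box sits at the centre of its grid cell and has exactly the anchor's width and height.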
Predictions are made at 3 different scales as in Fig. 5. The initial prediction is made at the last
feature map layer. The feature map is then upsampled by a factor of 2. YOLOv3 merges a higher-resolution feature
map from earlier in the network with the upsampled features using concatenation. Convolutional layers are applied to
obtain the second prediction. Repeating this step yields the third prediction, which combines fine spatial detail with high semantic information.
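The three prediction scales can be made concrete with a small calculation. The input size of 416 pixels, the strides of 32, 16 and 8, and the three anchors per scale are values commonly used with YOLOv3; they are assumptions here, not settings stated in this report.

```python
def yolo_grid_sizes(input_size=416, strides=(32, 16, 8)):
    """Side length of the prediction grid at each scale (coarse to fine)."""
    return [input_size // s for s in strides]


def total_boxes(input_size=416, anchors_per_scale=3):
    """Number of candidate boxes predicted over all three scales."""
    return sum(g * g * anchors_per_scale for g in yolo_grid_sizes(input_size))
```

Under these assumptions the grids are 13×13, 26×26 and 52×52, so each upsampling step doubles the grid resolution and the finest grid is the one best suited to small objects.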
Two-stage algorithms from the region proposal network family have two different
networks for proposing regions and extracting features. The RCNN family reaches about 7 FPS, which is too low
for real-time applications. One-stage algorithms overcome this drawback by employing
single shot detectors. Single shot detectors face a trade-off between accuracy and real-time
processing. They have difficulty identifying small objects or objects that are too close together.
SSD networks are as popular as YOLO and may outperform
YOLO in terms of speed, but their spatial resolution drops significantly and they therefore miss
small objects. One solution to this challenge is to increase the image resolution. Each version of the YOLO
family improves accuracy and latency. YOLOv3 has DarkNet-53 as its backbone. The network
requires fewer BFLOPs (billion floating-point operations) than ResNet-152. The
inclusion of a Feature Pyramid Network (FPN) helps in detecting small objects. FPN uses
both a bottom-up and a top-down pathway. The bottom-up pathway is used for feature extraction.
As we move up this pathway, the spatial resolution decreases while the semantic value of each
layer increases.
C. Object Tracking
The Internet is the main network connecting millions of people in the world. Video is its main source of entertainment
and a major source of knowledge. A video is a collection of frames. The negligible time
gap between frames makes the stream of pictures look like a continuous flow of scenes. When designing
algorithms for video processing, videos are classified into two classes. A video stream is an
ongoing input for video analysis, so the processor is not aware of future frames. A video
sequence is a video of fixed length, so all the frames are available before the
current frame is processed. Motion is the distinct factor that differentiates video from a single frame. Motion is a
powerful visual cue. Object properties and actions can be recognized by observing only sparse points
in the image.
1) Track handling and state estimation: The assignment problem maps the predictions of the
Kalman filter to newly arrived measurements. The task of associating the two sets is
performed by the Hungarian algorithm. Including additional information such as motion and appearance
parameters in the association leads to better mappings.
(5)
Unlikely associations can be removed by thresholding at a 95% confidence interval. The decision
is expressed with an indicator.
(6)
When the motion uncertainty is large, the Mahalanobis distance is not a suitable metric, hence a second metric
is used to aid association. This metric computes an appearance descriptor for each bounding box detection
d_j.
(7)
(8)
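The association step can be sketched as follows, under several stated assumptions: each track is summarized by a predicted mean and a diagonal covariance (a simplification of the full Kalman covariance), measurements are 4-dimensional, and the gate is the 0.95 quantile of the chi-square distribution with 4 degrees of freedom (9.4877). For brevity the optimal assignment is found by brute force over permutations, which gives the same minimum-cost result as the Hungarian algorithm on small problems but does not scale; a library routine such as `scipy.optimize.linear_sum_assignment` would normally be used instead. All names are hypothetical.

```python
from itertools import permutations

# 0.95 quantile of the chi-square distribution with 4 degrees of freedom.
CHI2_95_4DOF = 9.4877


def mahalanobis_sq(detection, mean, variances):
    """Squared Mahalanobis distance, assuming a diagonal covariance."""
    return sum((d - m) ** 2 / v for d, m, v in zip(detection, mean, variances))


def associate(tracks, detections, gate=CHI2_95_4DOF):
    """Match tracks to detections at minimum total cost.

    tracks     -- list of (mean, variances) pairs predicted by the filter
    detections -- list of measurement vectors
    Returns sorted (track_index, detection_index) pairs that pass the gate.
    """
    n_t, n_d = len(tracks), len(detections)
    if n_t == 0 or n_d == 0:
        return []
    dist = [[mahalanobis_sq(d, m, v) for d in detections] for m, v in tracks]
    big = 1e6  # cost given to gated-out pairs so they are chosen last
    cost = [[x if x <= gate else big for x in row] for row in dist]

    # Enumerate every one-to-one assignment of the smaller set to the larger.
    if n_t <= n_d:
        candidates = ([(i, p[i]) for i in range(n_t)]
                      for p in permutations(range(n_d), n_t))
    else:
        candidates = ([(p[j], j) for j in range(n_d)]
                      for p in permutations(range(n_t), n_d))

    best_pairs, best_total = [], float("inf")
    for pairs in candidates:
        total = sum(cost[i][j] for i, j in pairs)
        if total < best_total:
            best_pairs, best_total = pairs, total

    # Drop pairs whose distance exceeds the gate (unlikely associations).
    return sorted((i, j) for i, j in best_pairs if dist[i][j] <= gate)
```

Tracks left unmatched would typically age out, and unmatched detections would start new tracks; that bookkeeping is omitted here.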
CHAPTER 4
OVERVIEW OF THE PROJECT
4.1 BLOCK DIAGRAM
There is a wide range of computer vision tasks benefiting society, such as object classification,
detection, tracking, counting, semantic segmentation and image captioning. The process of
identifying objects in an image and finding their positions is known as object detection.
Various object detection tasks are shown in Fig. 2. With advancements in the field of computer
vision assisted by AI, these tasks became realizable over time. Semantic
segmentation is the task of clustering pixels based on similarities. Classification + localization and
object detection are methods of identifying the class of an object and drawing a bounding box around it to
make it distinct. Instance segmentation is semantic segmentation applied to multiple objects. The
general intuition for performing these tasks is to apply a CNN over the image. The CNN works on image
patches to carry out the task. Many such salient regions can be obtained by region proposal
networks like Region Convolutional Neural Network (RCNN), Fast Region
Convolutional Neural Network (Fast-RCNN) and Faster Region Convolutional Neural Network
(Faster-RCNN). A hierarchical grouping algorithm is used to perform selective search for object recognition.
A few bottlenecks of these approaches are mitigated by state-of-the-art
algorithms like You Only Look Once (YOLO) and Single Shot Detector (SSD). An efficient object
detection algorithm is one that gives a bounding box to every object to be recognized, whatever its size,
with good computational efficiency and fast processing. YOLO and SSD
give promising results, but have a trade-off between speed and accuracy. Hence, the selection of an
algorithm is application specific.
This algorithm can be described as a supervised classification algorithm. Data flows through
the CNN layers, and various operations are performed on it. The learning rate and callbacks are
defined, along with the number of epochs and the batch size. The epochs are then executed,
and the algorithm learns from the training data. Training accuracy and training loss are
constantly monitored. If the training accuracy starts falling below a threshold, the callback function
is invoked and training is stopped. A confusion matrix is then plotted using the training and testing
data. Various performance parameters can be defined and observed using the confusion matrix.
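The stopping rule described above can be sketched in plain Python, independent of any particular framework (the report does not name the one used). The class name, the `patience` parameter and the accuracy values in the usage note are hypothetical and serve only to illustrate the logic.

```python
class AccuracyThresholdStopper:
    """Signals a stop once accuracy stays below a threshold for `patience` epochs."""

    def __init__(self, threshold, patience=1):
        self.threshold = threshold
        self.patience = patience
        self.bad_epochs = 0  # consecutive epochs below the threshold

    def should_stop(self, accuracy):
        if accuracy < self.threshold:
            self.bad_epochs += 1
        else:
            self.bad_epochs = 0
        return self.bad_epochs >= self.patience


def run_training(epoch_accuracies, stopper):
    """Walk through per-epoch accuracies and return the number of epochs executed."""
    executed = 0
    for accuracy in epoch_accuracies:
        executed += 1
        if stopper.should_stop(accuracy):
            break
    return executed
```

For instance, with a threshold of 0.55 and accuracies 0.6, 0.7, 0.8, 0.5, 0.9, training would stop after the fourth epoch. Deep learning frameworks generally ship equivalent callbacks, which would be preferable in practice.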
The object detection pipeline has one component for generating proposals for classification.
Proposals are candidate regions for the object of interest. Most approaches employ
a sliding window over the feature map and assign foreground/background scores depending on the
features computed in that window. Neighbouring windows have similar scores to some
extent and are all considered candidate regions. This leads to hundreds of proposals. As the
proposal generation method should have high recall, constraints are kept loose at this stage.
However, processing this many proposals through the classification network is
cumbersome. This motivates a technique called Non-Maximum Suppression (NMS), which filters proposals based on
set criteria. The IoU is used to measure the overlap between
two proposals.
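A minimal sketch of greedy Non-Maximum Suppression follows. It assumes boxes in (x1, y1, x2, y2) form with one score each and an overlap threshold of 0.5; the function names and the threshold default are illustrative choices rather than the project's settings.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

Each proposal is compared only against higher-scoring proposals that have already been kept, so near-duplicate windows around the same object collapse to the single best-scoring box.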
Parameter                       Value
Number of classifiers           80
Classifiers used                5 (Car, Bus, Truck, Motor Cycle and Train)
Total number of input images    11682
Training images                 9736
Testing images                  1946
CHAPTER 5
SIMULATION RESULTS AND ANALYSIS
This section describes the simulation results. The performance parameters observed are accuracy,
precision and recall. It also presents the confusion matrices for the different datasets and the
convolution layers of the algorithms.
5.1 SINGLE OBJECT DETECTION
A CNN is designed for single object detection. The layers and the details of each layer are shown
in the Fig. below.
It lists the parameters included at each step, the layer progression and the output
image size of every layer. Each layer performs
an operation on the image matrix. The output image size differs between layers because of the operation
performed by each layer. Initially the output image size is 28×28, which reduces to 14×14 at
the first max pooling layer, which picks the maximum-valued pixel from each neighbourhood. It
then reduces to 7×7 at the second max pooling layer. The resulting 7×7×64 feature map is then flattened into
a vector of size 3136. This vector is reduced to a smaller vector by the
succeeding layers, and the final calculation parameters are displayed.
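The size progression described above can be checked with a few lines of arithmetic. The sketch assumes 2×2 max pooling with stride 2 and convolutions that preserve the spatial size; these settings are consistent with the 28, 14, 7 sequence but are not stated explicitly in the report, and the function names are hypothetical.

```python
def pool_output(size, pool=2, stride=2):
    """Spatial size after a max pooling layer with no padding."""
    return (size - pool) // stride + 1


def trace_shapes(size=28, channels=64, num_pools=2):
    """Return the spatial size after each pooling layer and the flattened length."""
    sizes = [size]
    for _ in range(num_pools):
        sizes.append(pool_output(sizes[-1]))
    flattened = sizes[-1] * sizes[-1] * channels
    return sizes, flattened
```

Calling `trace_shapes()` reproduces the 28, 14, 7 progression and the flattened length 7 × 7 × 64 = 3136.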
The designed neural network was trained and tested. The training accuracy and loss obtained are shown
in the Figures. A training accuracy of 82% was obtained for this model. Loss and accuracy
are inversely related. As the number of epochs increases, the model learns more from the training data
and hence the loss decreases. Each time an epoch is run, the model trains itself and the
weights of the convolutional network are updated to more accurate values.
The CNN is able to classify the given objects as truck and car with accuracies of
75.68% and 84.409% respectively, as shown in the Fig. below.
Upon simulation, it correctly classifies the vehicles, identifying a car with 79.853%
accuracy and an auto with about 78.122% accuracy, as shown in the Fig. below.
Upon simulation, it correctly classifies the vehicles, identifying a car with
about 79.036% accuracy and an auto with about 80.064% accuracy, as shown in the Fig. below.
The confusion matrix for day images is tabulated in the Table. The performance parameters are
extracted from the confusion matrix and tabulated in the Table. Accuracy, precision and recall
are reported for autos, cars and heavy vehicles, as shown in the Fig. The accuracy for autos
and cars is almost identical, while that for heavy vehicles is slightly better.
Since the number of training images is larger for the day images, the results obtained are better
than those for the evening and night images. High precision indicates that the algorithm
returned substantially more relevant results than irrelevant ones, while high recall means that
the algorithm returned most of the relevant results.
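The way these parameters follow from confusion-matrix counts can be shown with a short sketch. The function name is hypothetical, and the counts in the usage note are made-up illustrative numbers, not results from this work.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: fraction of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of actual positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}
```

With, say, 90 true positives, 10 false positives, 30 false negatives and 70 true negatives, precision is 0.9 while recall is 0.75: few of the returned detections are wrong, but a quarter of the relevant objects are missed.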
5.2 RESULT
CHAPTER 6
6.1 APPLICATIONS
3. Autonomous Driving.
1. Image processing techniques generally don't require historical data for training and are
unsupervised in nature.
Pros: These tasks do not require annotated images, in which humans have labeled the data
manually (for supervised training).
Cons: These techniques are limited by multiple factors, such as complex scenarios
(without a unicolor background), occlusion (partially hidden objects), illumination and
shadows, and clutter.
Pros: Deep learning object detection is significantly more robust to occlusion, complex
scenes and challenging illumination.
Cons: A huge amount of training data is required; the process of image annotation is
labor-intensive and expensive. For example, labeling 500,000 images to train a custom DL
object detection algorithm is considered a small dataset. However, many benchmark datasets
(MS COCO, Caltech, KITTI, PASCAL VOC, V5) provide labeled data.
CONCLUSION
The use of Artificial Intelligence to solve computer vision tasks has outperformed
image processing approaches to the same tasks. The CNN model trained on the on-road vehicle
dataset for single object detection achieved a validation accuracy of 95.7% for autos, 95.5%
for cars and 96% for heavy vehicles on day images. The high validation accuracy is due to the
large amount of data from each class on which it was trained. Performance metrics are tabulated
for day, evening and NIR images. Multiple object detection is implemented using YOLOv3 for the
KITTI and COCO datasets. Performance metrics are tabulated for YOLOv3 on the considered classes
of images. The higher the precision of each class, the greater the mAP value. The mAP value
depends on the images chosen for the calculation. An IoU threshold of 0.5 is ideal for detection and tracking. mAP
values can be improved by increasing the number of true positives. The results of the performance metrics are
entirely dependent on the image dataset used. Further, objects are detected in video based on a region
of interest. The quantities measured include the speed and color of the vehicle, the type of
vehicle, the direction of vehicle movement and the number of vehicles in the ROI. Multiple object
tracking is implemented for traffic surveillance video using YOLOv3 and OpenCV. Multiple
objects are detected and tracked across different frames of a video. Future work includes training the models further on
powerful GPUs with a larger number of images, evaluating the models on other datasets,
and modifying the design if required to make the models more robust and suitable for real-time
applications.
REFERENCE
• V. D. Nguyen et al., “Learning Framework for Robust Obstacle Detection,
Recognition, and Tracking”, IEEE Transactions on Intelligent Transportation Systems,
vol. 18, no. 6, pp. 1633-1646, June 2017.
• Z. Kain et al., “Detecting Abnormal Events in University Areas”, 2018
International Conference on Computer and Applications (ICCA), pp. 260-264, 2018.
• P. Wang et al., “Detection of unwanted traffic congestion based on existing
surveillance system using in freeway via a CNN-architecture trafficnet”, IEEE
Conference on Industrial Electronics and Applications (ICIEA), Wuhan, 2018, pp.
1134-1139.
• Q. Mu, Y. Wei, Y. Liu and Z. Li, “The Research of Target Tracking Algorithm Based
on an Improved PCANet”, 10th International Conference on Intelligent Human-Machine
Systems and Cybernetics (IHMSC), Hangzhou, 2018, pp. 195-199.
• H. C. Baykara et al., “Real-Time Detection, Tracking and Classification of Multiple
Moving Objects in UAV Videos”, 29th IEEE International Conference on Tools with
Artificial Intelligence (ICTAI), Boston, MA, 2017, pp. 945-950.
• W. Wang, M. Shi and W. Li, “Object Tracking with Shallow Convolution Feature”, 9th
International Conference on Intelligent Human-Machine Systems and Cybernetics
(IHMSC), Hangzhou, 2017, pp. 97-100.
• K. Muhammad et al., “Convolutional Neural Networks Based Fire Detection in
Surveillance Videos”, IEEE Access, vol. 6, pp. 18174-18183, 2018.
• D. E. Hernandez et al., “Cell Tracking with Deep Learning and the Viterbi Algorithm”,
International Conference on Manipulation, Automation and Robotics at Small Scales
(MARSS), Nagoya, 2018, pp. 1-6.