Object Detection and Recognition Using YOLO - Detect and Recognize
Abstract
The world in the 21st century is ever evolving towards automation. This upsurge seemingly has
no decline in the foreseeable future. Image recognition is at the forefront of this charge which
seeks to revolutionize the way of living of the average man. If robotics can be likened to the cre-
ation of a body for computers to live in, then image processing is the development of the part of
its brain which deals with the identification and recognition of images.
To accomplish this task, we developed an object detection algorithm using YOLO, an acronym
for “You Only Look Once”. Our algorithm was trained on fifty thousand images, evaluated
on ten thousand images, and employed a 21 x 21 grid. We also programmed a text generator
which randomly creates texts and URLs in an image; useful information about the location of
the URLs in each image is recorded and later passed to the YOLO algorithm for training.
At the end of this project, we observed a significant difference in URL detection accuracy
between OCR software and our YOLO algorithm. However, our algorithm is best used to specify
the region of interest before converting to text, which greatly improves accuracy when combined
with OCR software.
Keywords: Object Detection; Image Recognition; OCR; AI; World Wide Web; YOLO.
Acknowledgements
I would like to express my deepest appreciation to God for life. I would also like to ex-
tend my deepest gratitude to all the Professors at St. Cloud State University, who helped make
this academic journey worthwhile. Learning Software Development concepts, Database con-
cepts, Data Structure and Algorithm concepts, Machine Learning concepts, and other important concepts made this journey a truly rewarding experience.
I would like to thank my friends, Alienyi Daniel, Ojo Ore, Egwu Benedict, Nwachi Eurel,
et al. for their constant support and encouragement. Finally, and most importantly, I want to
thank my family for believing in me and constantly encouraging me throughout this journey.
Table of Contents

Chapter
I. Introduction
    Overview
    Motivation
    Overview
    Pytesseract
    Regular Expressions
    Overview
    Data Preparation
    Training
    Evaluation Metric
    Overview
V. Conclusion
    Overview
    Application
References
List of Figures

5. Activation functions. (1) The left curve is a sigmoid function curve. (2) The right curve is a tanh function curve [35].
6. A sample fully connected neural network with only one hidden layer [35].
15. Word bank used to generate our training image data.
16. Sample generated image using our custom-built tagging tool.
17. Sample grid generated by YOLO for our detection algorithm.
20. The higher the IoU, the better the performance [24].
21. (a) Object detection algorithm after 10 epochs. (b) Object detection algorithm for an image without any link. (c) Improved object detection algorithm after 200 epochs.
22. Accuracy between two Optical Character Recognition techniques and our model on ten images using regular font style.
25. OCR image conversion to text versus our object detection model.
27. Detection Accuracy between Optical Character Recognition techniques and our model on ten images using italicized font style.
Chapter I: Introduction
Overview
In this chapter, we introduce the background and focus area of our thesis. We highlight
our motivations, scope, and thesis organization. We will also explore YOLO’s potential in recognizing URL links within an image scene.
Motivation
Image recognition is, however, far from perfect. Seemingly simple scenarios, such as object detection, face recognition, and removing motion blur, as well as more complex scenarios such as compression artefacts, scratch detection, sensor noise, and spilling detection, are all applications of image recognition and image processing.
Digitized images are often represented as a two-dimensional (2D) array of pixel values.
Each pixel value, which makes up the color scheme of the image, is influenced by an array
of factors such as light intensity. A visual scene is projected onto a surface, where receptors (natural or artificial) produce values that depend on the intensity of the incident light.
These exciting concepts are, however, hard to implement. Forming an image leads to a loss of information, because a three-dimensional scene is projected onto a two-dimensional image. Many other factors are responsible for why image recognition and image processing are hard. Some of these factors are noise in the image (pixel values that are off from their surroundings).
In recent years, during the ImageNet Large Scale Visual Recognition Competition
(ILSVRC, 2015), computers performed better than humans in the image classification task [35].
In 2016, a faster object detector, YOLO, was proposed to implement object detection in real-time
situations. Our motivation is to apply YOLO to the task of detecting URL links within an image scene. We will also compare the speed and accuracy of this approach with OCR software.
Research aim
The research aim of this thesis is to design and implement an AI algorithm to detect URL
links within image scenes using YOLO. Subsequently, we will use Optical Character Recognition (OCR) software to convert these images to text and use regular expressions to search for
links within the text. Both approaches will be compared for accuracy. We also propose a few research questions.
The first research question of this study is to verify whether YOLO is a good model for
general object detection or not. Furthermore, given the nature of URLs and the fact that this class
of classification does not yet exist, whether YOLO is a potential candidate for URL detection
will be considered.
The second research question is to evaluate the accuracy of our YOLO detection model against the OCR-based approach.
An image (from Latin: imago) is an artefact that depicts visual perception, such as a
photograph or other two-dimensional picture, that resembles a subject - usually a physical object
- and thus provides a depiction of it [1]. An image is a visual representation of something [2]. An image can also be described as the optical counterpart of an object produced by an optical device (such as a lens or mirror).
While the orthodox idea of an image is that it is usually only two-dimensional and thus must
appear on a screen, the technical usage of the word suggests that an image can also be three-dimensional, such as a statue or hologram. Images may be captured by optical devices, such as cameras, mirrors, lenses, telescopes, and microscopes, or by natural objects and phenomena, such as the human eye or water surfaces.
More interestingly perhaps, is that the term image lends itself to more than physical ap-
plications. The term also lends itself to mental and metaphysical uses. As such, an image can be
interpreted to mean, a mental image - an impression created in the imagination of a person upon
the suggestion of a particular concept, and a divine image- an image believed to be transmitted
into the human psyche by a supernatural entity. An image may also be classified as being either
fixed or volatile. A volatile image is one that exists only for a short period of time; it may be, for example, a scene displayed on a cathode ray tube. A fixed image, also called a hard copy, is one that has been recorded on a material object, such as paper or textile, by photography or any other digital process [4].
The term ‘image’ in Computer Science is employed in a more technical scope. It is both
restricted as well as specific in its application. Image processing for instance is a set of computa-
tional techniques for analyzing, enhancing, compressing, and reconstructing images. Its main
components are importing, in which an image is captured through scanning or digital photog-
raphy; and the analysis and manipulation of the image [5]. In information technology, the term has several usages.
Firstly, it may be employed in its base form, in which sense an image is a picture that has
been made or copied and stored in electronic form. An image can also be described in terms of
vector graphics or raster graphics. An image stored in raster form is sometimes called a bitmap.
Bitmap images are made from small parts called pixels. Vector images are made using coordi-
nates and geometry. Images can be compressed to reduce file size [6].
Common image file formats include the following:
1. JPEG is a graphic image file produced according to a standard set by the Joint Photographic Experts Group, an ISO/IEC group of experts that develops and maintains standards for a suite of compression algorithms for computer image files.
2. GIF (pronounced JIF by many, including its designer, and GIF with a hard G by many others) stands for Graphics Interchange Format. The GIF employs a 2D raster data type and is encoded in binary. GIF files ordinarily have the .gif extension. GIF89a is an animated GIF image, formatted according to GIF Version 89a. One of the merits of this format is the ability to create an animated image that can be played after transmitting to a viewer: for example, a twirling icon, a banner with a hand that waves, or letters that magically get larger. A GIF89a can thus be used to add simple animation to a web page.
3. PNG (pronounced ping) is an acronym for Portable Network Graphics. It is a file for-
mat used for image compression and was designed to provide a number of enhancements
over the GIF format. Like a GIF, a PNG file is compressed in lossless fashion, this means
that all image information is restored when the file is decompressed during viewing. Files usually have the .png extension.
4. SVG (Scalable Vector Graphics) describes an image as an application of XML. Any program, such as a browser, that recognizes XML can display the image using
the information provided in the SVG format. Scalability means that the file can be
viewed on a computer display of any size and resolution, whether the small screen of a
smartphone or a large widescreen display in a PC. Files usually have .svg extension.
5. TIFF (Tag Image File Format) is a common format for exchanging raster graphics
(bitmap) images between application programs, including those used for scanner images.
A TIFF file can be identified as a file with a .tif or .tiff file name suffix.
Furthermore, the term is also often used to mean a disk image, a disk image is a copy of
the entire contents of a storage device, such as a hard drive or DVD. The disk image represents
the content exactly as it is on the original storage device, including both data and structure infor-
mation.
Lastly, another use of the term image is for a section of random-access memory (RAM) that has been copied to another memory or storage location.
Object detection is a computer vision technique that enables a system (an algorithm)
to estimate the location of objects in a digitized scene such as an image or video. Usually, a
bounding box is drawn around the detected object, which helps humans locate the object
more quickly than in unprocessed images. For this discourse, an object is the representation of a physical
object (URL) in an image. In image processing, it is an identifiable portion of an image that can
be interpreted as a single unit [7]. This creates a sharp contrast to the layman’s idea that an image is itself a single, indivisible unit.
Usually, an image may contain one or more objects, the discernibility of which is of utmost importance. For instance, in a single image the objects contained can range from a single object to many distinct objects.
Although “detection” could mean locating a hidden or concealed object, detection may also
mean the ability of an intelligence to signify the existence and identity of an object. The object in question does not have to be hidden. This latter form is the form on which we based this
thesis.
Interpreting the object localization for object detection can be done in various ways, in-
cluding creating a bounding box around the object or marking every pixel in the image which
contains the object (called segmentation) [8]. Thus, given an image or video stream, an object
detection model should be able to identify which of a known set of objects might be present and provide information about their positions within the image.
Object Detection is widely employed in various computer vision tasks such as image an-
notation, activity recognition, face detection, face recognition, video object co-segmentation. It is
also used in tracking objects. For example, tracking a ball during a football match, tracking
movement of a cricket bat, or tracking a person in a video are some of its various uses [10].
Fundamentally, two approaches to object detection exist: machine learning-based
approaches and deep learning-based approaches [11]. In more traditional ML-based approaches,
computer vision techniques are employed to analyze various features of an image, such as the
color histogram or edges, to identify groups of pixels that may belong to an object. These features are then fed into a regression model that predicts the location of the object along with its label.
Some machine learning approaches are the Viola–Jones object detection framework based on
Haar features, the Scale-Invariant Feature Transform (SIFT), and Histogram of Oriented Gradients
(HOG) features [10]. On the other hand, deep learning-based approaches employ convolutional
neural networks (CNNs) to perform end-to-end object detection, in which features
do not need to be defined and extracted separately [11]. Some deep learning approaches are: Region Proposals (R-CNN, Fast R-CNN, Faster R-CNN, Cascade R-CNN), Single Shot MultiBox
Detector (SSD), You Only Look Once (YOLO), and Single-Shot Refinement Neural Network for Object Detection (RefineDet).
Object detection is often confused with image recognition (Figure 1). A picture of a dog
receives the label “dog”. A picture of two dogs, still receives the label “dog”. Object detection,
on the other hand, draws a box around each dog and labels the box “dog”. The model predicts
where each object is and what label should be applied. In that way, object detection provides more information about an image than recognition alone.
Recognition is the ability of an intelligence to identify an object based on certain similarities that it shares with another object that the intelligence has previously
encountered. Recognition may be based on inference or relation, that is, a situation whereby an
intelligence is able to recognize an object because it recognizes similarities in form and properties. Recognition may also occur because the artificial intelligence has encountered the exact same object before.
In human beings, recognition is a cognitive process that happens seamlessly and almost
instantly without any hitch. The human brain is capable of learning and adapting information
with minimal effort such that even humans that are still in the developmental stages of their ex-
istence can recognize objects and patterns easily. Humans can recognize a multitude of objects in
images with little effort, even though the image of the objects may vary somewhat in different
viewpoints, in many different sizes and scales or even when they are translated or rotated. Ob-
jects can even be recognized when they are partially obstructed from view [12]. Artificial intelli-
gence, however, does not innately possess this cognitive ability. For Artificial Intelligence to ac-
quire this level of skill they must acquire training. This training is usually acquired by ‘teaching’
the A.I. using coding, datasets, and databases. This task is still a challenge for computer vision
systems, given that these A.I. systems need to be trained for each class of object they are meant to recognize.
The object recognition problem can be defined as a labelling problem based on models of
known objects. Formally, given an image containing one or more objects of interest (and back-
ground) and a set of labels corresponding to a set of models known to the system, the system
should be able to accurately assign correct labels to regions, or a set of regions, in the image. The
object recognition problem is closely tied to the segmentation problem: without at least a partial
recognition of objects, segmentation cannot be done, and without segmentation, object recognition is not possible. Object recognition is also computationally difficult because each object in the world can cast an infinite number of different 2-D images onto the retina
as the object's position, pose, lighting, and background vary relative to the viewer. Yet the brain
solves this problem effortlessly. Progress in understanding the brain's solution to object recognition requires the construction of artificial recognition systems that ultimately aim to emulate our own visual abilities.
Overview
In this Chapter, we will review the basic concepts of general neural networks, explain a
subclass of neural network, Convolutional Neural Network (CNN), Batch Normalization, Max
Pooling, Activation Function, and Rectified Linear Unit (RELU). We will also describe a specific
CNN based object detection model, YOLO, which is the most accurate real-time object detector
[35].
The neural networks described in this thesis are feed-forward neural networks, in which information flows in one direction, from the input layer to the output layer.
Image Processing
Image processing is a technique used to detect objects/patterns and perform operations on an image, usually with the aim of retrieving valuable information from the image or enhancing it. The processing of digital images can be divided into several classes: image enhancement, image restoration, image analysis, and image compression.
Image Enhancement: This is a technique used to adjust digital images for better display or analysis. An image can be enhanced using median filtering, linear contrast adjustment, histogram equalization, and similar operations.
Image Restoration: An image can be blurred due to many factors, such as relative motion between the camera and the object. Noise is often caused by environmental conditions such as rain or snow, or even by thermal signals [36].
Image Analysis: Images that contain useful information can be used to outline objects
and describe them. Some examples of image analysis are edge extraction and image segmentation.
Image Compression: When an image is compressed, its size in bytes is minimized without degrading the quality of the image. The advantage of compression is that more images can be stored in a given amount of disk or memory space.
The term artificial intelligence was first used in 1956, at a computer science conference
in Dartmouth. It described an attempt to digitally model how the human mind worked and based
on this knowledge create more advanced computers. Even though some truly groundbreaking
progress has been made, today, artificial intelligence is ubiquitous, but computers are still far
from modelling human intelligence to perfection. The goal of “strong” AI is to create artificial persons: machines that have all the mental powers we have, including phenomenal consciousness. “Weak” AI, by contrast, seeks to build information-processing machines that merely
appear to have the full mental repertoire of human persons (Searle 1997). Weak, or narrow, AI
is created to be quite adept at carrying out the purpose for which it was designed, but
it will not pass for human in any field outside of its defined capacities. An example of this
is Deep Blue, the first computer to defeat a reigning world chess champion; it defeated the famous grandmaster Garry Kasparov in 1997.
To facilitate this growth and development in the field, machines must learn how to behave, and this process is facilitated through a system known as machine learning.
Machine Learning
Machine learning is an area within Artificial Intelligence (AI), that focuses on decision
making and predictions. The primary aim is to allow the computer to be further developed with-
out any human intervention; instead, it is trained based on observations and data [36]. Machine
learning algorithms are often categorized as supervised or unsupervised. Supervised algorithms apply what has been learned in the past to new data using labeled examples; this type of
problem involves predicting either a continuous quantity (Regression) or discrete class labels (Classification). Unsupervised algorithms, in contrast, draw inferences from unlabeled data, for example by discovering hidden structures or patterns (Clustering) or by finding relationships between variables in big data (Association).
Machine learning is a subset of the larger field of Artificial Intelligence that “focuses on
teaching computers how to learn without the need to be programmed for specific tasks [15],” “In
fact, the key idea behind ML (sic) is that it is possible to create algorithms that learn from and
make predictions on data [15].” To educate a machine, three important components must be present, namely a dataset, features, and an algorithm. These components will be discussed serially
in subsequent paragraphs.
A dataset is a pool of samples compiled with the intention of educating a machine on how
to go about a task. These pools can sometimes be numbers, images, texts, or any other kind of
data. It can sometimes be quite strenuous and expensive to compile a dataset, depending on how
specialized the task is expected to be. For instance, a dataset of exclusive Rolex watches would be much harder to compile than one of everyday objects.
Features are important pieces of data that act as the key to the solution of the task. They
show the machine what to look for in an image. The way features are defined vary based on what
the machine is expected to look out for. For instance, to predict the price of an apartment, it
would prove too difficult to predict the cost of the property by way of linear regression based
on the combination of its length and width. However, it may be easier to predict a price based on
the correlation between the price and the location of a building. Therefore, there is no gainsaying the
fact that the accuracy of the features provided affects the eventual performance of the machine.
In supervised learning, the machine is trained with labeled data which contain the “right solutions,” together with a validation set. During the learning process, the program is expected to arrive at the
“right” solution, and the validation set is then used to tune hyperparameters to avoid overfitting.
However, in unsupervised learning, features are learned from unlabeled input data. In this case the
machine is not given any features to familiarize itself with, it learns to notice the patterns by it-
self.
An algorithm is a finite sequence of well-defined instructions, typically used to solve a class of problems or to perform a computation [16]. Algorithms are always
unambiguous and are used as specifications for performing calculations, data processing, auto-
mated reasoning, and other tasks [17]. The instructions describe a computation that, when exe-
cuted, proceeds through a finite number of well-defined successive states, eventually producing
“output”.
Evaluating a machine learning model requires that our dataset be split appropriately. It is
typical to have 70% of the data for training and 30% for testing. However, it is also advisable to set aside
about 20% as a validation dataset, which makes it possible to evaluate the model while still building and
tuning it. This helps to avoid overfitting and discourages using the training dataset for
testing. After shuffling the data, a split of 60% training dataset, 20% validation dataset, and 20% testing dataset is recommended.
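As an illustrative sketch of this split (not the thesis code; the placeholder arrays, and the use of scikit-learn, are assumptions made for demonstration), a shuffled dataset can be divided 60/20/20 as follows:

import numpy as np
from sklearn.model_selection import train_test_split

images = np.random.rand(100, 32, 32, 3)     # placeholder images
labels = np.random.randint(0, 2, size=100)  # placeholder labels

# First hold out 40% of the data, then split that portion in half.
x_train, x_rest, y_train, y_rest = train_test_split(
    images, labels, test_size=0.4, shuffle=True, random_state=42)   # 60% training
x_val, x_test, y_val, y_test = train_test_split(
    x_rest, y_rest, test_size=0.5, shuffle=True, random_state=42)   # 20% validation, 20% testing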
Speed, precision, recall, and accuracy are important metrics for evaluating a machine learning algorithm. The accuracy of an object detection model is calculated using the Mean Average Precision (mAP).
Accuracy is the ratio of correct predictions to the total number of input samples.
𝑎𝑐𝑐𝑢𝑟𝑎𝑐𝑦 = correct predictions / all predictions
Precision is the ratio of relevant examples (true positives) among all the examples which were predicted to belong to a certain class:
𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = true positives / (true positives + false positives)
Recall is the ratio of examples which were predicted to belong to a class with respect to all the examples that truly belong to that class:
𝑟𝑒𝑐𝑎𝑙𝑙 = true positives / (true positives + false negatives)
True Positive (TP) [36]: The model predicts the positive class, and the actual output is
positive.
False Positive (FP) [36]: The model predicts the positive class, but the actual output is
negative.
False Negative (FN) [36]: The model predicts the negative class, but the actual output is
positive.
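A minimal sketch of how these counts translate into the metrics defined above (illustrative only; the example numbers are hypothetical):

def precision(tp, fp):
    # Fraction of predicted positives that are actually positive.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Fraction of actual positives that the model found.
    return tp / (tp + fn) if (tp + fn) else 0.0

def accuracy(tp, tn, fp, fn):
    # Fraction of all predictions that are correct.
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

# Example: 80 correctly detected URLs, 10 spurious boxes, 10 missed URLs.
print(precision(80, 10))  # 0.888...
print(recall(80, 10))     # 0.888...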
Deep Learning
Apart from the machine learning discussed earlier, there are newer, more specialized
forms of learning which have grown into a subset of machine learning known as Deep
Learning. Deep learning is a more powerful branch of machine learning based on the perceived
structure of the human brain. Deep learning algorithms use complex multi-layered neural networks, where the level of abstraction increases gradually through non-linear transformations of the input
data.
In a neural network, information moves from one layer to another over a set of interconnecting channels. These channels are called weighted channels because of the weight value attached to each of them.
All neurons possess a unique number known as bias. This bias combined with the
weighted sum of inputs reaching the neuron is then applied to the activation function. The result
of the function determines if the neuron gets activated. Every activated neuron passes on infor-
mation to the following layers. This continues up to the second last layer. The output layer in an
artificial neural network is the last layer that produces outputs for the program [18]. The field of
deep learning is very useful in fields such as object recognition as well as voice recognition.
In 2013, R-CNN was the first algorithm to apply deep learning to Object detection. It out-
performed the previous algorithms by up to 30% on the Visual Object Classes Challenge (VOC2012), affirming its place as the frontrunner of the object detection industry and marking a huge improvement in the field.
Object detection involves two difficulties: locating objects and supplying the appropriate
classifications.
Neural Network
A neural network is a computer system modeled on the human brain and the nervous system. In AI, the term usually refers to an artificial neural network, composed of artificial neurons or nodes. Thus, a neural network is either a biological neural network, made up of real biological neurons, or an artificial neural network, used for solving artificial intelligence (AI) problems.
Neuron
Neuron, in biology, is the basic working unit of the brain, an electrically excitable cell
that communicates with other cells via specialized connections called synapses. A neuron is the
atomic element of a neural network. A perceptron is the mathematical model of a biological neuron: it sums weighted input values to represent the total strength of the input signals and applies a step function as the activation function on the sum to output the result. The output generated is fed into other perceptrons, and the final output is bounded between 0 and 1. A multi-layered arrangement of perceptrons is called a neural network.
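The following minimal sketch, written for illustration rather than taken from the thesis, shows a single perceptron computing the weighted sum of its inputs plus a bias and applying a step activation; the weights, inputs, and bias values are arbitrary:

import numpy as np

def perceptron(inputs, weights, bias):
    total = np.dot(inputs, weights) + bias   # total strength of the incoming signals
    return 1.0 if total > 0 else 0.0         # step activation bounds the output to {0, 1}

print(perceptron(np.array([0.5, 0.3]), np.array([0.8, -0.2]), bias=-0.1))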
Figure 5. Activation functions. (1) the left curve is a sigmoid function curve. (2) the right curve
is a tanh function curve [35].
Structure of Neural Networks
Neural networks typically consist of multiple layers and each layer has many neurons as
shown in Figure 6. The first layer is known as the input layer where each circle represents an in-
put neuron. The rightmost layer is the output layer and contains output neuron(s), and all the
middle layers are hidden layers. General neural networks have fully connected neurons and neu-
rons between adjacent layers are connected for passing information. For example, if two adjacent
layers have 𝑚 and 𝑛 neurons respectively, the total number of connections will be 𝑚 × 𝑛 [35].
Figure 6. A sample fully connected neural network with only one hidden layer [35].
Cost Function and Training
Training involves turning a poorly performing neural network (due to randomly initialized weights) into a network with high accuracy. Training is equivalent to minimizing a loss function. Neural networks learn from the errors between the predictions and ground truths by updating their parameters (weights) [35]. A cost function is a measure of error between the predicted
value of a model and what the actual value is. The error range continues to shrink during the
training process until the value stays constant which indicates the training process is complete.
Figure 7 shows a visual representation of a training process. The bowl-shaped error surface indi-
cates the error sliding from the edge to the bottom (where the minimum error is) [35].
Convolutional Neural Network (CNN) first surfaced around 1998 [35] but was proved to
be an efficient tool for image classification in 2012 at the Large Scale Visual Recognition Chal-
lenge. CNN is a deep learning algorithm which can take an input image, assign importance
(learnable weights and biases) to various aspects/objects in the image and be able to differentiate
one from the other. The pre-processing required in a CNN is much lower as compared to other
classification algorithms. While in primitive methods filters are hand-engineered, with enough training, CNNs are able to learn these filters and characteristics on their own.
Pooling
After each convolution operation, a pooling operation follows to further simplify the information. The pooling process compresses the previous feature maps into condensed feature
maps. The two most common pooling methods are average-pooling and max-pooling. Average-pooling outputs the average value within the pooling field, and max-pooling selects
the maximum value. Figure 8 shows a max pooling operation with a 2 × 2 pooling filter. The
size of the feature map is reduced from 24 × 24 to 12 × 12. We will be using max-pooling for our model.
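A minimal sketch of this max-pooling step, assuming a placeholder 24 × 24 feature map and a 2 × 2 pooling filter:

import numpy as np

feature_map = np.random.rand(24, 24)                         # placeholder feature map
pooled = feature_map.reshape(12, 2, 12, 2).max(axis=(1, 3))  # take the max of each 2 x 2 block
print(pooled.shape)                                          # (12, 12)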
Figure 9 is an object detection structure which has a region proposal component followed
by a CNN classifier. Researchers use region proposal methods to produce a bunch of candidate
regions, each of which may contain one kind of object. Each region is then passed through the
CNN classification algorithm. The idea is to convert a multiple-object detection problem into a
single-object classification problem. However, the region proposal methods are much slower
than the classification part, which becomes the bottleneck for the whole system. The drawback of
this structure is that we cannot tradeoff accuracy for detection speed in time critical applications
[35].
The YOLO model was first brought into existence by Joseph Redmon in his paper “You
only look once, Unified, Real-time object detection”. The mechanism for the algorithm employs
the use of a single neural network that takes a photograph as an Input and attempts to predict
bounding boxes and class labels for each bounding box directly. Although this offered less pre-
dictive accuracy, which was mostly due to more localization errors, it boasted speeds of up to 45
frames per second, and up to 155 frames per second on speed-optimized versions of the model [19].
“Our unified architecture is extremely fast. Our base YOLO model processes images in
real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors” [19].
To begin with the model operates by splitting the inputted image into a grid of cells,
where each cell is responsible for predicting a bounding box if the center of a bounding box falls
within it. Each grid cell predicts a bounding box involving the x, y coordinate and the width and
height, and a metric of prediction quality known as a confidence score. A class prediction is also made for each grid cell.
To illustrate, consider an example: an image may be
divided into a 7 × 7 grid and each cell in the grid may predict 2 bounding boxes, resulting in 98
proposed bounding box predictions. The class probabilities map and the bounding boxes with
confidences are then combined into a final set of bounding boxes and class labels.
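For reference, a short sketch of this arithmetic using the values from the original YOLO paper (a 7 × 7 grid, 2 boxes per cell, and 20 PASCAL VOC classes); our own model instead uses a 21 × 21 grid with a single URL class:

S, B, C = 7, 2, 20                          # grid size, boxes per cell, classes (paper values)
boxes_per_image = S * S * B                 # 98 proposed bounding boxes
output_tensor_size = S * S * (B * 5 + C)    # each box carries x, y, w, h, and a confidence
print(boxes_per_image, output_tensor_size)  # 98 1470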
YOLO was not without shortcomings; the algorithm had a number of limitations because of the number of grid cells it could run on, as well as some other issues which will be addressed subsequently. Firstly, the model uses a 7 × 7 grid, and since each grid cell can only identify
one object, the model restricts the maximum number of detectable objects to 49. Secondly, the
model suffers from the close-detection problem: since each grid cell is only capable of
detecting one object, if a grid cell contains more than one object it will be unable to detect all of them.
Thirdly, a problem might arise because an object may span more than one grid cell; thus,
there exists a possibility that the model might detect the object more than once [20]. Due to the
aforementioned problems encountered when running YOLO, it was fairly obvious that localiza-
tion error and other problems of the system needed to be addressed. As a result of that, YOLOv2
was created as an improvement to deal with the issues and questions posed by its predecessor.
Therefore, localization errors, as well as errors of recall, were significantly addressed in the new
version. The model was updated by Joseph Redmon and Ali Farhadi to further revamp model
performance in their 2016 paper named “YOLO9000: Better, Faster, Stronger [19]”.
Structure of YOLO
YOLO is implemented as a convolutional neural network and has been evaluated on the PASCAL VOC detection dataset. It consists of 24 convolutional layers followed by 2 fully connected layers. The layers are separated by their functionality in the following manner:
1. The first 20 convolutional layers, followed by an average pooling layer and a fully connected layer, are pre-trained on the ImageNet 1000-class classification dataset.
2. The pretraining for classification is performed on images with a resolution of 224 × 224.
3. The layers comprise 1 × 1 reduction layers followed by 3 × 3 convolutional layers.
4. The last 4 convolutional layers followed by 2 fully connected layers are added to train the network for object detection.
5. Object detection requires more granular detail, hence the resolution of the dataset is increased to 448 × 448.
6. The final layer predicts the class probabilities and bounding boxes.
All layers use leaky ReLU activation, except the final layer, which uses a linear activation.
The input is a 448 × 448 image, and the output is the class prediction of the object enclosed in the bounding box.
Network Structure
The whole system can be divided into two major components: Feature Extractor and De-
tector; both are multi-scale. When a new image comes in, it goes through the feature extractor
first so that we can obtain feature embeddings at three (or more) different scales. Then, these features are fed into three (or more) branches of the detector to get bounding boxes and class information [42].
YOLO’s default metric for measuring overlap between two bounding boxes or masks is
Intersection over Union (𝐼𝑜𝑈). Any algorithm that provides predicted bounding boxes as output
can be evaluated using 𝐼𝑜𝑈 [20]. If the prediction is completely correct, 𝐼𝑜𝑈 = 1; the lower the 𝐼𝑜𝑈, the worse the prediction.
To apply Intersection over Union to evaluate an (arbitrary) object detector, two sets of boxes are needed: the ground-truth bounding boxes (the hand-labeled bounding boxes from the testing set that specify where in the image our object is) and the predicted bounding boxes from our model. If these two sets of bounding boxes are present, it is possible to apply Intersection over Union, which is computed as the area of overlap divided by the area of union [22].
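The computation can be sketched in a few lines of Python (an illustrative helper, not the thesis implementation), for boxes given as (x1, y1, x2, y2) corner coordinates:

def iou(box_a, box_b):
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)               # area of overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                         # area of union
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))                  # 0.1428...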
Plainly put, the intersection over union is a ratio of the similarity of the ground truth
bounding box and the predicted bounding box, thus predicting a rough estimate on how much an
Artificial intelligence can rely on the predictions made by the algorithm. This is an improvement
over binary models which label predictions as either correct or incorrect. Also due to the variance
in parameters between the model and the object, it is quite unrealistic to have a 100% match between the (𝑥, 𝑦) coordinates of a predicted bounding box and the (𝑥, 𝑦) coordinates of the ground
truth box [22]. Thus, this equation ensures that boxes with a larger area of overlap get higher
scores than those with smaller areas, cementing Intersection over Union as an excellent metric for
evaluating custom object detectors. Generally, any score greater than 0.8 is a good score [23].
This application of Intersection over Union is used during the testing phase, at the completion of training.
Another interesting application of this model occurs where there is more than one bound-
ing box for the same object. The metric helps to eliminate bounding boxes with lower scores.
How this is done is that if there are two bounding boxes with very high confidence scores, then
the chances are that the boxes are detecting the same object therefore the box with the higher
confidence rating is selected. However, where confidence rating of the two boxes is low, it is
likely that they are predicting separate objects of the same class, such as two different cars in the
same picture. This application is used after the model has completed training and is being de-
ployed [24].
As earlier mentioned, Intersection over union is a good metric for evaluating the quality
of a task. Before further elucidation, it is pertinent to define some terminologies first, these ter-
minologies are true positive, true negative, false negative, and false positive. Simply put, a
true positive is any correctly drawn annotation with an IoU value greater than 0.5, and a true negative is
when the model correctly refuses to draw an annotation because there simply is not one to be drawn. There is
no value here because no annotation is drawn, thus there is no way to calculate true negatives. A
false negative occurs where there are missing annotations while a false positive occurs where
there are incorrectly drawn annotations that have an 𝐼𝑜𝑈 score of less than 0.5. An equation
known as accuracy is usually used to measure the performance of a task because it is an incredibly straightforward measurement: it is simply a ratio of correctly
drawn annotations to the total expected annotations (ground truth). Thus, it is calculated as the sum of True Positives and True Negatives divided by the sum of True Positives, False Positives, True Negatives, and False Negatives.
In the early versions of the YOLO model, there were no constraints on the location pre-
diction. The predicted bounding box was not tethered therefore it could occur far from the origi-
nal grid location. This anomaly resulted in a very unstable model [25]. The bulk of the instability
resulted from predicting the (𝑥, 𝑦) locations for the box. In region proposal networks the network
predicts values 𝑡𝑥 and 𝑡𝑦 and the (𝑥, 𝑦) center coordinates are calculated as:
𝑥 = (𝑡𝑥 ∗ 𝑤𝑎 ) − 𝑥𝑎
𝑦 = (𝑡𝑦 ∗ ℎ𝑎 ) − 𝑦𝑎
For example, a prediction of tx = 1 would shift the box to the right by the width of the an-
chor box, a prediction of 𝑡𝑥 = −1 would shift it to the left by the same amount [25]. This for-
mulation did not exist within any boundaries and as such, any anchor box could end up at any
point in the image, regardless of which location predicted the box. With random initialization it
took the model a long time to stabilize and predict sensible offsets [25]. The subsequent versions of YOLO diverged from this approach and devised a means to properly tackle the problem: the network predicts location coordinates relative to the grid cell and passes them through a logistic function, sigma (𝜎), which ensures that the value remains between 0 and 1 [26]. Given the anchor
box of size (𝑝𝑤 , 𝑝ℎ ) at the grid cell with its top left corner at (𝑐𝑥 , 𝑐𝑦 ), the model predicts the off-
set and the scale, (𝑡𝑥 , 𝑡𝑦 , 𝑡𝑤 , 𝑡ℎ ) and the corresponding predicted bounding box has center
(𝑏𝑥 , 𝑏𝑦 ) and size (𝑏𝑤 , 𝑏ℎ ). The confidence score is the sigmoid (𝜎) of another network output, 𝑡𝑜. Since
the location prediction is constrained, the parameterization is easier to learn, making the network
more stable. The employment of dimension clusters along with directly predicting the bounding
box’s center location improves YOLO by almost 5% over the version with anchor boxes.
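The constrained parameterization corresponds to the equations given in the YOLO9000 paper, 𝑏𝑥 = 𝜎(𝑡𝑥) + 𝑐𝑥, 𝑏𝑦 = 𝜎(𝑡𝑦) + 𝑐𝑦, 𝑏𝑤 = 𝑝𝑤 e^(𝑡𝑤), 𝑏ℎ = 𝑝ℎ e^(𝑡ℎ). A small illustrative decoder (not the thesis code; the sample values are arbitrary) is sketched below:

import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = sigmoid(tx) + cx          # center offset stays inside the predicting grid cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)         # width and height scale the prior (anchor) dimensions
    bh = ph * math.exp(th)
    return bx, by, bw, bh

print(decode_box(0.2, -0.1, 0.5, 0.3, cx=3, cy=4, pw=1.5, ph=2.0))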
A bounding box is a relatively closed space within which points, objects, or a group of
objects may be contained. For YOLO, the cell in which the center of an object resides, is the cell
responsible for detecting that object. Each cell will predict B bounding boxes and a confidence
score for each box. The confidence score will be from ‘0.0’ to ‘1.0’, with ‘0.0’ being the lowest
confidence level and ‘1.0’ being the highest; if no object exists in that cell, the confidence scores
should be ‘0.0’, and if the model is completely certain of its prediction, the score should be ‘1.0’.
These confidence levels capture the model’s certainty that there exists an object in that cell and
that the bounding box is accurate. Each of these bounding boxes is made up of 5 numbers: the x
position, the y position, the width, the height, and the confidence. The coordinates ‘(x, y)’ repre-
sent the location of the center of the predicted bounding box, and the width and height are frac-
tions relative to the entire image size. The confidence represents the Intersection over Union
(IoU) between the predicted bounding box and the actual bounding box, referred to as the ground
truth box [27]. In general, there are five types of bounding boxes, i.e., a surrounding sphere (SS),
an axis-aligned bounding box (AABB), an oriented bounding box (OBB), a fixed-direction hull
(FDH), and a convex hull (CH) [29]. The AABB refers to a box whose axes are parallel to the coordinate axes. It is the rectangle formed by selecting the maximum and minimum horizontal and
vertical coordinates of the vertices of the two-dimensional shape, and it is one of the most commonly used bounding box types [29]. Bounding boxes function to mark the object area that has been detected.
In digital image processing, the bounding box is the coordinates of a rectangle within
which an object may be contained when it is placed over a page, a canvas, a screen, or any other
similar bi-dimensional background [30]. In the field of object detection, a bounding box is usu-
ally used to describe an object location. The bounding box is a rectangular box that can be deter-
mined by the 𝑥 and 𝑦 axis coordinates in the upper-left corner and the 𝑥 and 𝑦 axis coordinates in
the lower-right corner of the rectangle [31]. The first version of YOLO directly predicted all four
values which describe a bounding box. The x and y coordinates of each bounding box are defined relative to the top left corner of each grid cell and normalized by the cell dimensions such
that the coordinate values are bounded between 0 and 1. In YOLOv2, however, there was a shift
in paradigm: the algorithm employs dimension clusters as anchor boxes, and 4 coordinates
are predicted for each bounding box, 𝑡𝑥 , 𝑡𝑦 , 𝑡𝑤 , 𝑡ℎ . If the cell is offset from the top left corner of the image by (𝑐𝑥 , 𝑐𝑦 ) and the bounding box prior has width and height 𝑝𝑤 , 𝑝ℎ , then the predicted box is given by 𝑏𝑥 = 𝜎(𝑡𝑥) + 𝑐𝑥, 𝑏𝑦 = 𝜎(𝑡𝑦) + 𝑐𝑦, 𝑏𝑤 = 𝑝𝑤 e^(𝑡𝑤), and 𝑏ℎ = 𝑝ℎ e^(𝑡ℎ).
During training, the sum of squared error loss is used. Assuming that the ground truth for
some coordinate prediction is 𝑡̂∗ , our gradient is the ground truth value (computed from the
ground truth box) minus our prediction: 𝑡̂∗ − 𝑡∗ . This ground truth value can be easily computed
by inverting the equations above. YOLOv3 predicts an objectness score for each bounding box
using logistic regression. This should be 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior does. If a bounding box prior is not the best but does overlap a ground truth
object by more than some threshold, the prediction is ignored; we use a threshold of 0.5. Usually, a system only assigns one bounding box prior to each ground truth object. If a bounding
box prior is not assigned to a ground truth object, it incurs no loss for coordinate or class predictions, only for objectness.
The concept of a bounding box prior was introduced in YOLOv2. Previously, the model
was expected to learn unique bounding box descriptors for each new image. With priors, a collection of
bounding boxes is defined with varying aspect ratios which embed some prior information about
the shape of the objects we are expecting to detect. Redmon offers an approach for discovering
the best aspect ratios by doing k-means clustering (with a custom distance metric) on all of the bounding boxes in the training dataset.
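A rough sketch of that clustering idea (illustrative only, assuming boxes are represented by their width and height; real implementations run on the labeled training boxes and often use the median rather than the mean):

import numpy as np

def iou_wh(box, clusters):
    # IoU between one (w, h) pair and each cluster, assuming shared top-left corners.
    w = np.minimum(box[0], clusters[:, 0])
    h = np.minimum(box[1], clusters[:, 1])
    inter = w * h
    return inter / (box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter)

def kmeans_anchors(boxes, k, iters=100):
    clusters = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        distances = np.array([1 - iou_wh(b, clusters) for b in boxes])  # d = 1 - IoU
        nearest = distances.argmin(axis=1)
        for c in range(k):
            if np.any(nearest == c):
                clusters[c] = boxes[nearest == c].mean(axis=0)
    return clusters

boxes = np.random.rand(200, 2) * 100   # placeholder (width, height) pairs
print(kmeans_anchors(boxes, k=5))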
Thus, instead of predicting the bounding box dimension directly, the task is reformulated
to simply predict the offset from the bounding box prior in order to fine-tune the predicted
bounding box dimensions. The result of which is that it makes the prediction task easier to learn.
An “objectness” score 𝑝𝑜𝑏𝑗 is trained to approximate the intersection over union between the
predicted box and the ground truth label. When the loss during training is calculated, objects are
matched to whichever bounding box prediction on the same grid cell produced the highest 𝐼𝑜𝑈
score. For unmatched boxes, the only descriptor which will be included in the loss function is 𝑝𝑜𝑏𝑗 .
Upon the introduction of additional bounding box priors in YOLOv2, it was possible to
assign objects to whichever anchor box on the same grid cell has the highest IoU score with the
labeled object.
YOLO (version 3) redefined the "objectness" target score 𝑝𝑜𝑏𝑗 to be 1 for the bounding
boxes with highest 𝐼𝑜𝑈 score for each given target, and the score 0 for all remaining boxes.
However, bounding boxes which have an 𝐼𝑜𝑈 score above a defined threshold, but not the
highest score, are not included when calculating the loss. This simply reflects that such a box, while reasonable, is not the best possible prediction [33].
Class Labels
YOLO (version 3) uses sigmoid activations for multi-label classification, noting that
SoftMax (from the previous versions) is not necessary for good performance. This choice will
depend on your dataset and whether or not your labels overlap (e.g., "golden retriever" and
"dog").
Output Layer
YOLO (version 3) has 3 output layers. These output layers predict box coordinates at 3
different scales. The output prediction is of the form 𝑤𝑖𝑑𝑡ℎ × ℎ𝑒𝑖𝑔ℎ𝑡 × 𝑓𝑖𝑙𝑡𝑒𝑟𝑠.
YOLO (version 3) replaced the skip-connection splitting with a more standard feature pyramid network output structure. With this method, the network alternates between outputting a prediction and upsampling the feature maps (with skip connections). This allows for predictions
that can take advantage of finer-grained information from earlier in the network, which helps with detecting small objects.
Pytesseract
Tesseract is an open-source text recognition (OCR) Engine, available under the Apache
2.0 license. It can be used directly, or (for programmers) using an API to extract printed text from
images. It supports a wide variety of languages. Tesseract does not have a built-in graphical user
interface (GUI) [43]. Pytesseract is a Python wrapper for the Tesseract engine.
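A minimal usage sketch (assuming the Tesseract binary and the pytesseract and Pillow packages are installed; the file name is hypothetical):

from PIL import Image
import pytesseract

# Extract all printed text from an image as a single string.
text = pytesseract.image_to_string(Image.open("sample.png"))
print(text)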
Legacy Tesseract 3.x was dependent on the multi-stage process where we can differenti-
ate steps into word finding, line finding, and character classification. Word finding was done by
organizing text lines into blobs, and the lines and regions are analyzed for fixed pitch or propor-
tional text. Text lines are broken into words differently according to the kind of character spac-
ing. Recognition then proceeds as a two-pass process. In the first pass, an attempt is made to rec-
ognize each word in turn. Each word that is satisfactory is passed to an adaptive classifier as
training data. The adaptive classifier then gets a chance to more accurately recognize text lower down the page.
Modernization of the Tesseract tool was an effort on code cleaning and adding a new
Long Short-Term Memory (LSTM) model. The input image is processed in boxes (rectangle)
line by line feeding into the LSTM model and giving output.
After adding a new training tool and training the model with a lot of data and fonts, Tes-
seract achieves better performance. Still, it is not good enough to work on handwritten text and unusual fonts.
Tesseract does various image processing operations internally (using the Leptonica li-
brary) before doing the actual OCR. It generally does a very good job of this, but there will inev-
itably be cases where it is not good enough, which can result in a significant reduction in accu-
racy. Pre-processing steps include, but are not limited to, inverting images, rescaling, binarization,
noise removal, dilation and erosion, rotation/deskewing, and border removal [44].
To avoid the many ways in which Tesseract output quality can drop, you need to make sure the image is
appropriately pre-processed. Given that our data is custom generated, we typically had clear
and sharp images devoid of noise. As a result, we required little pre-processing.
A regular expression (regex or regexp for short) is a special text string for describing a
search pattern. You can think of regular expressions as wildcards on steroids. You are probably
familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex
equivalent is «.*\.txt» [45]. Regular expressions offer a limited but powerful metalanguage to
describe all kinds of formats, protocols, and other small textual languages. Regular expressions
arose in the context of formal language theory, and a primary use has been as part of scanners in
compilers. However, nowadays their applications extend far beyond those areas. For example,
regular expressions have been used in editors for structured text modification [46]. They are used
in network protocol analysis and for specifying events in a distributed system. Regular expres-
sions are used for virus detection using signature scanning, in mining the web, and as alternatives
types for XML data [46]. Moreover, there are many uses of regular expressions outside of com-
puter science, for example, in sociology (for characterizing events that led to placing of a child in
foster care) or biology (for finding DNA sequences) [46]. In addition to specific applications,
many generic types of information, such as phone numbers or dates, are often presented and ex-
changed in specific formats that are described using regular expressions [46].
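In our context, this amounts to pulling URL-like substrings out of OCR output. The sketch below is illustrative; the pattern shown is a simple one and is not the exact expression used in the thesis:

import re

URL_PATTERN = re.compile(r"https?://[^\s]+|www\.[^\s]+", re.IGNORECASE)

ocr_text = "Visit https://www.example.com or www.test.org for details"
print(URL_PATTERN.findall(ocr_text))   # ['https://www.example.com', 'www.test.org']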
Regular expressions, though powerful, are often problematic due to their complexity (they can be obscure and hard to understand and reuse), errors (subtle and hard-to-detect faults), and version
proliferation (making it hard to select the right one for a specific task). Some complex regular
expressions exceed 100 characters, while some grow to over 4,000 characters. This makes them difficult to read and maintain.
Overview
In real-life application, speed, accuracy, and resources tradeoffs must be made. The inter-
net is a great resource for different classes of trained data ready for use. Unfortunately, there is
not one class ready made for recognition of URLs in an image. Most tests and trainings are done
with datasets, and their results are measured in mean Average Precision (mAP) at Intersection
over Union (IoU) threshold. IoU measures the overlap between two regions.
In this Chapter, we will highlight the custom generation of our training data from our word bank, along with the training process and the evaluation metric used.
Data Preparation
A dataset is a collection of data. Image datasets are used to train and benchmark object
detection algorithms [40]. For the algorithm to be trained, we need to create images with URLs
in them and store the coordinates of the URLs. This would be used by the training algorithm to
know what is right (a URL) and use that to learn what is wrong (not a URL).
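The sketch below illustrates this general idea with Pillow; it is not the actual generator used in the thesis, and the URL bank, file names, and layout are placeholders. An image is created, a URL is drawn at a random position, and its bounding box is written out as a label:

import json
import random
from PIL import Image, ImageDraw

urls = ["https://www.example.com", "www.test.org"]   # stand-in URL bank

image = Image.new("RGB", (400, 400), "white")
draw = ImageDraw.Draw(image)

url = random.choice(urls)
x, y = random.randint(10, 150), random.randint(10, 380)
draw.text((x, y), url, fill="black")

# Pillow reports the rendered text's bounding box, which becomes the training label.
left, top, right, bottom = draw.textbbox((x, y), url)
label = {"url": url, "box": [left, top, right, bottom]}

image.save("generated_0001.png")
with open("generated_0001.json", "w") as f:
    json.dump(label, f)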
Python was our programming language of choice, and because YOLO is just a technique for
detection using convolutional neural networks, we made use of Python’s Keras library as well.
Figure 15. Word bank used to generate our training image data.
Figure 16 shows a sample image generated from our word bank. This process is known as
tagging, and ours is automated. A total of sixty thousand images were generated for this experiment, each having a varying number of URL(s) in them and some having no URL at all, giving the
algorithm a large enough sample size. A record of the x and y coordinates of any URL generated is also stored for training.
Figure 16. Sample generated image using our custom-built tagging tool.
Figure 17. Sample grid generated by YOLO for our detection algorithm.
Training
Training is the process of being taught the skills needed for a particular job or activity. In this context, it refers to the process of teaching
an algorithm the specific task for which it will be used. These algorithms also learn from
experience without being explicitly programmed. In our experiment, we labelled ten thousand
images, each with varying number of URLs and some with no URL at all. We trained our algo-
rithm using fifty thousand images. These images have varying font size. Our image is sectioned
into 21 × 21 grid cells (Figure 17), each of which is responsible for predicting K bounding boxes. This
grid size was selected because we would be working with only text. An object is considered to lie
in a specific cell only if the center co-ordinates of the anchor box lie in that cell. Due to this
property, the center co-ordinates are always calculated relative to the cell, whereas the height and
width are calculated relative to the whole image size. Using the Equation below, YOLO deter-
mines the probability that a cell contains a certain class. The class with the maximum probability
is chosen and assigned to that grid cell. This is repeated for all grid cells in the image. The probability score for class c in grid cell i is computed as:
𝑠𝑐𝑜𝑟𝑒𝑐,𝑖 = 𝑝𝑐 × 𝑐𝑖
Setting a 0.3 confidence filter and a 0.7 𝐼𝑜𝑈 threshold, we applied non-max suppression to discard overlapping boxes in favor of the best-scoring prediction. Batch normalization is then applied to add
noise to the inputs of every layer, discouraging overfitting and preventing our model from producing
deterministic values for a given training example. We later applied max pooling to down-sample
our input (the batch-normalized output), followed by our activation function. We settled on the Rectified Linear Unit (ReLU).
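A minimal sketch of this non-max suppression step (illustrative only; the 0.3 and 0.7 defaults mirror the values above, interpreted here as a confidence filter and an IoU threshold):

def box_iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def non_max_suppression(boxes, scores, score_thresh=0.3, iou_thresh=0.7):
    # Drop low-confidence boxes, then greedily keep the best box and discard heavy overlaps.
    candidates = sorted(
        (pair for pair in zip(scores, boxes) if pair[0] >= score_thresh), reverse=True)
    kept = []
    for score, box in candidates:
        if all(box_iou(box, kept_box) < iou_thresh for _, kept_box in kept):
            kept.append((score, box))
    return kept

print(non_max_suppression(
    [(0, 0, 10, 10), (0, 0, 10, 11), (50, 50, 60, 60)], [0.9, 0.8, 0.7]))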
At the end of the training process, we provided the trained algorithm with a new set of data
(with ten thousand images) to test our result. This new dataset is modelled like the training data.
Evaluation Metric
Intersection over Union (IoU) is an evaluation metric (between 0 and 1) that measures the
overlap between the boundaries of the ground truth of an annotation and the predicted boundary.
IoU evaluates whether a prediction is “good enough” [36]. The closer the prediction is to 1, the
closer to perfect it is. Figure 19 illustrates the graphical view of the equation below.
𝐼𝑜𝑈 = Area of Intersection (Overlap) / Area of Union
In general, when 𝐼𝑜𝑈 ≥ 0.5, we consider the prediction correct. Knowing IoU, we calcu-
late precision and recall. We also calculated the confidence score. Confidence score reflects how
likely the box contains an object and how accurate the bounding box is.
Figure 20. The higher the IoU, the better the performance [24].
Overview
The training of our object detection algorithm ran for 200 epochs. It took approximately 8 hours, making efficient use of a Core i7 machine with an NVIDIA GeForce GTX 1650 with Max-Q
Design GPU. During each epoch, the confidence metric improved over time, making the detections progressively more accurate.
Figure 21. (a) Object detection algorithm after 10 epochs (b) Object detection algorithm for an
image without any link (c) Improved Object detection algorithm after 200 epochs.
Detection Accuracy
Using Kaggle’s online training compute resources, we were able to achieve better accuracy. Running the training for 500 epochs brought the results close to perfect.
Figure 22. Accuracy between two Optical Character Recognition techniques and our model on
ten images using regular font style.
The OCR pattern 1 technique used is a Python library called “urlextract”, while for the
other pattern we used a regular expression. Before selecting the regular expression to use, we tried
several from Stack Overflow and chose the most accurate (see Figure 23).
The reason our result on image 1 (see Figure 24) detects an extra region is overfitting. However, if converted into text, the extra region would be quickly eliminated because of the nature
of the area of interest (see Figure 25). With improved tuning and more training, we should expect better results. Another consideration in how the OCR techniques were used is the Optical Character Recognition (OCR) accuracy itself. The OCR we used is
the pytesseract image-to-string library. It is a deep learning model that detects, recognizes, and
outputs text based on learned patterns. This technique could get worse in an unconstrained environment.
Figure 25. OCR image conversion to text versus our object detection model.
Font Style and Size
Font style and size significantly affect the visibility of text. This is also an important factor to consider when creating a computer vision model to recognize text. There are different categories of font styles; we will consider serifs and sans-serifs briefly. A serif is a decorative stroke that finishes off the end of a letter's stem (sometimes also called the “feet” of the letters), while a sans-serif font lacks these strokes.
The OCR software “won the battle” on italicized texts. This is simply because our model was trained strictly on serif and sans-serif fonts. Verdana (sans-serif), Arial (sans-serif), Segoe UI (sans-serif), Calibri (sans-serif), and Cambria (serif) font styles were used to train our model. The font size of our training images ranged between 14 and 18 pixels. Our algorithm achieved better accuracy when the test images contained text in a similar font style.
In Figure 27, our model was evaluated on about 100 images containing text in the Brush Script MT and Bauhaus font families, with font sizes between 24 and 28.
Figure 27. Detection Accuracy between Optical Character Recognition techniques and our
model on ten images using italicized font style.
Training our model on these variants of font styles (and font sizes) should fix the lower accuracy observed.
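As an illustrative sketch of how additional font variants could be rendered into training images (the font file path, canvas size, and sample URL below are assumptions, not the project's actual generator code):

from PIL import Image, ImageDraw, ImageFont

# Render a sample URL in a chosen font and size onto a blank 400 x 400 canvas.
img = Image.new("RGB", (400, 400), "white")
draw = ImageDraw.Draw(img)
font = ImageFont.truetype("arial.ttf", 16)       # assumed font file on the system
draw.text((20, 180), "Visit https://example.com", fill="black", font=font)
img.save("generated_sample.png")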
Input Size
Having trained on 400 x 400 input images, we tested our algorithm on larger image sizes of 1200 x 1200 and 600 x 600. Pytesseract OCR, by contrast, has been trained on a wide variety of input sizes and on italicized text in different font families, which gives it an advantage on these larger and more varied inputs.
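A minimal sketch of how a larger test image can be resized to the model's 400 x 400 input before inference (using OpenCV; the file name and the mapping of predicted boxes back to the original resolution are illustrative assumptions):

import cv2

original = cv2.imread("test_1200x1200.png")           # illustrative file name
h, w = original.shape[:2]

# Down-sample to the 400 x 400 resolution the model was trained on.
resized = cv2.resize(original, (400, 400), interpolation=cv2.INTER_AREA)

# A box predicted in the 400 x 400 frame can be mapped back to the original image.
def to_original(box, w=w, h=h):
    x1, y1, x2, y2 = box
    return (x1 * w / 400, y1 * h / 400, x2 * w / 400, y2 * h / 400)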
Chapter V: Conclusion
Overview
The title of this project is “Object Detection and Recognition Using YOLO: Detect and Recognize URL(s) in an Image Scene”. This translates into the use of object detection and recognition techniques to detect and recognize URL(s) in an image scene.
To achieve this, a custom YOLO algorithm is trained on fifty thousand images and evalu-
ated on ten thousand images. The algorithm employs a 21 x 21 grid. When set up this way, it al-
lows the bounding boxes generated to capture the objects properly with minimal chances of over-
fitting.
To generate training data, a text generator was programmed to create images that randomly contain both text and URLs, recording useful information about the location of the URLs in a text file in JSON format. About fifty thousand of these images are used to train the model.
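As an illustrative sketch only (the field names below are assumptions, not the project's exact annotation schema), a recorded location entry might look like this:

import json

# Hypothetical annotation record for one generated image.
annotation = {
    "image": "generated_00001.png",
    "urls": [
        {"text": "https://example.com", "x": 35, "y": 182, "width": 210, "height": 18}
    ],
}

with open("generated_00001.json", "w") as f:
    json.dump(annotation, f, indent=2)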
Application
The application of the algorithm is quite restricted, with its main use being the detection of URLs in an image scene. On its own, the application of the model is not wide and varied. However, in combination with other modifications and augmentations, this algorithm can prove to be very useful and beneficial. The project has a few applications in various fields. These are discussed below.
The model can be particularly useful in the field of tourism. Tourists can take a picture of a URL and have the information on the website displayed without having to type on their key-
board. Business owners, shop owners, and their customers can advertise and publicize information by leaving URLs to the services they offer on their advertisement spaces, and users can pick up the URL(s) at the click of a button. The model can also be used to capture reference URLs while listening to a presentation at a conference. The uniqueness of this model is that users need not waste valuable time, and it eliminates the possibility of not being able to completely type out a URL in time during a presentation. Given the capability of smartphone
cameras in recent times, taking a picture of a URL while in transit should result in a “not too
blurry” image which can be passed in as an input to our model and the URL(s) of interest re-
trieved.
The software can also be utilized at auctions that provide URL(s) to read more about an item. With this model, information may be passed by appending URL(s) containing further details about each item.
Another situation where this model can be used is in museums and zoos. Less space needs to be taken up when describing an item: a brief description and a link to the full description would make museums and zoos less congested with placards. If implemented in the form of a mobile application, visitors could retrieve the full description from a single picture.
As opposed to QR codes, this model does not require any special software to encode the URL. QR codes cannot be handwritten, but if our model is extended to handwriting recognition, it could handle handwritten URLs as well.
Limitations of Work
One of the major limitations of this work is handwriting detection. The data the algorithm is trained with is restricted entirely to digitally written text. As such, URLs written by hand cannot be detected.
Furthermore, the algorithm can only recognize horizontal text. As a result, its functionality is limited: vertical and diagonal text is impossible for the algorithm to detect.
Due to the type of data the algorithm has been trained with, it is also impossible for the algorithm to detect links that span multiple lines. This is because the bounding boxes it was trained on are restricted to a single line of text.
Future work
Implementing an OCR alongside our model would make it possible to convert URLs in a picture into clickable text. This would greatly reduce the need for specialized software like QR code generators and QR code detectors. A person would simply need to have a URL displayed, and anybody could capture it with a camera and follow the link.
The second improvement would be to attempt to increase the range of the URLs that can
be detected by the model. So far, our model was trained on slightly varying font sizes (see Figure 29). In general, font type should not create a significant difference in the accuracy of our model.
It would be great to detect vertically aligned URLs (although that is not a common case in real
life), and to detect URLs at different angles. Another case would be detecting URLs written
backwards.
Finally, the model could be trained on noisy environments and fonts to improve its ability to recognize a wider range of texts effectively.
References
[1] P. Chakravorty, "What Is a Signal? [Lecture Notes]," IEEE Signal Processing Magazine, vol. 35, no. 5, pp. 175-177, Sept. 2018, doi: 10.1109/MSP.2018.2832195.
[2] M. Rouse, "image" [Online]. Available at: https://whatis.techtarget.com/definition/image. [Accessed: 01/10/20]
[3] Merriam-Webster, "image" [Online]. Available at: https://www.merriam-webster.com/dictionary/image. [Accessed: 01/10/20]
[4] Wikipedia, "Image" [Online]. Available at: https://en.wikipedia.org/wiki/Image. [Accessed: 01/10/20]
[5] The Editors of Encyclopedia Britannica, "Image-processing" [Online]. Available at: https://www.britannica.com/technology/image-processing. [Accessed: 01/10/20]
[6] BBC Bitesize, "Encoding images" [Online]. Available at: https://www.bbc.co.uk/bitesize/guides/zqyrq6f/revision/3. [Accessed: 02/10/20]
[7] Wikipedia, "Object (image processing)" [Online]. Available at: https://en.wikipedia.org/wiki/Object_(image_processing). [Accessed: 02/10/20]
[8] P. Ganesh, "Object Detection: Simplified" [Online]. Available at: https://towardsdatascience.com/object-detection-simplified-e07aa3830954. [Accessed: 02/10/20]
[9] TensorFlow [Online]. Available at: https://www.tensorflow.org/lite/models/object_detection/overview. [Accessed: 03/10/20]
[10] Wikipedia, "Object detection" [Online]. Available at: https://en.wikipedia.org/wiki/Object_detection. [Accessed: 02/10/20]
[11] Fritz, "Object Detection Guide" [Online]. Available at: https://www.fritz.ai/object-detection/. [Accessed: 03/10/20]
[12] Wikipedia, "Outline of object recognition" [Online]. Available at: https://en.wikipedia.org/wiki/Outline_of_object_recognition. [Accessed: 02/10/20]
[13] Object Recognition [Online]. Available at: cse.usf.edu/~r1k/MachineVisionBook/MachineVision.files/MachineVision_Chapter15.pdf. [Accessed: 03/10/20]
[14] N. Pinto, D. D. Cox, and J. J. DiCarlo, "Why is Real-World Visual Object Recognition Hard?" (2008), PLoS Comput Biol 4(1): e27.
[15] A. Gulli and P. Sujit, "Deep Learning with Keras" (2017). Available at: https://1lib.us/book/3411804/7ea47a?id=3411804. [Accessed: 03/10/20]
[16] "The Definitive Glossary of Higher Mathematical Jargon - Algorithm". Available at: https://mathvault.ca/math-glossary/. [Accessed: 03/10/20]
[17] "Definition of ALGORITHM", Merriam-Webster Online Dictionary. Available at: https://www.merriam-webster.com/dictionary/algorithm. [Accessed: 04/10/20]
[18] Y. Gavrilova, "Artificial Intelligence vs. Machine Learning vs. Deep Learning: Essentials". Available at: https://serokell.io/blog/ai-ml-dl-difference. [Accessed: 04/10/20]
[19] J. Brownlee, "A Gentle Introduction to Object Recognition with Deep Learning" (2018). Available at: https://machinelearningmastery.com/object-recognition-with-deep-learning/. [Accessed: 01/10/20]
[20] A. Kamal, "YOLO, YOLOv2 and YOLOv3: All You want to know" (2019). Available at: https://medium.com/@amrokamal_47691/yolo-yolov2-and-yolov3-all-you-want-to-know-7e3e92dc4899. [Accessed: 04/10/20]
[21] "You Only Look Once: Unified, Real-Time Object Detection", 2015. Available at: https://arxiv.org/abs/1506.02640. [Accessed: 04/10/20]
[22] A. Rosebrock, "Intersection over Union (IoU) for object detection" (2016). Available at: https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/. [Accessed: 04/10/20]
[23] I. Tan, "Measuring Labelling Quality with IOU and F1 Score". Available at: https://medium.com/supahands-techblog/measuring-labelling-quality-with-iou-and-f1-score-1717e29e492f. [Accessed: 01/10/20]
[24] Stack Overflow, "Intersection Over Union (IoU) ground truth in YOLO". Available at: https://stackoverflow.com/questions/61758075/intersection-over-union-iou-ground-truth-in-yolo. [Accessed: 02/10/20]
[25] J. Redmon and A. Farhadi (University of Washington), "YOLO9000: Better, Faster, Stronger". Available at: https://pjreddie.com/media/files/papers/YOLO9000.pdf. [Accessed: 02/10/20]
[26] "YOLO v2 – Object Detection". Available at: https://www.geeksforgeeks.org/yolo-v2-object-detection/. [Accessed: 04/10/20]
[27] A. Aggarwal, "YOLO Explained". Available at: https://medium.com/analytics-vidhya/yolo-explained-5b6f4564f31. [Accessed: 04/10/20]
[28] K. Mahesh Babu and M. V. Raghunadh, "Vehicle number plate detection and recognition using bounding box method", May 2016, pp. 106-110.
[29] L. Cai, F. Jiang, W. Zhou, and K. Li (Fellow, IEEE), "Design and Application of An Attractiveness Index for Urban Hotspots Based on GPS Trajectory Data", p. 4.
[30] Wikipedia, "Minimum bounding box" [Online]. Available at: https://en.wikipedia.org/wiki/Minimum_bounding_box. [Accessed: 04/10/20]
[31] Dive into Deep Learning, "13.3. Object Detection and Bounding Boxes". Available at: https://d2l.ai/chapter_computer-vision/bounding-box.html. [Accessed: 04/10/20]
[32] J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement", University of Washington. Available at: https://arxiv.org/abs/1804.02767. [Accessed: 05/10/20]
[33] "YOLO: You Only Look Once". Available at: jeremyjordan.me/object-detection-one-stage/#yolo. [Accessed: 05/10/20]
[34] Longman Dictionary, definition of "training". Available at: https://www.ldoceonline.com/dictionary/training. [Accessed: 05/10/20]
[35] Guangrui Liu, "Real-Time Object Detection for Autonomous Driving Based on Deep Learning" (2017). Available at: https://tamucc-ir.tdl.org/handle/1969.6/5637. [Accessed: 03/18/21]
[36] A. Abdulkader and C. Vlahija, "Real-time vehicle and pedestrian detection, a data-driven recommendation focusing on safety as a perception to autonomous vehicles". Available at: http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1479957&dswid=-3676. [Accessed: 03/18/21]