YOLO-V1 in Python
Instructor: Jaskirat Singh
This document provides detailed, line-by-line documentation of a YOLO-V1 implementation in Python. The code begins by importing the necessary libraries, then defines an anchor_boxes function that generates anchor boxes for each grid cell based on a set of aspect ratios (used in YOLO for bounding-box prediction), followed by a centroid2minmax conversion function and the rest of the pipeline: IoU computation, ground-truth label construction, and the model itself. We begin with the imports and the anchor_boxes function.
1 Imports
    import pandas as pd
    import numpy as np
    from keras.applications import VGG16
    from keras.layers import Conv2D
    from keras.models import Model
    import pathlib
    import os
    from xml.etree import ElementTree as ET
    import cv2
    import matplotlib.pyplot as plt
    import json
    import ast
• pandas as pd, numpy as np: These imports are for data manipulation and numerical
operations. pandas is useful for handling structured data, while numpy helps with array
operations.
• keras.applications.VGG16, keras.layers.Conv2D, keras.models.Model: These im-
ports from Keras are used for constructing and modifying neural networks. VGG16 is a
pre-trained convolutional neural network, Conv2D is used to add convolution layers, and
Model allows defining the overall structure of the neural network.
• pathlib, os: These libraries are used for handling file paths and performing operations on
the filesystem.
• xml.etree.ElementTree as ET: Useful for parsing XML files, which are often used for
storing bounding box annotations.
• cv2: This is OpenCV, used for image processing and computer vision tasks.
• matplotlib.pyplot as plt: This library is used for visualizing images and other data
plots.
• json, ast: json is used to parse and generate JSON data, while ast (e.g., ast.literal_eval) safely evaluates strings containing Python literals.
2 Function: anchor_boxes

This function generates the anchor boxes for every grid cell. Its parameters are:

• image_size: A tuple representing the dimensions of the image (width, height, channels).
• grids_size: A tuple with the number of grid cells in each direction (cells along x, cells along y).
• aspect_ratios: A list of aspect ratios for the anchor boxes.

The function first unpacks image_size to get image_width and image_height; the third component (channels) is not needed here.
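The function header and the unpacking step are not shown in the excerpt; based on the parameter descriptions, they presumably read:

    def anchor_boxes(image_size, grids_size, aspect_ratios):
        # Only the width and height are used; the channel count is discarded.
        image_width, image_height, _ = image_size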
    grid_width = image_width // grids_size[0]
    grid_height = image_height // grids_size[1]
grid_width and grid_height: The dimensions of each grid cell, obtained by dividing the image dimensions by the number of grid cells in each direction.
    grid_center_x_start = grid_width // 2
    grid_center_x_end = int((grids_size[0] - 0.5) * grid_width)
    grid_center_x = np.linspace(grid_center_x_start, grid_center_x_end, grids_size[0])
• grid_center_x_start: The x-coordinate of the center of the first grid cell.
• grid_center_x_end: The x-coordinate of the center of the last grid cell.
• grid_center_x: An array of equally spaced x-coordinates for the centers of the grid cells.
    grid_center_y_start = grid_height // 2
    grid_center_y_end = int((grids_size[1] - 0.5) * grid_height)
    grid_center_y = np.linspace(grid_center_y_start, grid_center_y_end, grids_size[1])
These lines are analogous to the previous block but calculate the y-coordinates for the center of
each grid cell.
    grid_center_x_mesh, grid_center_y_mesh = np.meshgrid(grid_center_x, grid_center_y)
np.meshgrid: Generates a coordinate grid for the x and y centers. This is used to determine
the centers of all grid cells.
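As a small illustration of np.meshgrid (with hypothetical center coordinates):

    import numpy as np

    xs = np.array([10., 30., 50.])   # x-centers of 3 columns
    ys = np.array([20., 60.])        # y-centers of 2 rows
    gx, gy = np.meshgrid(xs, ys)
    # gx = [[10., 30., 50.],   gy = [[20., 20., 20.],
    #       [10., 30., 50.]]         [60., 60., 60.]]
    # Note the output shape is (len(ys), len(xs)), i.e. (rows, columns).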
    grid_center_x_mesh = np.expand_dims(grid_center_x_mesh, -1)
    grid_center_y_mesh = np.expand_dims(grid_center_y_mesh, -1)
np.expand_dims: Adds an extra trailing dimension to grid_center_x_mesh and grid_center_y_mesh so they broadcast easily in the tensor operations below.
    anchor_boxes_no = len(aspect_ratios)
anchor_boxes_no: The number of anchor boxes to create for each grid cell, one per aspect ratio.
    anchor_boxes_tensor = np.zeros((grids_size[0], grids_size[1], anchor_boxes_no, 4))
anchor_boxes_tensor: Initializes a tensor to store all anchor boxes. The shape (grids_size[0], grids_size[1], anchor_boxes_no, 4) covers the number of grid cells in x and y, the number of anchor boxes per cell, and the four coordinates (x_center, y_center, width, height).
    anchor_boxes_tensor[..., 0] = np.tile(grid_center_x_mesh, (1, 1, anchor_boxes_no))
    anchor_boxes_tensor[..., 1] = np.tile(grid_center_y_mesh, (1, 1, anchor_boxes_no))
Assigns the center coordinates (x_center, y_center) of each grid cell to every anchor box in that cell.
    anchor_box_width_height = list()
Initializes an empty list to store the width and height of anchor boxes.
    for aspect_ratio in aspect_ratios:
        anchor_box_width_height.append((158 * np.sqrt(aspect_ratio), 173 / np.sqrt(aspect_ratio)))
    anchor_box_width_height = np.array(anchor_box_width_height)
• The loop iterates through each aspect ratio, computing the width and height of the corresponding anchor box and appending the pair to the list, which is finally converted to a NumPy array.
• The constants 158 and 173 set the base width and height of the anchor boxes; multiplying the width by the square root of the aspect ratio and dividing the height by it varies the box shape while keeping the area (158 x 173) constant.
    anchor_boxes_tensor[..., 2] = anchor_box_width_height[:, 0]
    anchor_boxes_tensor[..., 3] = anchor_box_width_height[:, 1]
Assigns the calculated widths and heights to the corresponding positions in anchor_boxes_tensor.
    return anchor_boxes_tensor
Returns the final tensor of anchor boxes, which contains information about the center coordinates
and dimensions for each anchor box in every grid cell.
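Assembling the fragments above gives the complete function, approximately as follows. This is a reconstruction from the snippets shown, not the original file; note in particular that np.meshgrid returns arrays of shape (grids_size[1], grids_size[0]), so the np.tile assignments line up with the (grids_size[0], grids_size[1], ...) tensor only when the grid is square.

    import numpy as np

    def anchor_boxes(image_size, grids_size, aspect_ratios):
        # Unpack the image dimensions; the channel count is not needed.
        image_width, image_height, _ = image_size

        # Pixel size of each grid cell.
        grid_width = image_width // grids_size[0]
        grid_height = image_height // grids_size[1]

        # Equally spaced cell centers along x and y.
        grid_center_x = np.linspace(grid_width // 2,
                                    int((grids_size[0] - 0.5) * grid_width),
                                    grids_size[0])
        grid_center_y = np.linspace(grid_height // 2,
                                    int((grids_size[1] - 0.5) * grid_height),
                                    grids_size[1])

        # All (x, y) cell centers, with a trailing axis for broadcasting.
        # meshgrid output is (grids_size[1], grids_size[0]); this matches the
        # tensor below only for a square grid.
        grid_center_x_mesh, grid_center_y_mesh = np.meshgrid(grid_center_x, grid_center_y)
        grid_center_x_mesh = np.expand_dims(grid_center_x_mesh, -1)
        grid_center_y_mesh = np.expand_dims(grid_center_y_mesh, -1)

        anchor_boxes_no = len(aspect_ratios)
        anchor_boxes_tensor = np.zeros((grids_size[0], grids_size[1], anchor_boxes_no, 4))

        # Every anchor box in a cell shares that cell's center.
        anchor_boxes_tensor[..., 0] = np.tile(grid_center_x_mesh, (1, 1, anchor_boxes_no))
        anchor_boxes_tensor[..., 1] = np.tile(grid_center_y_mesh, (1, 1, anchor_boxes_no))

        # Base scale 158 x 173; the sqrt factors keep the area constant.
        anchor_box_width_height = np.array(
            [(158 * np.sqrt(r), 173 / np.sqrt(r)) for r in aspect_ratios])
        anchor_boxes_tensor[..., 2] = anchor_box_width_height[:, 0]
        anchor_boxes_tensor[..., 3] = anchor_box_width_height[:, 1]
        return anchor_boxes_tensor

For example, anchor_boxes((640, 480, 3), (7, 7), [0.5, 1.0, 2.0]) returns a tensor of shape (7, 7, 3, 4).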
Next, the centroid2minmax function is documented in detail, followed by the compute_IoU function.
3 Function: centroid2minmax
This function converts bounding-box coordinates from centroid format (x_center, y_center, width, height) to min-max format (x_min, y_min, x_max, y_max). The min-max format is often more convenient for intersection-over-union (IoU) calculations and for drawing bounding boxes.
    def centroid2minmax(anchor_boxes_centroid_tensor):
• Takes a tensor of bounding boxes in centroid format (anchor_boxes_centroid_tensor) of shape (grids_size[0], grids_size[1], anchor_boxes_no, 4).
    anchor_boxes_minmax_tensor = np.copy(anchor_boxes_centroid_tensor)
Creates a copy of the input tensor so the original centroid data stays intact. The x_center and y_center entries are converted to x_min and y_min by subtracting half of the width and height, respectively; that listing is not reproduced above (see the assembled sketch at the end of this section).
    anchor_boxes_minmax_tensor[..., 2] = anchor_boxes_minmax_tensor[..., 0] + (anchor_boxes_minmax_tensor[..., 2] // 2)
    anchor_boxes_minmax_tensor[..., 3] = anchor_boxes_minmax_tensor[..., 1] + (anchor_boxes_minmax_tensor[..., 3] // 2)
Converts the width and height entries to x_max and y_max by adding half of the width and height to the center coordinates (which, at this point, still occupy positions 0 and 1).
    return anchor_boxes_minmax_tensor
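The order of operations matters here: the max coordinates must be computed while positions 0 and 1 still hold the centers. A complete version consistent with the fragments might look like this (a sketch, since the min-coordinate lines are not reproduced above):

    import numpy as np

    def centroid2minmax(anchor_boxes_centroid_tensor):
        # Per-box layout in:  (x_center, y_center, width, height)
        # Per-box layout out: (x_min, y_min, x_max, y_max)
        t = np.copy(anchor_boxes_centroid_tensor)
        # Max corners first, while t[..., 0] and t[..., 1] still hold the centers.
        t[..., 2] = t[..., 0] + (anchor_boxes_centroid_tensor[..., 2] // 2)
        t[..., 3] = t[..., 1] + (anchor_boxes_centroid_tensor[..., 3] // 2)
        # Min corners, reading width/height from the untouched input tensor.
        t[..., 0] = t[..., 0] - (anchor_boxes_centroid_tensor[..., 2] // 2)
        t[..., 1] = t[..., 1] - (anchor_boxes_centroid_tensor[..., 3] // 2)
        return t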
4 Function: compute_IoU

This function computes the intersection over union (IoU) between the ground-truth boxes of an image and every anchor box, and derives the match mask used later for label construction.

    image_gt_bbox_centroid_coords = np.array(image_gt_bbox_coords)

Converts the input list of ground-truth bounding-box coordinates into a NumPy array.
    image_gt_bbox_centroid_coords[:, 0] = image_gt_bbox_centroid_coords[:, 0] + (image_gt_bbox_centroid_coords[:, 2] - image_gt_bbox_centroid_coords[:, 0]) // 2
    image_gt_bbox_centroid_coords[:, 1] = image_gt_bbox_centroid_coords[:, 1] + (image_gt_bbox_centroid_coords[:, 3] - image_gt_bbox_centroid_coords[:, 1]) // 2
Converts the ground-truth bounding-box coordinates from min-max format into centroid format.
• image_gt_bbox_centroid_coords[:, 0]: Updates x_center by adding half the width to x_min.
• image_gt_bbox_centroid_coords[:, 1]: Updates y_center by adding half the height to y_min.
    image_gt_bbox_centroid_coords[:, 2] = (image_gt_bbox_centroid_coords[:, 2] - image_gt_bbox_centroid_coords[:, 0])
    image_gt_bbox_centroid_coords[:, 3] = (image_gt_bbox_centroid_coords[:, 3] - image_gt_bbox_centroid_coords[:, 1])
Calculates the width and height by subtracting the minimum coordinates from the maximum. Note, however, that positions 0 and 1 were already overwritten with the center coordinates in the previous step, so as written these subtractions yield roughly half of the width and height; recovering the full width and height would require the original minimum coordinates.
    IoU_tensor = np.zeros((len(image_gt_bbox_coords), anchor_boxes_minmax_tensor.shape[0], anchor_boxes_minmax_tensor.shape[1], anchor_boxes_minmax_tensor.shape[2]))
    bbox_present_idxes = [[]] * len(image_gt_bbox_coords)
    IoU_thresh = 0.25
• IoU_tensor: Initializes a tensor to store IoU values between each ground-truth box and each anchor box.
• bbox_present_idxes: Keeps track of which anchor boxes are associated with ground-truth bounding boxes (see the construction caveat demonstrated below).
• IoU_thresh: Threshold for determining a positive match between an anchor box and a ground-truth bounding box.
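One caveat worth flagging: [[]] * n creates n references to the same inner list, so an append through any entry appears in all of them. If each ground-truth box is meant to get its own list, a comprehension avoids the aliasing:

    rows = [[]] * 3
    rows[0].append(1)
    print(rows)                    # [[1], [1], [1]] - all entries alias one list

    rows = [[] for _ in range(3)]  # three independent lists
    rows[0].append(1)
    print(rows)                    # [[1], [], []]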
A nested loop iterates over each ground-truth box (i) and each anchor box (j).

• centroid_x_condition_anchor_boxes and centroid_y_condition_anchor_boxes: These lines, shown in full below, check whether the centroid of the ground-truth box lies within the bounds of each anchor box.
In summary, the loop evaluates the overlap between each ground-truth box and each anchor box to compute the IoU, which is then used for loss calculation and object-detection evaluation. The remaining components of the implementation, including loss calculation, prediction, and model construction, are documented in the sections that follow.
    centroid_x_condition_anchor_boxes = ((image_gt_bbox_centroid_coords[i, 0] > anchor_boxes_minmax_tensor[:, :, j, 0]) &
                                         (image_gt_bbox_centroid_coords[i, 0] < anchor_boxes_minmax_tensor[:, :, j, 2]))
    centroid_y_condition_anchor_boxes = ((image_gt_bbox_centroid_coords[i, 1] > anchor_boxes_minmax_tensor[:, :, j, 1]) &
                                         (image_gt_bbox_centroid_coords[i, 1] < anchor_boxes_minmax_tensor[:, :, j, 3]))
    grid_cells_idxes = np.argwhere(centroid_x_condition_anchor_boxes & centroid_y_condition_anchor_boxes)
    bbox_present_idxes[i].append(grid_cells_idxes)
• centroid_x_condition_anchor_boxes and centroid_y_condition_anchor_boxes: These lines test whether the centroid (x_center and y_center) of the ground-truth bounding box lies within the bounds of the anchor boxes, using logical conditions on the min and max x/y coordinates.
• grid_cells_idxes: Uses np.argwhere to obtain the indices where both centroid conditions hold, i.e., the grid cells whose j-th anchor box contains the centroid of the ground-truth box.
• bbox_present_idxes[i].append(grid_cells_idxes): Stores the indices for further processing.
    xmin_intersection = np.maximum(image_gt_bbox_coords[i][0], anchor_boxes_minmax_tensor[:, :, j, 0])
    ymin_intersection = np.maximum(image_gt_bbox_coords[i][1], anchor_boxes_minmax_tensor[:, :, j, 1])
    xmax_intersection = np.minimum(image_gt_bbox_coords[i][2], anchor_boxes_minmax_tensor[:, :, j, 2])
    ymax_intersection = np.minimum(image_gt_bbox_coords[i][3], anchor_boxes_minmax_tensor[:, :, j, 3])
• xmin_intersection and ymin_intersection: The intersection box's minimum x and y coordinates, obtained by taking the maximum of the corresponding coordinates of the ground-truth box and the anchor box.
• xmax_intersection and ymax_intersection: Similarly, the intersection box's maximum x and y coordinates, obtained by taking the minimum of the corresponding coordinates.
    intersection_width = np.maximum(0, (xmax_intersection - xmin_intersection))
    intersection_height = np.maximum(0, (ymax_intersection - ymin_intersection))
    intersection_area = intersection_width * intersection_height
• intersection_width and intersection_height: The width and height of the intersection box; if the boxes do not overlap, these values are clamped to 0.
• intersection_area: The area of the intersection, the product of its width and height.
    image_gt_bbox_area = image_gt_bbox_centroid_coords[i, 2] * image_gt_bbox_centroid_coords[i, 3]
    anchor_boxes_width = (anchor_boxes_minmax_tensor[:, :, j, 2] - anchor_boxes_minmax_tensor[:, :, j, 0])
    anchor_boxes_height = (anchor_boxes_minmax_tensor[:, :, j, 3] - anchor_boxes_minmax_tensor[:, :, j, 1])
• image_gt_bbox_area: The area of the ground-truth bounding box.
• anchor_boxes_width and anchor_boxes_height: The width and height of the anchor boxes, needed for the union term of the IoU calculation.
    union_area = ((anchor_boxes_width * anchor_boxes_height) + image_gt_bbox_area) - intersection_area
    IoU_tensor[i, :, :, j] = intersection_area / union_area
• union_area: The area of the union of the anchor box and the ground-truth box, computed as the sum of the two areas minus the intersection area.
• IoU_tensor: Stores the calculated IoU (intersection over union) for each ground-truth box against each anchor box; a worked example follows below.
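A quick numeric check with hypothetical boxes: a ground-truth box (0, 0, 100, 100) and an anchor box (50, 50, 150, 150), both in min-max format, overlap in a 50 x 50 region:

    import numpy as np

    gt = np.array([0, 0, 100, 100])         # x_min, y_min, x_max, y_max
    anchor = np.array([50, 50, 150, 150])

    xmin_i = np.maximum(gt[0], anchor[0])   # 50
    ymin_i = np.maximum(gt[1], anchor[1])   # 50
    xmax_i = np.minimum(gt[2], anchor[2])   # 100
    ymax_i = np.minimum(gt[3], anchor[3])   # 100

    intersection = max(0, xmax_i - xmin_i) * max(0, ymax_i - ymin_i)  # 2500
    union = 100 * 100 + 100 * 100 - intersection                      # 17500
    print(intersection / union)             # ~0.143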
A subsequent line (not reproduced in this excerpt) appends the indices of the anchor boxes whose IoU is greater than 0, indicating that they overlap with the ground-truth bounding box.
    IoU_tensor_reduced = np.max(IoU_tensor, axis=0)
    anchor_boxes_gt_mask = np.float64(IoU_tensor_reduced > IoU_thresh)
• IoU_tensor_reduced: Collapses the IoU tensor along the axis corresponding to ground-truth boxes, keeping the maximum IoU per anchor box.
• anchor_boxes_gt_mask: A binary mask in which anchor boxes with an IoU above the threshold (IoU_thresh = 0.25) are set to 1. This mask determines which anchor boxes count as "positive matches"; a small demonstration follows below.
    return image_gt_bbox_centroid_coords, anchor_boxes_gt_mask, bbox_present_idxes, IoU_tensor_reduced
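For illustration, thresholding a small hypothetical IoU_tensor_reduced at IoU_thresh = 0.25:

    import numpy as np

    IoU_tensor_reduced = np.array([[0.10, 0.30],
                                   [0.50, 0.20]])
    anchor_boxes_gt_mask = np.float64(IoU_tensor_reduced > 0.25)  # same as .astype(np.float64)
    # [[0. 1.]
    #  [1. 0.]]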
5 Normalizing the ground-truth bounding boxes

The next function normalizes the ground-truth boxes with respect to the anchor boxes and the image dimensions. Its parameters are:

• image_size: The dimensions of the image (width, height, channels).
• bbox_present_idxes: Indices of grid cells where bounding boxes are present.
• image_gt_bbox_centroid_coords: Ground-truth bounding-box coordinates in centroid format.
• anchor_boxes_minmax_tensor: Tensor containing the anchor boxes' min-max coordinates.

Inside the function:

• image_width, image_height: Extract the width and height of the image from image_size.
• normalized_image_gt_bbox_coords: Initializes a tensor of the same shape as anchor_boxes_minmax_tensor to store the normalized coordinates.
A nested loop then iterates through all ground-truth bounding boxes (i) and anchor boxes (j) to normalize each; a sketch of the full function follows at the end of this section.

• idx = bbox_present_idxes[i][j]: Retrieves the indices of the grid cells related to the current ground-truth bounding box.
• The following lines then normalize the x and y centroid coordinates, as well as the width and height:
  – normalized_image_gt_bbox_coords[..., 0]: Normalizes x_center by dividing by the anchor box width.
  – normalized_image_gt_bbox_coords[..., 1]: Normalizes y_center by dividing by the anchor box height.
  – normalized_image_gt_bbox_coords[..., 2]: Normalizes the width by dividing by the image width.
  – normalized_image_gt_bbox_coords[..., 3]: Normalizes the height by dividing by the image height.
    return normalized_image_gt_bbox_coords
Returns the normalized coordinates tensor for each grid cell and anchor box.
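The listing for this function is not reproduced in the excerpt. A sketch consistent with the descriptions above might look as follows (the function name normalize_gt_bbox_coords and the exact loop structure are assumptions):

    import numpy as np

    def normalize_gt_bbox_coords(image_size, bbox_present_idxes,
                                 image_gt_bbox_centroid_coords,
                                 anchor_boxes_minmax_tensor):
        image_width, image_height = image_size[0], image_size[1]
        normalized = np.zeros_like(anchor_boxes_minmax_tensor)

        for i in range(len(bbox_present_idxes)):
            for j in range(len(bbox_present_idxes[i])):
                idx = bbox_present_idxes[i][j]   # grid cells matched to box i, anchor j
                rows, cols = idx[:, 0], idx[:, 1]
                anchor_w = (anchor_boxes_minmax_tensor[rows, cols, j, 2]
                            - anchor_boxes_minmax_tensor[rows, cols, j, 0])
                anchor_h = (anchor_boxes_minmax_tensor[rows, cols, j, 3]
                            - anchor_boxes_minmax_tensor[rows, cols, j, 1])
                # Centers normalized by the anchor box size,
                # width/height normalized by the image size.
                normalized[rows, cols, j, 0] = image_gt_bbox_centroid_coords[i, 0] / anchor_w
                normalized[rows, cols, j, 1] = image_gt_bbox_centroid_coords[i, 1] / anchor_h
                normalized[rows, cols, j, 2] = image_gt_bbox_centroid_coords[i, 2] / image_width
                normalized[rows, cols, j, 3] = image_gt_bbox_centroid_coords[i, 3] / image_height
        return normalized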
6 Function: create_gt_labels_tensor
This function creates a tensor of ground truth labels for training, combining the bounding box
information with class labels and confidence scores. This is crucial for the YOLO model as it
allows the network to learn both the object class and the bounding box predictions.
    def create_gt_labels_tensor(normalized_image_gt_bbox_coords, IoU_tensor, bbox_present_idxes, image_cls_labels, num_classes):
• normalized_image_gt_bbox_coords: The tensor of normalized bounding-box coordinates.
• IoU_tensor: The IoU values between anchor boxes and ground-truth boxes.
• bbox_present_idxes: Indices of the grid cells associated with each bounding box.
• image_cls_labels: List of class labels for each ground-truth box.
• num_classes: Number of classes in the dataset.
cls_probabilities_tensor: Initializes a tensor (its listing is not shown here) to store the one-hot encoded class probabilities for each grid cell.
    for i in range(len(bbox_present_idxes)):
        idx_0 = bbox_present_idxes[i][0]
        idx_1 = bbox_present_idxes[i][1]
        cls_probabilities_tensor[idx_0[:, 0], idx_0[:, 1], :] = np.eye(num_classes, num_classes)[image_cls_labels[i]]
        cls_probabilities_tensor[idx_1[:, 0], idx_1[:, 1], :] = np.eye(num_classes, num_classes)[image_cls_labels[i]]
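For each ground-truth box, this loop writes the box's one-hot class vector into every matched grid cell. The indexing trick np.eye(num_classes, num_classes)[label] selects the one-hot row for a given class label:

    import numpy as np

    num_classes = 4
    label = 2
    print(np.eye(num_classes, num_classes)[label])  # [0. 0. 1. 0.]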
    gt_labels_tensor = np.copy(normalized_image_gt_bbox_coords)
    confidence_scores = np.expand_dims(IoU_tensor, -1)
    gt_labels_tensor = np.concatenate((gt_labels_tensor, confidence_scores), axis=3)
• confidence_scores: Adds a trailing dimension to IoU_tensor so the IoU can serve as the confidence score of each bounding box.
• The confidence scores are then concatenated to the ground-truth tensor along the last axis.
    gt_labels_tensor = gt_labels_tensor.reshape(gt_labels_tensor.shape[0], gt_labels_tensor.shape[1], gt_labels_tensor.shape[2] * gt_labels_tensor.shape[3])
    gt_labels_tensor = np.concatenate((gt_labels_tensor, cls_probabilities_tensor), axis=2)
• gt_labels_tensor.reshape: Flattens the per-anchor bounding-box coordinates and confidence scores into a single axis per grid cell, matching the final output shape.
• Concatenating cls_probabilities_tensor appends the class probabilities to gt_labels_tensor, forming the final label tensor for training; a shape example follows below.
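To make the shapes concrete (hypothetical sizes: a 7 x 7 grid, 3 anchor boxes, and 20 classes), the (7, 7, 3, 5) tensor of coordinates plus confidence flattens to (7, 7, 15), and appending the class probabilities gives (7, 7, 35):

    import numpy as np

    boxes_conf = np.zeros((7, 7, 3, 5))   # 4 coordinates + 1 confidence per anchor
    cls_probs = np.zeros((7, 7, 20))      # one-hot class vector per cell

    flat = boxes_conf.reshape(7, 7, 3 * 5)
    gt_labels = np.concatenate((flat, cls_probs), axis=2)
    print(gt_labels.shape)                # (7, 7, 35)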
7 Building the model: the VGG16 backbone

• vgg16 = VGG16(include_top=False, weights='imagenet', input_shape=(480, 640, 3)):
  – Initializes a VGG16 model without its fully connected layers (include_top=False). It accepts an input shape of (480, 640, 3), representing an image 480 pixels high and 640 pixels wide with 3 color channels (RGB).
  – The weights are loaded from the pre-trained ImageNet model.
• vgg16.trainable = False:
  – Freezes the VGG16 layers to prevent them from being updated during training, thus using the network as a fixed feature extractor.
• input_to_vgg16 = vgg16.input:
  – Stores the input layer of the VGG16 model.
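Based on the description above, the corresponding code presumably looks like this:

    from keras.applications import VGG16

    # VGG16 without its fully connected head, used as a frozen feature extractor.
    vgg16 = VGG16(include_top=False, weights="imagenet", input_shape=(480, 640, 3))
    vgg16.trainable = False
    input_to_vgg16 = vgg16.input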