The Applications of Machine Learning and Computer Vision Algorithms To Aid People With Vision Impairment.
Introduction
Computer vision is a field of machine learning¹ that has revolutionised the way machines understand and interpret visual information, and its popularity has soared in recent years. However, individuals with vision impairment face numerous challenges in their daily lives, and the number of blind people worldwide has risen from 34.4 million in 1990 to 49.1 million in 2020 (Bourne, R.R.A., June 2020), an increase of roughly 43%. This highlights the urgent need for innovative solutions to improve their accessibility and quality of life.
One of the most significant ways in which machine learning algorithms can help people with vision impairments is in navigation, where they can provide detailed spatial awareness and orientation. This is particularly useful in crowded or unfamiliar spaces, where traditional navigation aids, such as a guide dog, can be limiting in terms of mobility. According to a study by Baxter and Beresford on the effectiveness of assistance dogs in public buildings, 90% of responses classified guide dogs as a "nuisance" that caused "hygiene issues" (Baxter & Beresford, 2016). These limitations further exacerbate the difficulties that blind people encounter when navigating crowded or unfamiliar spaces. Blind people do learn alternative skills to navigate, such as interpreting echoes, texture changes and auditory cues; however, these skills take substantial time to learn and use confidently, so computer vision technology can be utilized to recognize and warn of hazards such as obstacles, further increasing their ability to navigate safely. This paper explores methods to bridge the gap between visual information and non-visual perception, offering an encouraging path towards creating a more inclusive and empowering society.
Volunteering
Volunteering at a blind school, I was given the incredible opportunity to witness first-hand how hard-working staff accomplished remarkable tasks each day. After closely observing the challenges encountered by visually impaired individuals, I carefully evaluated the technology solutions currently deployed and noted areas for potential improvement. Current assistive technologies are geared mostly towards educational aspects, such as Braille readers and text-to-speech conversion; however, they fail to tackle the basic problem of navigating around a classroom independently. I plan to adapt my program to meet these challenges by including object detection algorithms, robust depth estimation techniques and efficient optical character recognition methods. My goal is to develop a comprehensive solution that allows the visually impaired to be more independent by integrating these components.
The Program
A proposed solution to address these challenges involves developing wearable 'smart glasses', similar to what Google unveiled in 2014, but tailored specifically to help individuals with vision impairment. Ideally, this combination of algorithms would work collaboratively to enhance the wearer's visual perception, helping them identify objects, understand their surroundings, estimate distances, and recognize written text all in one program; however, due to limitations in computing power, simultaneous execution of these machine learning models is not feasible. The program is coded in Python because of its vast community and 3rd-party packages such as OpenCV, which is used for image processing. This paper presents each algorithm separately, analysing its functionality and performance.
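As a rough illustration of how frames would reach these models, the following minimal sketch (an assumption for illustration, not the paper's actual code) uses OpenCV to read frames from a camera one at a time:

```python
import cv2

def frames():
    """Yield frames from the glasses' camera one at a time."""
    cap = cv2.VideoCapture(0)            # assumed default camera index
    while cap.isOpened():
        ok, frame = cap.read()           # frame is a BGR NumPy array
        if not ok:
            break
        # Each model (object detection, depth estimation or OCR) would be
        # run on this frame in turn, since they cannot execute simultaneously.
        yield frame
    cap.release()
```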
Object Detection
¹ Machine Learning is the field of computing related to giving "computers the ability to learn without being explicitly programmed." – Arthur Samuel (1959)
Object detection works by receiving an input in the form of an image from a camera. Next, it analyses the input using a neural network² to extract relevant features, such as shapes, textures and colours, that help identify objects. The algorithm then identifies potential regions of interest (ROIs). Each region passes through a classification algorithm, which compares it against the pretrained images and labels and assigns it a confidence score³. If this confidence score is greater than a certain probability, a bounding box is formed around the region and post-processing takes place⁴ (Ali, S - 2023).
Figure 1 shows the basic object detection model. However, this model only detects whether a frame contains an object or not, giving a binary yes/no output. This is taken further here by attaching the correct label to the detected object.
Figure 1 – Sharma, K.U. and Thakur, N.V. (2017) A review and an approach for object detection in images.
The object detection model does not use a standard convolutional neural network (CNN) followed by a fully connected layer because the length of the output layer is not constant (Kundu, R 2023). There can be multiple instances of the same object in the same frame, e.g., multiple people walking in front of the impaired person. A method to address this challenge is to use a network that analyses multiple ROIs within the frame to detect multiple instances. However, this approach has one major limitation: significant computational complexity. It would require many regions to be selected, leading to a potential overload in computational power (Gandhi, R - 2018). As a result, the program made use of established 3rd-party algorithms such as You Only Look Once (YOLO), R-CNN and SSD300, which have been developed for high accuracy and performance.
² A neural network is a method in artificial intelligence that teaches computers to process data in a way that is inspired by the human brain. (Amazon AWS)
³ A value between 0 and 1 giving the probability that a region contains an object.
⁴ Post-processing ensures that the output is accurate and redundant data is removed.
Figure 2 - Graph showing speed vs accuracy of different object detection algorithms, available at https://cv-tricks.com/object-detection/faster-r-cnn-yolo-ssd
The YOLO model was implemented because of its faster processing speed whilst maintaining high accuracy. Although Faster R-CNN has the highest accuracy, its speed is considerably slower. In this specific application, speed was prioritised, as dropped frames could have life-threatening consequences for the impaired user.
The YOLO algorithm takes an image as an input and then uses a convolutional neural network to detect objects in
the frame. The architecture of the deep neural network is shown in Figure 3 (Kundu, R 2023).
The first 20 layers of the model are trained using ImageNet [1], which includes a temporary average pooling and a fully connected layer (Kundu, R 2023). This pre-trained model is then adapted for object detection by adding convolutional and connected layers, which improves accuracy and performance. Finally, the last fully connected layer is responsible for predicting both class probabilities and bounding box coordinates (Redmon, J et al. - 2016).
YOLO divides the input image into a grid. If the centre of an object falls within a grid cell, that cell becomes responsible for detecting the object. Each grid cell predicts bounding boxes and confidence scores (Keita, Z – Sep 2022).
Another benefit of YOLO is that it predicts multiple bounding boxes. A crucial technique which YOLO uses to handle these multiple bounding boxes is non-maximum suppression (NMS), which serves as a post-processing step to enhance the accuracy of object detection (Kundu, R 2023). Bounding boxes may overlap or be located at different positions, yet they all represent the same object, so NMS is utilized to identify and eliminate redundant bounding boxes (Redmon, J, 2016).
The first step divides the original image into equal grid cells. Each cell is used to detect objects as well as assign a
confidence score.
The next step is to determine the bounding boxes. At this stage the algorithm accounts for the confidence score of each cell. In Figure 5, the yellow cells are those where the probability that the cell contains an object is greater than 0.
Figure 5 Grid cell layout with probability greater than 0 (Author’s own - 2023)
YOLO represents each detected object as a vector of the form Y = [PC, BX, BY, BH, BW, C1]. PC is the probability that the grid cell contains an object. BX and BY are the x and y coordinates of the centre of the bounding box, BH and BW are the height and width of the box, and C1 corresponds to the class of the object, which in this case is 'bin' (Keita, Z – Sep 2022).
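As an illustration of this output format, the following sketch decodes a single prediction vector of the form described above; the helper function and the 0.5 threshold are assumptions for demonstration, not code from the actual program:

```python
def decode_prediction(y, conf_threshold=0.5):
    """Interpret one YOLO-style vector Y = [pc, bx, by, bh, bw, c1]."""
    pc, bx, by, bh, bw, c1 = y
    if pc < conf_threshold:
        return None                       # the cell is unlikely to contain an object
    # Convert centre/width/height form into corner coordinates for drawing.
    x1, y1 = bx - bw / 2, by - bh / 2
    x2, y2 = bx + bw / 2, by + bh / 2
    return {"box": (x1, y1, x2, y2), "confidence": pc, "class": int(c1)}

print(decode_prediction([0.91, 0.45, 0.60, 0.30, 0.20, 0]))   # a confident 'bin' detection
```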
A single object can have multiple candidate grid boxes, but not all of them are significant. This is where intersection over union (IOU) comes in. The objective of IOU is to filter out irrelevant grid boxes. YOLO calculates the IOU for each grid cell with the equation (Mohamed, N.A - 2020):

IOU = \frac{\text{area of intersection}}{\text{area of union}}

Finally, it disregards the predictions whose IOU score is less than the defined threshold.
Figure 8 shows the intersection of two grid boxes (marked in yellow). The IOU for Grid Box 1 is smaller than the threshold value, which has been defined as 0.7, so it is disregarded and only the object in Grid Box 2 remains.
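A minimal IOU helper matching the equation above might look as follows (an illustrative sketch, not YOLO's internal implementation):

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Area of the overlapping region (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    # Area of the union of both boxes.
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - intersection)
    return intersection / union if union > 0 else 0.0
```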
The final step is NMS. NMS keeps only the boxes with the highest confidence scores and discards the remaining, potentially incorrect, detections in the frame (Keita, Z – Sep 2022).
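Reusing the iou() helper from the previous sketch, a compact illustration of NMS (again an assumption rather than the program's own code) is:

```python
def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the highest-confidence boxes and drop overlapping duplicates."""
    detections = sorted(detections, key=lambda d: d["confidence"], reverse=True)
    kept = []
    for det in detections:
        # Discard this box if it overlaps too much with an already-kept box.
        if all(iou(det["box"], k["box"]) < iou_threshold for k in kept):
            kept.append(det)
    return kept
```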
Labelled Data
Training a robust object detection algorithm requires a large amount of accurately labelled data, and building such a dataset is time-consuming and labour-intensive. To address this challenge, I investigated the minimum number of images required per object for effective training. I used dataset sizes of 25, 50, 200 and 500 images to gauge the impact on model accuracy. 150 epochs⁵ were used for every dataset, with the same model. Upon analysing the results, I discovered that the number of training images had a direct correlation with the model's accuracy. With only 25 images, the algorithm struggled to detect objects accurately, and false positives and missed detections occurred often. By increasing the dataset to 50 images, the model's accuracy improved noticeably, although occasional false negatives and misclassifications still occurred. Interestingly, the model's accuracy did not change much between 200 and 500 images. The diversity of the 200-image dataset and the law of diminishing returns⁶ likely played a role in reaching this plateau.
⁵ Used to describe the number of times a learning algorithm has iterated through a dataset (Alibaba Cloud).
Furthermore, the model that used 500 images took almost twice as long to train. Considering the time and computational resources required for training, this becomes an important factor; it was therefore concluded that the 200-image dataset was sufficient to achieve satisfactory performance.
Due to the impracticality of collecting hundreds of images for multiple objects, it was decided to focus on a single object to demonstrate a proof of concept for this specific application. 200 images of bins were downloaded from the internet and bounding boxes were manually drawn around the bins. Images in various settings were used to ensure that the model generalises well. The data was split into a 90:10 ratio, allowing the model to be trained on the majority of the data while reserving a separate portion to evaluate its accuracy and performance. Finally, the model was tested in a real-world scenario by feeding a custom image containing a bin into the program:
Figure 8 shows that the bins were correctly detected by the program. This program can be taken further by adding more objects, or even by using a 3rd-party dataset such as COCO, a large dataset containing over 330,000 images annotated across 80 object classes. However, when this was implemented with YOLO the performance dropped significantly; as a result, creating a custom dataset containing the most common objects a visually impaired person encounters would be more computationally beneficial.
⁶ The law of diminishing marginal returns states that there comes a point when an additional factor of production results in a lessening of output or impact. (Investopedia)
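A minimal sketch of the 90:10 split described above is shown below; the folder layout and file names are assumptions based on common YOLO tooling rather than details from the paper:

```python
import random
from pathlib import Path

# Assumed folder of 200 labelled bin images (one label file per image).
images = sorted(Path("dataset/images").glob("*.jpg"))
random.seed(42)          # make the split reproducible
random.shuffle(images)

split = int(0.9 * len(images))          # 90:10 train/validation ratio
train_imgs, val_imgs = images[:split], images[split:]
print(f"{len(train_imgs)} training images, {len(val_imgs)} validation images")
```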
Depth Estimation
The first step of the algorithm, like most computer vision programs, is to receive an input image as RGB (red, green, blue) values, which represent colours. The next step is feature extraction, which involves extracting valuable pieces of information such as textures, shadows and edges. The algorithm is trained on a large dataset of RGB images along with their corresponding depth maps. The depth maps are the ground-truth values; they provide the accurate depth information for each image. During training, the model optimizes its parameters to minimize the difference between the predicted depth and the ground-truth depth (Tan, D - 2020). Once the model has been trained, an unseen image can be input; the model applies the learned mapping to the features extracted from the input image and produces an estimated depth map.
A lidar sensor is a device that generates precise spatial information about the distance to a given target. It works by sending out a laser pulse and recording the time it takes for the pulse to be reflected. Lidar is often classed as the most efficient and accurate method for depth estimation, and many autonomous vehicles and coastal-mapping projects make use of this technology; however, these powerful sensors come with a significant caveat: they cost upwards of $1,000⁷ (Carter, J - 2012). A machine learning based approach, on the other hand, is far more cost-effective. It can be applied to a wide range of scenarios and environments without the need for specific hardware, which significantly reduces the cost of implementing depth estimation systems and makes them accessible to a broader user base.
⁷ Data from https://www.neuvition.com/media/blog/lidar-price.html
Dataset Collection
Instead of going through the time-consuming process of creating a dataset from scratch, the program made use of the DIODE (Dense Indoor and Outdoor DEpth) dataset, which consists of diverse, high-resolution images with accurate and dense depth measurements. The dataset has an 81 GB training set and a separate 2.6 GB validation set (Vasiljevic, I et al - 2019). Once the data was prepared, a data pipeline was built which takes a data frame containing all the image and depth-mask files. The pipeline reads and resizes the input image, processes the depth-mask files and returns the RGB images and depth maps for a batch (Basu, V - 2021).
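The following sketch illustrates the kind of pipeline step described above; the image size, file paths and the .npy depth format are assumptions for illustration:

```python
import numpy as np
import cv2

IMG_SIZE = 256  # assumed resize target

def load_pair(image_path, depth_path):
    # Read and resize the RGB image, scaling pixel values to [0, 1].
    rgb = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    rgb = cv2.resize(rgb, (IMG_SIZE, IMG_SIZE)).astype("float32") / 255.0

    # Read and resize the matching dense depth map (stored here as a .npy file).
    depth = np.load(depth_path).squeeze()
    depth = cv2.resize(depth, (IMG_SIZE, IMG_SIZE)).astype("float32")
    depth = depth / depth.max()             # simple normalisation to [0, 1]
    return rgb, depth[..., None]            # add a channel axis for the model
```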
The Model
After building the data pipeline and visualising the samples, the model was built. The basic model is a U-Net, with additive skip-connections implemented in the downscaling block. The U-Net architecture consists of an encoder-decoder structure which is well suited to tasks like depth estimation, as it combines a contracting path (downscaling blocks) and an expanding path (upscaling blocks) to capture both local and global information in the input image (Bharath, K - 2021) & (Tomar, N - 2021).
The downscaling block takes an image tensor⁸ as an input and performs a series of operations to reduce its spatial dimensions while increasing the number of filters⁹. The constructor (the __init__ method) initialises the necessary layers and parameters for the block. The call method is the forward pass of the block: it applies two sets of convolutional layers (convA and convB) with batch normalization¹⁰ and leaky ReLU activation functions¹¹, and performs an element-wise addition (x += d) with the output of the first convolutional layer (d) to create a residual connection. Finally, it applies max pooling to reduce the spatial dimensions of the output. The function returns both the output after the residual connection (x) and the pooled output (p).
The upscaling block takes in the feature map tensor (x) from the previous layer and a skip-connection tensor (skip) from the corresponding downsampling block. The call method performs upsampling (us) on the input tensor. The upsampled tensor is concatenated with the skip tensor along the channel axis and then passed through two sets of convolutional layers. The function returns the output tensor after the convolutional layers.
The program also makes use of a BottleNeckBlock function, which takes a feature map tensor as an input and applies further convolutional layers. The main purpose of the bottleneck block is to compress the representation, allowing for a more efficient encoding of the data: using 1x1, 3x3 and 1x1 convolutions, the 1x1 convolutions are responsible for reducing and then increasing the number of filters. The method returns the output tensor (Kakumani, A.K - 2022).
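A condensed sketch of these three blocks, written as Keras layers, is shown below. The layer names (convA, convB) and the overall structure follow the description above, but the exact filter counts and configuration are assumptions rather than the paper's verbatim code:

```python
import tensorflow as tf
from tensorflow.keras import layers


class DownscaleBlock(layers.Layer):
    def __init__(self, filters, **kwargs):
        super().__init__(**kwargs)
        self.convA = layers.Conv2D(filters, 3, padding="same")
        self.convB = layers.Conv2D(filters, 3, padding="same")
        self.bn_a = layers.BatchNormalization()
        self.bn_b = layers.BatchNormalization()
        self.relu = layers.LeakyReLU(0.2)
        self.pool = layers.MaxPool2D(2)

    def call(self, x):
        d = self.relu(self.bn_a(self.convA(x)))   # first convolution
        x = self.relu(self.bn_b(self.convB(d)))   # second convolution
        x = x + d                                  # residual connection
        return x, self.pool(x)                     # feature map and pooled output


class UpscaleBlock(layers.Layer):
    def __init__(self, filters, **kwargs):
        super().__init__(**kwargs)
        self.us = layers.UpSampling2D(2)
        self.concat = layers.Concatenate()
        self.convA = layers.Conv2D(filters, 3, padding="same", activation="relu")
        self.convB = layers.Conv2D(filters, 3, padding="same", activation="relu")

    def call(self, x, skip):
        x = self.concat([self.us(x), skip])        # merge with the encoder feature map
        return self.convB(self.convA(x))


class BottleNeckBlock(layers.Layer):
    def __init__(self, filters, **kwargs):
        super().__init__(**kwargs)
        self.convA = layers.Conv2D(filters, 3, padding="same", activation="relu")
        self.convB = layers.Conv2D(filters, 3, padding="same", activation="relu")

    def call(self, x):
        return self.convB(self.convA(x))
```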
Loss Functions
After the model has been built, loss functions need to be defined. The model uses three loss functions: structural
similarity index (SSIM), L1 Loss and depth smoothness loss. Out of the three loss functions, SSIM contributes the
most to improving model performance by measuring the similarity between two images (Basu, V - 2021).
The SSIM index is calculated using three key components: luminance, contrast, and structure (Glew, D & Vrscay,
E.R).
The luminance of an image is its overall brightness. SSIM compares the luminance values of the pixels in the reference and distorted images. The luminance component is defined by the equation:

l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}

where the mean luminance values of the reference image (\mu_x) and distorted image (\mu_y) are computed, along with the variances of their luminance values (\sigma_x^2 and \sigma_y^2).
⁸ A tensor is an algebraic object that describes a multilinear relationship between sets of algebraic objects related to a vector space (https://en.wikipedia.org/wiki/Tensor).
⁹ A filter acts as a single template or pattern which, when convolved across the input, finds similarities between the stored template and different locations/regions in the input image (https://www.analyticsvidhya.com/blog/2022/01/convolutional-neural-network-an-overview).
¹⁰ A method used to make the training of artificial neural networks faster (https://en.wikipedia.org/wiki/Batch_normalization).
¹¹ An activation function that introduces the property of non-linearity to a deep learning model and solves the vanishing gradients issue (https://builtin.com/machine-learning/relu-activation-function).
The contrast component of the index measures the local standard deviations of pixel intensities within an image. It uses the standard deviations of the reference image (x) and the distorted image (y), and is defined by the equation:

c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}
The structure component captures the correlation between neighbouring pixels in an image. It involves calculating the covariance between corresponding pixels in the reference and distorted images, and is defined by the equation:

s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}
In all the equations, regularisation constants (C_1, C_2 and C_3) are used to prevent instability in image regions where the local mean or standard deviation is near zero; a small non-zero value is chosen for these constants so that the denominators can never reach zero (MathWorks - 2021). The three components are then combined into a single index, SSIM(x, y) = l(x, y)^{\alpha} \cdot c(x, y)^{\beta} \cdot s(x, y)^{\gamma}, where exponent values less than 1 can be used to account for non-linearities. The final SSIM index ranges from -1 to 1, where a value of 1 indicates a perfect match between the two images (MathWorks - 2021).
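A sketch of how the three loss terms might be combined is shown below; the weightings are assumptions, and tf.image.ssim and tf.image.image_gradients provide the SSIM and smoothness terms:

```python
import tensorflow as tf

def depth_loss(y_true, y_pred, max_depth=1.0,
               w_ssim=0.85, w_edges=0.9, w_l1=0.1):
    # Structural similarity term (1 - SSIM, so that lower is better).
    ssim_term = tf.reduce_mean(
        1 - tf.image.ssim(y_true, y_pred, max_val=max_depth))

    # Depth smoothness: penalise large gradients in the predicted map.
    dy_pred, dx_pred = tf.image.image_gradients(y_pred)
    smooth_term = tf.reduce_mean(tf.abs(dy_pred)) + tf.reduce_mean(tf.abs(dx_pred))

    # Point-wise L1 (mean absolute error) term.
    l1_term = tf.reduce_mean(tf.abs(y_true - y_pred))

    return w_ssim * ssim_term + w_edges * smooth_term + w_l1 * l1_term
```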
Tracking loss-function metrics is important in machine learning, as it provides information about how well the model is performing and what needs to be done to optimise it further. When the loss is minimized on the training set, the model is encouraged to capture meaningful patterns in the data rather than memorizing the training examples (Saravanan, P - 2021).
Model Training
After defining the loss functions, the final step was training the model. The program used the following hyperparameters: LR = 0.0002, EPOCHS = 30 and BATCH_SIZE = 32 (Basu, V - 2021).
The learning rate (LR) determines the step size at which the model's parameters are updated during training. It affects the speed of convergence (how quickly the model reaches its limit in accuracy) and the stability of the optimization process (Careerera - 2022). A larger learning rate may lead to faster convergence but can also cause overshooting (Google Machine Learning Crash Course - 2022).
An epoch refers to one complete pass through the entire training dataset during training. Increasing the number of
epochs can potentially improve the model's performance, but too many epochs can lead to overfitting.
The batch size determines the number of samples processed before the model's parameters are updated (Varghese,
R - 2023). A larger batch size can result in faster training, but it may require more memory. Smaller batch sizes
can offer more stochasticity and generalize better but may lead to slower convergence (Google Machine Learning
Crash Course - 2022).
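Putting the stated hyperparameters together, a minimal training sketch (assuming the model, datasets and depth_loss from the earlier sketches) could look like this:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=2e-4)    # LR = 0.0002
model.compile(optimizer=optimizer, loss=depth_loss)

model.fit(
    train_ds,               # batches of (rgb, depth) pairs, batch size 32
    validation_data=val_ds,
    epochs=30,              # EPOCHS = 30
)
```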
After the model had been trained, it was tested in a real-world scenario where the same image of the bins was inputted into the program.
Figure 11- Input and output of the depth estimation algorithm (Author’s Own - 2023)
In the left-hand image of Figure 11, it is clear that the bin on the left is further away than the bin on the right, and this spatial relationship is accurately shown in heat-map form in the right-hand image. The bin on the right appears brighter, with a lighter shade of green, indicating its closer proximity. The heat-map data can be converted into auditory feedback, with different pitches, volumes or tones assigned to different depth levels; the blind person can listen to this feedback and perceive depth from the sound cues.
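As a simple illustration of this idea, the sketch below maps a depth value onto the pitch of a short tone; the frequency range and the optional sounddevice playback are assumptions, not part of the paper's program:

```python
import numpy as np

def depth_to_tone(depth_m, near=0.5, far=5.0,
                  f_high=1200.0, f_low=200.0, sr=22050, secs=0.2):
    """Return a short sine tone whose pitch rises as the object gets closer."""
    t = np.clip((depth_m - near) / (far - near), 0.0, 1.0)   # 0 = near, 1 = far
    freq = f_high + t * (f_low - f_high)                     # near -> high pitch
    samples = np.arange(int(sr * secs)) / sr
    return np.sin(2 * np.pi * freq * samples).astype("float32")

tone = depth_to_tone(1.2)   # could be played with, e.g., sounddevice.play(tone, 22050)
```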
Optical Character Recognition (OCR)
Since the program relies on a 3rd-party OCR Python package, it makes an application programming interface (API)¹² request. The process begins with an input image, which undergoes pre-processing. This image is then fed into an OCR engine which has been trained on a large dataset and is processed by an open-source library called Leptonica. The engine generates a textual output, which is returned as a response (Figure 11) (Parthasarathy, B - 2018).
To recognise a single character, a CNN is typically used; however, as most use cases involve an arbitrary sequence of characters, these are recognised using recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, a popular form of RNN (Zelic, F & Sable, A - 2023).
The fundamental idea behind LSTM is to address the vanishing gradient problem. The vanishing gradient problem
refers to the issue where gradients, which carry information about how to update the network's parameters during
training, diminish exponentially as they are backpropagated through many time steps. This makes it difficult for
the network to learn long-term patterns in the data (Wang, C-F - 2019).
¹² Enables two software components to communicate with each other using a set of definitions and protocols (Amazon AWS).
LSTM tackles the vanishing gradient problem by introducing a memory cell, which is responsible for storing and
propagating information across time steps (Hochreiter, S & Schmidhuber, J - 1997). During the training process,
the parameters are learned by minimizing a loss function through backpropagation and gradient descent. By utiliz-
ing LSTMs, OCR systems can learn to recognize and interpret the sequential patterns in characters, allowing them
to accurately transcribe text from images (Olah, C - 2015).
After pre-processing the image and inputting it into PyTesseract, the program correctly output the text in the image.
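A minimal sketch of this OCR step is shown below; the pre-processing choices and file name are illustrative assumptions:

```python
import cv2
import pytesseract

image = cv2.imread("sign.jpg")
grey = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Otsu thresholding gives a clean black-and-white image for the OCR engine.
_, binary = cv2.threshold(grey, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

text = pytesseract.image_to_string(binary)   # OCR via the Tesseract engine
print(text)
```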
This output can then be converted into speech, enabling blind individuals to access and consume written information. Furthermore, OCR can be used to convert printed text into Braille¹³, allowing people with vision impairment to read the content on Braille displays, which provides access to textual information in a format that is already very familiar to these individuals.
¹³ A system of reading and writing a specific language without the use of sight (https://brailleworks.com/braille-resources/what-is-braille/).
Conclusion
This paper outlines the challenges faced by individuals with vision impairments in navigating their environment and proposes a promising solution using computer vision techniques. My volunteering experience at the blind school and the analysis carried out during that time have strengthened my commitment to creating a solution. Witnessing the dedication of the staff in helping children navigate the classroom and in adapting existing learning materials into specialised resources, I recognized the need for innovative methods to enhance accessibility and improve the quality of life of the blind population. Through careful evaluation of existing technological solutions, I identified key areas that required enhancement. To address these challenges, I developed a solution built on machine learning: through audio-based information delivery it opens new possibilities for blind individuals to access and engage with written content, and through object detection and depth estimation it gives them feedback on their environment so that they can navigate their surroundings safely.
However, there is a concern that visually impaired people will become too reliant on this technology and neglect practising important skills. A possible solution is to activate the assistive technology selectively, based on the specific needs and context of the individual. For instance, blind individuals can choose to turn the program off in familiar environments such as their homes, where they have already developed a good sense of spatial awareness. By doing so, they can maintain their proficiency in other important skills such as Braille reading and echolocation. This approach encourages blind individuals to continue practising and honing their abilities, while also utilising the technology as a valuable tool in unfamiliar or challenging public environments where additional support may be necessary. The program should be viewed as a tool that provides additional support rather than a device on which individuals become overly dependent; blind individuals need to recognize it as an aid that complements their existing skills and abilities.
Machine learning and computer vision applications could greatly enhance quality of life, foster independence, and provide equal educational and employment opportunities, offering even greater support to empower and assist this community in the years to come.
Bibliography
(1) Bourne, R.R.A et al. (2020); “Global Prevalence of Blindness and Distance and Near Vision Impairment
in 2020: progress towards the Vision 2020 targets and what the future holds.” [Accessed 3/1/23]
(2) Baxter, K & Beresford B. (2016); “A Review of Methods of Evaluation and Outcome Measurement of a
Complex Intervention in Social Care: The Case of Assistance Dogs”, pp 10-11; [Accessed 3/1/23]
(3) Ali, S. (2023); “Unveiling the Power of Object Detection: Revolutionizing Visual Perception”; [Accessed
15/2/23]
(4) Kundu, R. (2023) “YOLO: Algorithm for Object Detection Explained [+Examples]”. [Accessed 6/6/23]
(5) Gandhi, R. (2018) "R-CNN, Fast R-CNN, Faster R-CNN, YOLO — Object Detection Algorithms". [Accessed 24/5/23]
(6) Redmon, J et al. (2016) “You Only Look Once: Unified, Real-Time Object Detection”; pp 2-5; [Accessed
19/4/23]
(7) Keita, Z. (2022) “YOLO Object Detection Explained. Understand YOLO object detection, its benefits,
how it has evolved over the last couple of years and some real-life applications.” [Accessed 11/5/23]
(8) Mohamed, N.A. (2020) “Moving object detection via TV-L1 optical flow in fall-down videos”. [Accessed
11/5/23]
(9) Tan, D. (2020) “Depth Estimation: Basics and Intuition”. [Accessed 14/5/23]
(10) Carter, J et al. (2012) “Lidar 101: An Introduction to Lidar Technology, Data, and Applications”. pp 1-3
& 9-12 [Accessed 4/6/23]
(11) Vasiljevic, I et al. (2019) “DIODE: A Dense Indoor and Outdoor Depth Dataset”. [Date Accessed
29/4/23]
(12) Basu, V. (2021) “Monocular depth estimation. Implement a depth estimation model with a convnet”. [Ac-
cessed 29/4/23]
(13) Bharath, K. (2021) "U-Net Architecture for Image Segmentation". Link can be found at https://blog.paperspace.com/unet-architecture-image-segmentation/ [Accessed 30/4/23]
(14) Tomar, N. (2021) “What is UNET?”. Link can be found at https://medium.com/analytics-vidhya/what-is-
unet-157314c87634 [Accessed 30/4/23]
(15) Kakumani, A.K (2022) “BRB U-Net: Bottleneck Residual Blocks in U-Net for Light-Weight Semantic
Segmentation”. [Accessed 3/5/2023]
(16) Glew, D & Vrscay, E.R “Max and min values of the structural similarity (SSIM) function S(x, a) on the
L2 sphere SR(a), a ∈ RN” [Accessed 11/5/23]
(17) MathWorks (2021) "(SSIM) index for measuring image quality". Link can be found at https://www.mathworks.com/help/images/ref/ssim.html [Accessed 11/5/23]
(18) Saravanan, P. (2021) “Understanding Loss Functions in Machine Learning”. [Accessed 12/5/23]
(19) Careerera. (2022) "What is convergence theory in Machine Learning | Convergence in gradient descent | Machine Learning". Video link can be found at https://www.youtube.com/watch?v=2QCagwYlVaI [Accessed 10/5/23]
(20) Google Machine Learning Crash Course. (2022) “Reducing Loss” [Accessed 4/4/23]
(21) Varghese, R et al. (2023) "International Journal for Research in Applied Science & Engineering Technology". Volume 11, Issue V. [Accessed 29/5/23]
(22) Imtiaz, H. (2020) “A Beginners Guide to Tesseract OCR Using Pytesseract”. [Accessed 16/5/23]
(23) Parthasarathy, B. (2018) "Build your own OCR (Optical Character Recognition) for free". [Accessed 17/5/23]
(24) Wang, C-F. (2019) “The Vanishing Gradient Problem, The Problem, Its Causes, Its Significance, and Its
Solutions”. [Accessed 22/5/23]
(25) Hochreiter, S & Schmidhuber, J. (1997) “LONG SHORT-TERM MEMORY”. Link can be found at
http://www.bioinf.jku.at/publications/older/2604.pdf [Accessed 22/5/23]
(26) Olah, C. (2015) "Understanding LSTM Networks". Link can be found at http://colah.github.io/posts/2015-08-Understanding-LSTMs/?ref=nanonets.com [Accessed 24/5/23]
(27) Preuhs, E. (2022) “Acquisition and Reconstruction Methods for Multidimensional and Quantitative Mag-
netic Resonance Imaging”. [Accessed 4/6/23]