
Theory of Deep Learning

UNIT-1
Deep learning is a branch of machine learning based on artificial neural
networks. It is capable of learning complex patterns and relationships within
data, and we do not need to explicitly program every rule. It has become
increasingly popular in recent years due to advances in processing power and
the availability of large datasets. It is built on artificial neural networks (ANNs),
also known as deep neural networks (DNNs). These neural networks are
inspired by the structure and function of the biological neurons in the human
brain, and they are designed to learn from large amounts of data.
1. Deep Learning is a subfield of Machine Learning that involves the use of
neural networks to model and solve complex problems. Neural networks are
modeled after the structure and function of the human brain and consist of
layers of interconnected nodes that process and transform data.
2. The key characteristic of Deep Learning is the use of deep neural networks,
which have multiple layers of interconnected nodes. These networks can
learn complex representations of data by discovering hierarchical patterns
and features in the data. Deep Learning algorithms can automatically learn
and improve from data without the need for manual feature engineering.
3. Deep Learning has achieved significant success in various fields, including
image recognition, natural language processing, speech recognition, and
recommendation systems. Some of the popular Deep Learning architectures
include Convolutional Neural Networks (CNNs), Recurrent Neural Networks
(RNNs), and Deep Belief Networks (DBNs).
4. Training deep neural networks typically requires a large amount of data and
computational resources. However, the availability of cloud computing and
the development of specialized hardware, such as Graphics Processing
Units (GPUs), has made it easier to train deep neural networks.
In summary, Deep Learning is a subfield of Machine Learning that uses deep
neural networks to model and solve complex problems. It has achieved
significant success in many fields, and its use is expected to continue to grow
as more data and more powerful computing resources become available.

What is Deep Learning?


Deep learning is the branch of machine learning based on artificial neural
network architectures. An artificial neural network (ANN) uses layers of
interconnected nodes, called neurons, that work together to process and learn
from the input data.
In a fully connected deep neural network, there is an input layer followed by one
or more hidden layers connected one after the other. Each neuron receives input
from the neurons in the previous layer (or from the input layer). The output of one
neuron becomes the input to neurons in the next layer of the network, and this
process continues until the final layer produces the output of the network. The
layers of the neural network transform the input data through a series of
nonlinear transformations, allowing the network to learn complex
representations of the input data.
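
To make this layer-by-layer computation concrete, here is a minimal NumPy sketch of a forward pass through a small fully connected network with one hidden layer. The layer sizes, random weights, and the ReLU/sigmoid choices are illustrative assumptions, not something specified in these notes.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 4 input features, 8 hidden neurons, 1 output neuron.
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def forward(x):
    h = np.maximum(0.0, x @ W1 + b1)          # hidden layer: nonlinear (ReLU) transformation
    out = 1 / (1 + np.exp(-(h @ W2 + b2)))    # final layer produces the network output
    return out

x = rng.normal(size=(1, 4))                   # one example with 4 input features
print(forward(x))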

Applications of Deep Learning:


The main applications of deep learning can be divided into computer vision,
natural language processing (NLP), and reinforcement learning.
Computer vision
In computer vision, deep learning models enable machines to identify and
understand visual data. Some of the main applications of deep learning in
computer vision include:
 Object detection and recognition: Deep learning models can identify and
locate objects within images and videos, enabling applications such as
self-driving cars, surveillance, and robotics.
 Image classification: Deep learning models can be used to classify images
into categories such as animals, plants, and buildings. This is used in
applications such as medical imaging, quality control, and image retrieval.
 Image segmentation: Deep learning models can segment images into
different regions, making it possible to identify specific features within
images.
Natural language processing (NLP)
In NLP, deep learning models enable machines to understand and generate
human language. Some of the main applications include:
 Automatic text generation: Deep learning models can learn a corpus of
text, and new text such as summaries or essays can be generated
automatically using these trained models.
 Language translation: Deep learning models can translate text from one
language to another, making it possible to communicate with people from
different linguistic backgrounds.
 Sentiment analysis: Deep learning models can analyze the sentiment of
a piece of text, making it possible to determine whether the text is
positive, negative, or neutral. This is used in applications such as
customer service, social media monitoring, and political analysis.
 Speech recognition: Deep learning models can recognize and
transcribe spoken words.
Advantages of Deep Learning:

1. High accuracy: Deep Learning algorithms can achieve state-of-the-art
performance in various tasks, such as image recognition and natural
language processing.
2. Automated feature engineering: Deep Learning algorithms can automatically
discover and learn relevant features from data without the need for manual
feature engineering.
3. Scalability: Deep Learning models can scale to handle large and complex
datasets, and can learn from massive amounts of data.
4. Flexibility: Deep Learning models can be applied to a wide range of tasks
and can handle various types of data, such as images, text, and speech.
5. Continual improvement: Deep Learning models can continually improve their
performance as more data becomes available.

Ingredients of Deep Learning

Dataset

The dataset is the first ingredient in training a neural network — the data itself along with the
problem we are trying to solve define our end goals. For example, are we using neural networks
to perform a regression analysis to predict the value of homes in a specific suburb in 20 years? Is
our goal to perform unsupervised learning, such as dimensionality reduction? Or are we trying to
perform classification?

In the context of this book, we’re strictly focusing on image classification; however, the
combination of your dataset and the problem you are trying to solve influences your choice in
loss function, network architecture, and optimization method used to train the model. Usually,
we have little choice in our dataset (unless you’re working on a hobby project) — we are given a
dataset with some expectation on what the results from our project should be. It is then up to us
to train a machine learning model on the dataset to perform well on the given task.
Loss Function

Given our dataset and target goal, we need to define a loss function that aligns with the problem
we are trying to solve. In nearly all image classification problems using deep learning, we’ll be
using cross-entropy loss. For more than two classes we call this categorical cross-entropy; for
two-class problems, we call the loss binary cross-entropy.
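
As a small illustration (my own example, not from the text), the two losses can be computed for a single prediction as follows; the labels and predicted probabilities are made-up values.

import numpy as np

def binary_cross_entropy(y_true, p):
    # y_true in {0, 1}; p is the predicted probability of the positive class
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, p):
    # y_onehot is a one-hot label vector; p holds predicted class probabilities
    return -np.sum(y_onehot * np.log(p))

print(binary_cross_entropy(1, 0.8))                  # two-class problem
print(categorical_cross_entropy(np.array([0, 1, 0]),
                                np.array([0.1, 0.7, 0.2])))   # more than two classes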

Model/Architecture

Your network architecture can be considered the first actual “choice” you have to make as an
ingredient. Your dataset is likely chosen for you (or at least you’ve decided that you want to
work with a given dataset). And if you’re performing classification, you’ll in all likelihood be
using cross-entropy as your loss function.

However, your network architecture can vary dramatically, as can the optimization method you
choose to train your network. After taking the time to explore your dataset, look at:

1. How many data points you have.

2. The number of classes.

3. How similar/dissimilar the classes are.

4. The intra-class variance.

You should start to develop a “feel” for a network architecture you are going to use. This takes
practice as deep learning is part science, part art.

Keep in mind that the number of layers and nodes in your network architecture (along with any
type of regularization) is likely to change as you perform more and more experiments. The more
results you gather, the better equipped you are to make informed decisions on which techniques
to try next.
Optimization Method

Despite the many newer optimization methods, SGD remains the workhorse of deep learning:
most neural networks are trained via SGD, including the networks obtaining state-of-the-art
accuracy on challenging image datasets such as ImageNet.

When training deep learning networks, especially when you’re first getting started and learning
the ropes, SGD should be your optimizer of choice. You then need to set a proper learning rate
and regularization strength, the total number of epochs the network should be trained for, and
whether or not momentum (and if so, which value) or Nesterov acceleration should be used.
Take the time to experiment with SGD as much as you possibly can and become comfortable
with tuning the parameters.
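
The hyperparameters listed above map directly onto the SGD update rule. Below is a minimal NumPy sketch of repeated updates with a learning rate, L2 regularization (weight decay), and momentum; all values are illustrative assumptions rather than recommended settings.

import numpy as np

def sgd_step(w, grad, velocity, lr=0.01, weight_decay=1e-4, momentum=0.9):
    g = grad + weight_decay * w               # regularization strength enters as weight decay
    velocity = momentum * velocity - lr * g   # momentum accumulates past gradients
    return w + velocity, velocity

w, v = np.zeros(3), np.zeros(3)
for grad in [np.array([0.5, -0.2, 0.1])] * 5:  # stand-in for minibatch gradients
    w, v = sgd_step(w, grad, v)
print(w)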

Becoming familiar with a given optimization algorithm is similar to mastering how to drive a car
— you drive your own car better than other people’s cars because you’ve spent so much time
driving it; you understand your car and its intricacies. Oftentimes, a given optimizer is chosen to
train a network on a dataset not because the optimizer itself is better, but because the driver (i.e.,
deep learning practitioner) is more familiar with the optimizer and understands the “art” behind
tuning its respective parameters.

Keep in mind that obtaining a reasonably performing neural network on even a small/medium
dataset can take 10’s to 100’s of experiments even for advanced deep learning users — don’t be
discouraged when your network isn’t performing extremely well right out of the gate. Becoming
proficient in deep learning will require an investment of your time and many experiments — but
it will be worth it once you master how these ingredients come together.
A Formal Proof of the Expressiveness of Deep Learning
Deep learning has had a profound impact on computer science in recent years, with applications to
image recognition, language processing, bioinformatics, and more. Recently, Cohen et al. provided
theoretical evidence for the superiority of deep learning over shallow learning. We formalized their
mathematical proof using Isabelle/HOL. The Isabelle development simplifies and generalizes the original
proof, while working around the limitations of the HOL type system. To support the formalization, we
developed reusable libraries of formalized mathematics, including results about the matrix rank, the
Borel measure, and multivariate polynomials as well as a library for tensor analysis.
Isabelle [33,34] is a generic proof assistant that supports many object logics. The metalogic is based on
an intuitionistic fragment of Church’s simple type theory [14]. The types are built from type variables α,
β, … and n-ary type constructors, normally written in postfix notation (e.g., α list). The infix type
constructor α ⇒ β is interpreted as the (total) function space from α to β. Function applications are
written in a curried style (e.g., f x y). Anonymous functions x ↦ y_x are written λx. y_x. The notation t :: τ
indicates that term t has type τ. Isabelle/HOL is an instance of Isabelle. Its object logic is classical higher-
order logic supplemented with rank-1 polymorphism and Haskell-style type classes. The distinction
between the metalogic and the object logic is important operationally but not semantically. Isabelle’s
architecture follows the tradition of the theorem prover LCF [21] in implementing a small inference
kernel that verifies the proofs. Trusting an Isabelle proof involves trusting this kernel, the formulation of
the main theorems, the assumed axioms, the compiler and runtime system of Standard ML, the
operating system, and the hardware. Specification mechanisms help us define important classes of types
and functions, such as inductive datatypes and recursive functions, without introducing axioms. Since
additional axioms can lead to inconsistencies, it is generally good style to use these mechanisms. Our
formalization is mostly written in Isar [43], a proof language designed to facilitate the development of
structured, human-readable proofs. Isar proofs allow us to state intermediate proof steps and to nest
proofs. This makes them more maintainable than unstructured tactic scripts, and hence more
appropriate for substantial formalizations. Isabelle locales are a convenient mechanism for structuring
large proofs. A locale fixes types, constants, and assumptions within a specified scope. For example, an
informal mathematical text stating “in this section, let A be a set of natural numbers and B a subset of A”
could be formalized by introducing a locale AB_subset as follows:

locale AB_subset = fixes A B :: nat set assumes B ⊆ A

Mathematical Preliminaries

We provide a short introduction to tensors and the Lebesgue measure. We expect familiarity with basic
matrix and polynomial theory.

Tensors

Tensors can be understood as multidimensional arrays, with vectors and matrices as the one- and
two-dimensional cases. Each index corresponds to a mode of the tensor. For matrices, the modes are
called "row" and "column". The number of modes is the order of the tensor. The number of values an
index can take in a particular mode is the dimension in that mode. Thus, a real-valued tensor
A ∈ R^(M1×···×MN) of order N and dimension Mi in mode i contains values A_{d1,...,dN} ∈ R for
di ∈ {1, ..., Mi}. As for vectors and matrices, addition + is defined as componentwise addition for
tensors of identical dimensions. The product ⊗ : R^(M1×···×MN) × R^(M_{N+1}×···×M_{N+N'}) → R^(M1×···×M_{N+N'})
is a binary operation that generalizes the outer vector product. For real tensors, it is associative and
distributes over addition. The canonical polyadic rank, or CP-rank, associates a natural number with a
tensor, generalizing the matrix rank. The matricization [A] of a tensor A is a matrix obtained by
rearranging A's entries using a bijection between the tensor and matrix entries. It has the following
property:

Lemma 1 Given a tensor A, we have rank [A] ≤ CP-rank A.
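
As a concrete illustration (my own example, not from the paper), the tensor product of an order-2 tensor and an order-1 tensor is an order-3 tensor whose dimensions are those of the operands put together; NumPy's outer product and a reshape can be used to check the order, the dimensions, and one possible matricization.

import numpy as np

A = np.arange(6.0).reshape(2, 3)     # order-2 tensor (a matrix), dimensions 2 and 3
b = np.array([1.0, 2.0, 3.0, 4.0])   # order-1 tensor (a vector), dimension 4

T = np.multiply.outer(A, b)          # tensor product: order 2 + 1 = 3
print(T.shape)                       # (2, 3, 4)

M = T.reshape(2, 3 * 4)              # one matricization: a bijective rearrangement of T's entries
print(np.linalg.matrix_rank(M))      # rank of this matricization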


Lebesgue Measure

The Lebesgue measure is a mathematical description of the intuitive concept of length, surface, or
volume. It extends this concept from simple geometrical shapes to a large class of subsets of R^n,
including all closed and open sets, although it is impossible to design a measure that caters for all
subsets of R^n while maintaining intuitive properties. The sets to which the Lebesgue measure can
assign a volume are called measurable. The volume assigned to a measurable set can be a nonnegative
real number or ∞. A set of Lebesgue measure 0 is called a null set. If a property holds for all points in
R^n except for a null set, the property is said to hold almost everywhere.

The following lemma [13] about polynomials will be useful for the proof of the fundamental theorem of
network capacity.

Lemma 2 If p ≢ 0 is a polynomial in d variables, the set of points x ∈ R^d with p(x) = 0 is a Lebesgue null
set.

Data for Deep Learning


The minimum requirements to successfully apply deep learning depend on the
problem you're trying to solve. In contrast to static benchmark datasets like
MNIST and CIFAR-10, real-world data is messy, varied, and evolving, and that is
the data practical deep learning solutions must deal with.

Types of Data
Deep learning can be applied to any data type. The data types you work with,
and the data you gather, will depend on the problem you’re trying to solve.

1. Sound (Voice Recognition)
2. Text (Classifying Reviews)
3. Images (Computer Vision)
4. Time Series (Sensor Data, Web Activity)
5. Video (Motion Detection)

Use Cases
Deep learning can solve almost any problem of machine perception, including
classifying data, clustering it, or making predictions about it.

 Classification: This image represents a horse; this email looks like spam; this transaction is
fraudulent
 Clustering: These two sounds are similar. This document is probably what user X is looking
for
 Predictions: Given their web log activity, Customer A looks like they are going to stop using
your service

Deep learning is best applied to unstructured data like images, video, sound or
text. An image is just a blob of pixels, a message is just a blob of text. This data
is not organized in a typical, relational database by rows and columns. That
makes it more difficult to specify its features manually.

Common use cases for deep learning include sentiment analysis, classifying
images, predictive analytics, recommendation systems, anomaly detection and
more.


Data Attributes
For deep learning to succeed, your data needs to have certain characteristics.

Relevancy
The data you use to train your neural net must be directly relevant to your
problem; that is, it must resemble as much as possible the real-world data you
hope to process. Neural networks are born as blank slates, and they only learn
what you teach them. If you want them to solve a problem involving certain kinds
of data, like CCTV video, then you have to train them on CCTV video, or
something similar to it. The training data should resemble the real-world data that
they will classify in production.

Proper Classification
If a client wants to build a deep-learning solution that classifies data, then they
need to have a labeled dataset. That is, someone needs to apply labels to the
raw data: “This image is a flower, that image is a panda.” With time and tuning,
this training dataset can teach a neural network to classify new images it has not
seen before.

Formatting
Neural networks eat vectors of data and spit out decisions about those vectors.
All data needs to be vectorized, and the vectors should be the same length when
they enter the neural net. To get vectors of the same length, it’s helpful to have,
say, images of the same size (the same height and width). So sometimes you
need to resize the images. This is called data pre-processing.
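
A minimal sketch of this kind of pre-processing, assuming the Pillow library and a hypothetical image file; real pipelines typically add normalization, augmentation, and batching.

import numpy as np
from PIL import Image

def image_to_vector(path, size=(64, 64)):
    # Resize so every image yields a vector of the same length,
    # then flatten the pixel grid into a single vector in [0, 1].
    img = Image.open(path).convert("RGB").resize(size)
    return np.asarray(img, dtype=np.float32).reshape(-1) / 255.0

# vec = image_to_vector("example.jpg")   # hypothetical file; vec.shape == (64 * 64 * 3,)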

Accessibility
The data needs to be stored in a place that’s easy to work with. A local file
system, or HDFS (the Hadoop file system), or an S3 bucket on AWS, for
example. If the data is stored in many different databases that are unconnected,
you will have to build data pipelines. Building data pipelines and performing
preprocessing can account for at least half the time you spend building deep-
learning solutions.

Minimum Data Requirement


The minimums vary with the complexity of the problem, but 100,000 instances in
total, across all categories, is a good place to start.

If you have labeled data (i.e. categories A, B, C and D), it’s preferable to have an
evenly balanced dataset with 25,000 instances of each label; that is, 25,000
instances of A, 25,000 instances of B and so forth.

Curse of Dimensionality

The Curse of Dimensionality sounds like something straight out of the wizarding world, but it is
simply a common term you'll come across in the machine learning and big data world. It describes
how an increase in the dimensions of the input data results in an exponential increase in the
computational expense and effort required for processing and analyzing that data.

The curse of dimensionality refers to the difficulties a machine learning algorithm faces when
working with data in higher dimensions that did not exist in lower dimensions. This happens
because when you add dimensions (features), the minimum data requirements also increase rapidly.
In other words, as the number of features (columns) increases, you need an exponentially growing
number of samples (rows) to have all combinations of feature values well represented in your
sample.
With the increase in the data dimensions, your model –
 would also increase in complexity.
 would become increasingly dependent on the data it is being trained on.

This leads to overfitting of the model, so even though the model performs really well on
training data, it fails drastically on any real data.
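
A tiny back-of-the-envelope illustration of that growth (my own numbers, assuming each feature is discretized into 10 bins): the number of feature-value combinations that need to be represented grows exponentially with the number of features.

bins = 10   # assumed discretization per feature
for d in [1, 2, 5, 10]:
    print(f"{d} features -> {bins ** d:,} feature-value combinations to cover")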
Quite a few algorithms work best on tall, narrow datasets with fewer features and more
samples. Hence, to remove the curse afflicting your model, you may need to put your
data on a diet, i.e., reduce its dimensions through feature selection and feature
engineering techniques.
Dimensionality Reduction to the Rescue
What you need to understand first is that data features are usually correlated. Hence,
high-dimensional data is often dominated by a rather small number of underlying features. If we
can find a subset of features that represents the information just as well as
the original dataset, we can remove the curse of dimensionality!
This is what dimensionality reduction is: a process of reducing the dimension of your
data to a few principal features.
Fewer input dimensions often mean a simpler model with fewer parameters, or degrees
of freedom. A model with too many degrees of freedom is more prone to overfitting. So, it
is desirable to have more generalized models and input data with fewer features.
Why is Dimensionality Reduction necessary?
 Avoids overfitting: the fewer assumptions a model makes, the simpler it will be.
 Easier computation: the fewer the dimensions, the faster the model trains.
 Improved model performance: removing redundant features and noise leaves less misleading data
and improves model accuracy.
 Lower dimensional data requires less storage space.
 Lower dimensional data can work with algorithms that were unfit for larger dimensions.
How is Dimensionality Reduction done?
Several techniques can be employed for dimensionality reduction depending on the
problem and the data. These techniques are divided into two broad categories:
 Feature Selection: choosing the most important features from the data.
 Feature Extraction: deriving a smaller set of new features from the original ones, for example
through principal component analysis or the matrix factorization discussed next (a small PCA
sketch follows below).
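
A hedged sketch of feature extraction using scikit-learn's PCA on random data (a stand-in for a real dataset); only the number of retained components is a choice the user makes.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))                     # 500 samples, 50 features
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=500)    # make two features strongly correlated

pca = PCA(n_components=10)                         # keep 10 principal features
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                             # (500, 10): same samples, fewer dimensions
print(pca.explained_variance_ratio_.sum())         # fraction of variance retained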
Matrix Factorization
Matrix factorization is one of the most sought-after machine learning recommendation
models. It acts as a catalyst, enabling the system to gauge the customer’s exact
purpose of the purchase, scan numerous pages, shortlist, and rank the right product or
service, and recommend multiple options available. Once the output matches the
requirement, the lead translates into a transaction and the deal clicks.

What is Matrix Factorization?


This mathematical model helps the system split an entity into multiple smaller entries,
through an ordered rectangular array of numbers or functions, to discover the features
or information underlying the interactions between users and items.

Where is Matrix Factorization used?


Once an individual raises a query on a search engine, the machine uses matrix
factorization to generate an output in the form of recommendations. The system uses
two approaches, content-based filtering and collaborative filtering, to make
recommendations.

Content-Based Filtering
This approach recommends items based on user preferences. It matches the
requirement, considering the past actions of the user, patterns detected, or any explicit
feedback provided by the user, and accordingly, makes a recommendation.

Example: If you prefer the chocolate flavor and purchase a chocolate ice cream, the
next time you raise a query, the system shall scan for options related to chocolate, and
then, recommend you to try a chocolate cake.
How does the System make recommendations?

Let us take an example. To purchase a car, in addition to the brand name, people check
for the features available in the car, the most common being safety, mileage, or aesthetic
value. A few buyers consider the automatic gearbox, while others opt for a combination of
two or more features. To understand this concept, let us consider a two-dimensional
vector with the features of safety and mileage.

1. Suppose 4 individuals have been asked to provide a rating on safety and mileage;
their preferences form the left-hand matrix. If an individual likes a feature, we assign
the value 1, and if they do not like that particular feature, we assign 0. We can see
that Persons A and C prefer safety, Person B chooses mileage, and Person D opts
for both safety and mileage.
2. Cars are rated based on the features (items) they offer. A rating of 4 implies strong
features, and 1 depicts weak features.
• Car A ranks high on safety and low on mileage
• Car B is rated high on mileage and low on safety
• Car C has an average rating and offers both safety and mileage
• Car D is low on both mileage and safety
3. The "?" marks are sparse values, where the person either does not know about the
car, does not have the car on their consideration list, or has forgotten to rate it.
4. Let's understand how the matrix at the center is obtained. This matrix represents the
overall rating of all 4 cars given by the individuals. Person A has given an overall
rating of 4 to Car A, 1 to Cars B and D, and 2 to Car C. These values are arrived at
through the following calculations:
 For Car A: (1 safety preference × 4 safety feature) + (0 mileage preference × 1 mileage feature) = 4
 For Car B: (1 safety preference × 1 safety feature) + (0 mileage preference × 4 mileage feature) = 1
 For Car C: (1 safety preference × 2 safety feature) + (0 mileage preference × 2 mileage feature) = 2
 For Car D: (1 safety preference × 1 safety feature) + (0 mileage preference × 2 mileage feature) = 1
Now on the basis of the above calculations, we can predict the overall rating for each
person and all the cars.
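
This per-person, per-car calculation is just a matrix product of the preference matrix and the car-feature matrix. A short NumPy check of the worked example above (the feature values for Cars B-D are read off the calculations in point 4):

import numpy as np

# Rows: Persons A-D; columns: (safety preference, mileage preference).
preferences = np.array([[1, 0],
                        [0, 1],
                        [1, 0],
                        [1, 1]])

# Rows: Cars A-D; columns: (safety feature, mileage feature), as used in the calculations above.
car_features = np.array([[4, 1],
                         [1, 4],
                         [2, 2],
                         [1, 2]])

overall = preferences @ car_features.T   # overall[i, j] = overall rating of car j by person i
print(overall[0])                        # Person A: [4 1 2 1], matching the worked example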

If Person C asks the search engine to recommend available car options, then based on the
content available and his preferences, the machine will compute the ratings shown above:

 It will rank cars based on overall rating.
 It will recommend Car A, followed by Car C.

Mathematics behind the recommendations made using content-based filtering

In the above example, we had two matrices, that is, individual preferences and car
features, and 4 observations to enable the comparison. Now, if there are 'n' observations
in both a and b, then:

 Dot Product
The dot product of two length-n vectors a and b is defined as
a · b = a1 b1 + a2 b2 + ... + an bn.
The dot product is only defined for vectors of the same length, which means there
should be an equal number 'n' of observations in both.

 Cosine
There are alternative approaches or algorithms available to solve a problem. Cosine
similarity is one approach used to measure similarity; it is used to determine the nearest
user in order to provide recommendations. In our chocolate example above, the search
engine can make such a recommendation using cosine similarity. It measures the
similarity between two non-zero vectors. The similarity between two vectors a and b is
the ratio between their dot product and the product of their magnitudes:
cos(a, b) = (a · b) / (||a|| ||b||).

By this definition, the similarity equals 1 if the two vectors point in the same direction,
and 0 if the two are orthogonal. In other words, similarity is a number ranging
from 0 to 1 that tells us the extent of similarity between the two vectors: closer to 1 is
highly similar and closer to 0 is dissimilar.
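
A small NumPy sketch of both measures, using made-up rating vectors for two users:

import numpy as np

a = np.array([4.0, 1.0, 2.0, 1.0])   # e.g., Person A's overall ratings
b = np.array([4.0, 1.0, 2.0, 0.0])   # another user's ratings (made up)

dot = np.dot(a, b)
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))
print(dot, cosine)                    # a cosine close to 1 means the two users are very similar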

 TF-IDF (Term frequency and Inverse document frequency)

Let us assume the matrix only has text or character elements; then the search
engine uses the TF-IDF algorithm to find similarities and make recommendations. Each
word or term has its respective TF and IDF score. The product of the TF and IDF
scores of a term is called the TF-IDF weight of that term.

The TF (term frequency) of a word is the number of times it appears in a document.
When you know it, you can find out whether a term is being used too often or too rarely.
The IDF (inverse document frequency) of a word is a measure of how significant that
term is in the whole corpus.

The higher the TF-IDF score (weight), the more important the term, and vice versa.
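
A minimal sketch using one common convention (raw counts for TF and log(N / document frequency) for IDF; other weightings exist), on a toy three-document corpus:

import math
from collections import Counter

docs = ["chocolate ice cream",
        "chocolate cake",
        "vanilla ice cream"]
tokenized = [d.split() for d in docs]
N = len(docs)

def tf_idf(term, doc_tokens):
    tf = Counter(doc_tokens)[term]                        # raw term frequency in this document
    df = sum(term in tokens for tokens in tokenized)      # number of documents containing the term
    idf = math.log(N / df) if df else 0.0                 # rarer terms get a higher IDF
    return tf * idf

print(tf_idf("chocolate", tokenized[0]))   # common term -> lower weight
print(tf_idf("cake", tokenized[1]))        # rarer term -> higher weight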

Advantages of content-based filtering

a) The model does not require any data about other users, since the recommendations are
specific to one user. This makes it easier to scale up to a large number of users.
b) The model can capture the specific interests of a user and can recommend niche items
that very few other users are interested in.

Disadvantages of content-based filtering

a) Since the feature representation of the items is hand-engineered to some extent,
this technique requires a lot of domain knowledge. Therefore, the model can only be as
good as the hand-engineered features.

b) The model can only make recommendations based on the existing interests of the user.
In other words, the model has limited ability to expand on the user's existing interests.

Collaborative Filtering
This approach uses similarities between users and items simultaneously, to provide
recommendations. It is the idea of recommending an item or making a prediction,
depending on other like-minded individuals. It could comprise a set of users, items,
opinions about an item, ratings, reviews, or purchases.
Example: Suppose Persons A and B both like the chocolate flavor and both have tried the
ice cream and cake; then if Person A buys chocolate biscuits, the system will recommend
chocolate biscuits to Person B.

Types of collaborative filtering

Collaborative filtering is classified into memory-based and model-based approaches.

How does the system make recommendations in collaborative filtering?

In collaborative filtering, we do not have the individual preference and car feature
matrices. We only have the overall rating given by the individuals for each car. As
usual, the data is sparse, implying that the person either does not know about the car,
does not have it on their consideration list, or has forgotten to give a rating.

The task at hand is to predict the rating that Person C might assign to Car C (marked
with "?") based on the similarities in ratings given by other individuals. There are three
steps involved in arriving at a collaborative filtering recommendation.
Step-1

Normalization: Usually, while assigning a rating, individuals tend to give either a high
rating or a low rating across all parameters. Normalization helps balance and even out
such tendencies. This is done by taking the average of the available ratings and
subtracting it from each individual rating (x − x̄).
a) In the case of Person A, x̄ = (4+1+2+1)/4 = 2; in the case of Person B, x̄ = (1+4+2)/3 ≈ 2.3.
Similarly, we can do it for Persons C and D.
b) Then we subtract the average from each individual rating.
In the case of Person A: 4−2, 1−2, 2−2, 1−2; in the case of Person B: 1−2.3, 4−2.3, 2−2.3.
Similarly, we can do it for Persons C and D.
This gives the normalized ratings for all the individuals.

If you add all the numbers in each row, they sum to zero: we have centered each
individual's overall ratings at zero. From zero, if an individual's rating for a car is
positive, it means he likes the car, and if it is negative, it means he does not like the car.
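
A short NumPy sketch of this normalization step, using NaN for the missing ("?") ratings. Person A's row matches the worked example; the exact positions of the other missing entries and the rows for Persons C and D are illustrative assumptions.

import numpy as np

# Rows: persons A-D; columns: cars A-D. NaN marks a missing ("?") rating.
R = np.array([[4.0, 1.0, 2.0, 1.0],
              [1.0, 4.0, 2.0, np.nan],
              [np.nan, 1.0, 2.0, 4.0],
              [2.0, np.nan, 1.0, 4.0]])

row_means = np.nanmean(R, axis=1, keepdims=True)   # average of the available ratings
R_centered = R - row_means                          # x - x̄ ; each row now sums to ~0
print(np.round(R_centered, 2))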

Step-2

Similarity measure: As discussed in content-based filtering, we find the similarity
between two vectors a and b as the ratio between their dot product and the product of
their magnitudes.

Step-3

Neighborhood selection: In the Car C column, we find the maximum similarity
between the ratings assigned by Persons A and D. We use these similarities to arrive at
the prediction.
Review of Matrix factorization techniques
Recommender systems are applications that suggest items or services to users based on their preferences,
behavior, or feedback. For example, Netflix recommends movies, Amazon recommends products, and
Spotify recommends songs. One of the common techniques for building recommender systems is matrix
factorization, which involves decomposing a large matrix of user-item ratings into smaller matrices of
user features and item features. In this article, you will learn what are some advantages and disadvantages
of using matrix factorization for recommender systems.

1. How matrix factorization works

Matrix factorization assumes that each user and item can be represented by a vector of latent features,
such as genre, style, quality, or popularity. These features capture the underlying patterns and preferences
of the users and items. For example, a user who likes action movies may have a high value for the action
feature, while a movie that is a comedy may have a high value for the comedy feature. By multiplying the
user feature vector and the item feature vector, we can obtain an estimate of the rating that the user would
give to the item. The goal of matrix factorization is to find the optimal feature vectors that minimize the
difference between the estimated ratings and the actual ratings.
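
To make the "multiply the two feature vectors" idea concrete, here is a tiny made-up example with two latent features (action, comedy); the numbers are illustrative assumptions.

import numpy as np

user = np.array([0.9, 0.1])    # this user strongly likes action, mildly likes comedy
movie = np.array([0.8, 0.2])   # this movie is mostly an action movie

estimated_rating = np.dot(user, movie)   # higher when user and item features line up
print(estimated_rating)                  # 0.74, on whatever scale the factors encode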

2. Advantages of matrix factorization

Matrix factorization has several advantages for recommender systems. First, it can handle sparse and
incomplete data, which is often the case for user-item ratings. Matrix factorization can fill in the missing
values and predict ratings for unseen items or users. Second, it can reduce the dimensionality and
complexity of the data, which can improve the efficiency and scalability of the recommender system.
Matrix factorization can compress a large matrix into smaller matrices that capture the essential
information and reduce the noise. Third, it can discover latent features and patterns that are not obvious or
explicit in the data. Matrix factorization can reveal hidden similarities and preferences among users and
items, which can enhance the quality and diversity of the recommendations.

3. Disadvantages of matrix factorization

Matrix factorization also has some disadvantages for recommender systems. First, it can suffer from
overfitting and underfitting, which can affect the accuracy and generalization of the recommendations.
Overfitting occurs when the feature vectors fit the data too well and capture the noise or outliers, while
underfitting occurs when the feature vectors fit the data too poorly and miss the important information. To
avoid these problems, matrix factorization needs to balance the trade-off between fitting the data and
regularizing the feature vectors. Second, it can be sensitive to the choice of parameters, such as the
number of features, the learning rate, and the regularization term. These parameters can influence the
performance and convergence of the matrix factorization algorithm. To find the optimal parameters,
matrix factorization needs to perform cross-validation or grid search, which can be time-consuming and
computationally expensive. Third, it can be limited by the linearity and independence assumptions, which
may not hold for some data or scenarios. Matrix factorization assumes that the rating is a linear
combination of the features, and that the features are independent of each other. However, in some cases,
the rating may depend on nonlinear or interactive features, or on external factors such as context, time, or
social influence. To address these limitations, matrix factorization may need to incorporate additional
information or techniques, such as nonlinear functions, feature engineering, or hybrid models.

Supervised learning and unsupervised learning


Machine learning is a field of computer science that gives computers the ability to learn without being
explicitly programmed. Supervised learning and unsupervised learning are two main types of machine
learning.

In supervised learning, the machine is trained on a set of labeled data, which means that the input data is
paired with the desired output. The machine then learns to predict the output for new input data.
Supervised learning is often used for tasks such as classification, regression, and object detection.

In unsupervised learning, the machine is trained on a set of unlabeled data, which means that the input
data is not paired with the desired output. The machine then learns to find patterns and relationships in the
data. Unsupervised learning is often used for tasks such as clustering, dimensionality reduction, and
anomaly detection.

What is Supervised learning?

Supervised learning is a type of machine learning algorithm that learns from labeled data. Labeled data is
data that has been tagged with a correct answer or classification.

Supervised learning, as the name indicates, involves the presence of a supervisor acting
as a teacher. Supervised learning is when we teach or train the machine using data that
is well labelled, which means some data is already tagged with the correct answer. After
that, the machine is provided with a new set of examples (data) so that the supervised
learning algorithm analyses the training data (set of training examples) and produces a
correct outcome from labeled data.
For example, in a labeled dataset of images of elephants, camels and cows, each image
would be tagged with either "Elephant", "Camel" or "Cow".
Key Points:
 Supervised learning involves training a machine from labeled data.
 Labeled data consists of examples with the correct answer or classification.
 The machine learns the relationship between inputs (e.g., the images) and
outputs (their labels).
 The trained machine can then make predictions on new, unlabeled data.
Types of Supervised Learning

Supervised learning is classified into two categories of algorithms:

 Regression: A regression problem is when the output variable is a real
value, such as "dollars" or "weight".
 Classification: A classification problem is when the output variable is a
category, such as "red" or "blue", "disease" or "no disease".
Supervised learning deals with and learns from "labeled" data. This implies that
some data is already tagged with the correct answer.
1- Regression
Regression is a type of supervised learning that is used to predict continuous
values, such as house prices, stock prices, or customer churn. Regression
algorithms learn a function that maps from the input features to the output
value.
Some common regression algorithms include:
 Linear Regression
 Polynomial Regression
 Support Vector Machine Regression
 Decision Tree Regression
 Random Forest Regression
2- Classification
Classification is a type of supervised learning that is used to predict categorical
values, such as whether a customer will churn or not, whether an email is spam
or not, or whether a medical image shows a tumor or not. Classification
algorithms learn a function that maps from the input features to a probability
distribution over the output classes.
Some common classification algorithms include:
 Logistic Regression
 Support Vector Machines
 Decision Trees
 Random Forests
 Naive Bayes
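
A hedged sketch of the supervised workflow described above, using scikit-learn's built-in iris dataset (rather than a spam or churn dataset) and one of the classification algorithms listed:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                     # labeled data: inputs and correct classes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000)               # one of the classification algorithms above
clf.fit(X_train, y_train)                             # learn the input -> label mapping
print(clf.score(X_test, y_test))                      # accuracy on new, unseen examples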
Applications of Supervised learning

Supervised learning can be used to solve a wide variety of problems, including:


 Spam filtering: Supervised learning algorithms can be trained to identify
and classify spam emails based on their content, helping users avoid
unwanted messages.
 Image classification: Supervised learning can automatically classify images
into different categories, such as animals, objects, or scenes, facilitating
tasks like image search, content moderation, and image-based product
recommendations.
 Medical diagnosis: Supervised learning can assist in medical diagnosis by
analyzing patient data, such as medical images, test results, and patient
history, to identify patterns that suggest specific diseases or conditions.
 Fraud detection: Supervised learning models can analyze financial
transactions and identify patterns that indicate fraudulent activity, helping
financial institutions prevent fraud and protect their customers.
 Natural language processing (NLP): Supervised learning plays a crucial
role in NLP tasks, including sentiment analysis, machine translation, and text
summarization, enabling machines to understand and process human
language effectively.
Advantages of Supervised learning
 Supervised learning allows us to collect data and produce outputs based on
previous experience.
 It helps to optimize performance criteria with the help of experience.
 Supervised machine learning helps to solve various types of real-world
computation problems.
 It performs classification and regression tasks.
 It allows estimating or mapping the result to a new sample.
 We have complete control over choosing the number of classes we want in
the training data.
UNIT-2
Deep Matrix Factorization and approximation guarantees
Recently, deep matrix factorization was proposed to cope with
extracting several layers of features. The fundamental purpose of
deep MF is to combine interpretability, as in conventional matrix
factorizations, with the extraction of numerous hierarchical features,
as permitted by multi-layer structures.
Recommendation engines are widely used models that attempt to
identify items that a person will like based on that person’s past
behavior. We’re all familiar with Amazon’s recommendations based on
your past purchasing history, and Netflix recommending shows to you
based on your history and the ratings you’ve given other shows.
Naturally, deep learning is behind many of these systems. In this
tutorial, we will delve into how to use deep learning to build these
recommender systems, and specifically how to implement a technique
called matrix factorization using Apache MXNet. It presumes basic
familiarity with MXNet.

The Netflix Prize is likely the most famous example many data
scientists explored in detail for matrix factorization. Simply put, the
challenge was to predict a viewer’s rating of shows that they hadn’t
yet watched based on their ratings of shows that they had watched in
the past, as well as the ratings that other viewers had given a show.
This takes the form of a matrix where the rows are users, the columns
are shows, and the values in the matrix correspond to the rating of the
show. Most of the values in this matrix are missing, as users have not
rated (or seen) the majority of shows. Below is an illustrative example
of what this data tends to look like, where ratings are shown in purple
and missing values are denoted in white.
The best techniques for filling in these missing values can vary
depending on how sparse the matrix is. If there are not many missing
values, frequently the mean or median value of the observed values
can be used to fill in the missing values. If the data is categorical, as
opposed to numeric, the most frequently observed value is used to fill
in the missing values. This technique, while simple and fast, is fairly
naive, as it misses the interactions that can happen between the
variables in a data set. Most egregiously, if the data is numeric
and bimodal between two distant values, the mean imputation
method can create values that otherwise would not exist in the data
set at all. These improper imputed values increase the noise in the
data set the sparser it is, so this becomes an increasingly bad strategy
as sparsity increases.
Matrix factorization is a simple idea that tries to learn connections
between the known values in order to impute the missing values in a
smart fashion. Simply put, for each row and for each column, it
learns (k) numeric “factors” that represent the row or column. In the
example of the Netflix challenge, the factors for movies could
correspond to “is this a comedy?” or “is a big name actor/actress in
this movie?” The factors for users might correspond to “does this user
like comedies?” and “does this user like this actor/actress?” Ratings
are then determined by a dot product between these two smaller
matrices (the user factors and the movie factors) such that an
individual prediction for how user (i) will rate movie (j) is
calculated by:
rating(i, j) = Σ_{f=1}^{k} user(i, f) · movie(j, f)
This would line up the user factor “does this user like comedies?” with
the movie factor “is this a comedy?” and give a boost if both were
positive.

It is impractical to pre-assign meaning to these factors at large scale;
rather, we need to learn the values directly from data. Once these
factors are learned using the observational data, a simple dot product
between the matrices can allow one to fill in all of the missing values.
If two movies are similar to each other, they will likely have similar
factors, and their corresponding matrix columns will have similar
values. The upside to this approach is that an informative set of
factors can be learned automatically from data, but the downside is
that the meaning of these factors might not be easily interpretable.
This may or may not be an issue for your application. Figure 2 shows
the two steps in this approach. On the left is the training step, where
the factors are learned using the observed data, and on the right is the
prediction step, where the missing values are imputed using the dot
product between the two factor matrices.
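
A minimal NumPy sketch of these two steps (a generic stochastic-gradient factorization, not the MXNet code this tutorial goes on to build): the factors are learned from the observed entries only, and the dot products of the learned factors then impute the missing values.

import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, k = 5, 4, 2               # k latent factors per user and per movie
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)    # toy ratings; 0 marks a missing value
observed = R > 0

U = 0.1 * rng.normal(size=(n_users, k))      # user factors
M = 0.1 * rng.normal(size=(n_movies, k))     # movie factors

lr, reg = 0.01, 0.01
for _ in range(2000):                        # training step: fit the observed entries only
    for i, j in zip(*np.nonzero(observed)):
        err = R[i, j] - U[i] @ M[j]
        U[i] += lr * (err * M[j] - reg * U[i])
        M[j] += lr * (err * U[i] - reg * M[j])

print(np.round(U @ M.T, 2))                  # prediction step: dot products impute the blanks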

Matrix factorization is a linear method, meaning that if there are
complicated non-linear interactions going on in the data set, a simple
dot product may not be able to handle it well. Given the recent
success of deep learning in complicated non-linear computer vision
and natural language processing tasks, it is natural to want to find a
way to incorporate it into matrix factorization as well. A way to do this
is called "deep matrix factorization" and involves the replacement of
the dot product with a neural network that is trained jointly with the
factors. This makes the model more powerful because a neural
network can model important non-linear combinations of factors to
make better predictions.

In traditional matrix factorization, the prediction is the simple dot
product between the factors for each of the dimensions. In contrast, in
deep matrix factorization the factors for both are concatenated
together and used as the input to a neural network whose output is the
prediction. The parameters in the neural network are then trained
jointly with the factors to produce a sophisticated non-linear model for
matrix factorization.
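
A minimal NumPy sketch of the deep variant's forward pass (framework-agnostic, rather than the MXNet implementation the tutorial uses): the user and movie factors are concatenated and fed through a small neural network instead of being combined with a dot product. Layer sizes and the ReLU choice are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
k = 8                                          # factors per user and per movie
U = rng.normal(size=(100, k))                  # user factors (trained jointly with the network)
M = rng.normal(size=(50, k))                   # movie factors

W1, b1 = rng.normal(size=(2 * k, 16)), np.zeros(16)   # network on the concatenated factors
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)

def predict(i, j):
    x = np.concatenate([U[i], M[j]])           # concatenation replaces the dot product
    h = np.maximum(0.0, x @ W1 + b1)           # non-linear hidden layer
    return (h @ W2 + b2)[0]                    # predicted rating

print(predict(3, 7))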

This tutorial will first introduce both traditional and deep matrix
factorization by using them to model synthetic data. We will then move
on to a real-world example and apply these concepts to model the
MovieLens 20M data set, which is conceptually identical to the Netflix
Prize challenge. Lastly, we’ll explore how one can use recent
advances in deep learning and the flexibility of neural networks to
build even more complex models. While this may sound complicated,
fortunately Apache MXNet makes it easy!

Geometric perspective of expressivity


The expressive power of Graph Neural Networks (GNNs) has been studied extensively through
the Weisfeiler-Leman (WL) graph isomorphism test. However, standard GNNs and the WL
framework are inapplicable for geometric graphs embedded in Euclidean space, such as
biomolecules, materials, and other physical systems. In this work, we propose a geometric
version of the WL test (GWL) for discriminating geometric graphs while respecting the
underlying physical symmetries: permutations, rotation, reflection, and translation. We use GWL
to characterise the expressive power of geometric GNNs that are invariant or equivariant to
physical symmetries in terms of distinguishing geometric graphs. GWL unpacks how key design
choices influence geometric GNN expressivity: (1) Invariant layers have limited expressivity as
they cannot distinguish one-hop identical geometric graphs; (2) Equivariant layers distinguish a
larger class of graphs by propagating geometric information beyond local neighbourhoods; (3)
Higher order tensors and scalarisation enable maximally powerful geometric GNNs; and (4)
GWL’s discrimination-based perspective is equivalent to universal approximation.


Input domain partitions and random paths


Recent advances in machine learning and artificial intelligence are now being considered in safety-
critical autonomous systems where software defects may cause severe harm to humans and the
environment. Design organizations in these domains are currently unable to provide convincing
arguments that their systems are safe to operate when machine learning algorithms are used to
implement their software.

In recent years, artificial intelligence utilizing machine learning algorithms has begun to outperform
humans at several tasks, e.g., playing board games and diagnosing skin cancer. These advances are now
being considered in safety-critical autonomous systems where software defects may cause severe harm
to humans and the environment, e.g., airborne collision avoidance systems. Several researchers have
raised concerns regarding the lack of verification methods for these kinds of systems, in which machine
learning algorithms are used to train software deployed in the system. Within the avionics sector,
guidelines describe how design organizations may apply formal methods to the verification of safety-
critical software. Applying these methods to complex and safety-critical software is a non-trivial task due
to practical limitations in computing power and challenges in qualifying complex verification tools. These
challenges are often caused by the high expressiveness of the language in which the software is
implemented. Most research applying formal methods to the verification of machine learning has so
far focused on the verification of neural networks, but there are other models that may be more
appropriate when verifiability is important, e.g., decision trees, random forests, and gradient boosting
machines. Where neural networks are subject to verification of correctness, they are
usually adopted in non-safety-critical cases, e.g., fairness with respect to individual discrimination,
where probabilistic guarantees are plausible [2]. Since our aim is to support safety arguments for digital
artefacts that may be deployed in hazardous situations, we need to rely on guarantees of absence of
misclassifications, or at least recognize when such guarantees cannot be provided. The tree-based
models provide such an opportunity since their structural simplicity makes them easy to analyze
systematically, but large (yet simple) models may still prove hard to verify due to combinatorial
explosion.

• A generalization to include more tree ensembles, e.g., gradient boosting machines, with updated
tool support (VoTE).
• A soundness proof of the associated approximation technique used for this purpose.
• An improved node selection strategy that yields speed improvements in the range of 4%-91%
for trees.

Formal Verification of Neural Networks There has been extensive research on formal verification of
neural networks. Pulina and Tacchella [25] combine SMT solvers with an abstraction-refinement
technique to analyze neural networks with non-linear activation functions. They conclude that formal
verification of realistically sized networks is still an open challenge. Scheibler et al. [27] use bounded
model checking to verify a non-linear neural network controlling an inverted pendulum. They encode
the neural network and differential equations of the system as an SMT formula, and try to verify
properties without success.

DL through the lens of Random Matrix Theory


Random Matrix Theory Proves that Deep Learning Representations of GAN-data
Behave as Gaussian Mixtures. This paper shows that deep learning (DL)
representations of data produced by generative adversarial nets (GANs) are random
vectors which fall within the class of so-called "concentrated" random vectors.

Random matrix theory, at its inception, primarily dealt with the eigenvalue distribution (also
referred to as the spectral measure) of large dimensional random matrices. One of the key
technical tools to study these measures is the Stieltjes transform, often presented as the central
object of the theory [Bai and Silverstein, 2010, Pastur and Shcherbina, 2011]. But signal
processing and machine learning alike are often more interested in subspaces and eigenvectors
(which often carry the structural information of the data) than in eigenvalues. Subspace or
spectral methods, such as principal component analysis (PCA), spectral clustering, and some
semi-supervised learning techniques, are built directly upon the eigenspace spanned by the
several top eigenvectors. Consequently, beyond the Stieltjes transform, a more general
mathematical object, the resolvent of large random matrices, will constitute the cornerstone of
the monograph. The resolvent of a matrix gives access to its spectral measure, to the location
of its isolated eigenvalues, to the statistical behavior of their associated eigenvectors when
random, and consequently provides an entry-door to the performance analysis of numerous
machine learning methods. This chapter introduces the fundamental objects and tools
necessary to characterize the behavior of large dimensional random matrices (the resolvent,
the Stieltjes transform method, etc.) in Section 2.1, with a particular focus on the modern and
powerful technical approach of deterministic equivalents. Section 2.2 then presents some
foundational random matrix results

Modern machine learning tasks are often high-dimensional, due to the large amount of
data, features, and trainable parameters. Mathematical tools such as random matrix
theory have been developed to precisely study simple learning models in the high-
dimensional regime, and such precise analysis can reveal interesting phenomena that
are also empirically observed in deep learning.

In this talk, Denny Wu from the University of Toronto will introduce two examples
studied by him and his research team. He will first talk about the selection of
regularisation hyperparameters in the over-parameterised regime and how the team
established a set of equations that rigorously describes the asymptotic generalisation
error of the ridge regression estimator. These equations led to some surprising findings:
(i) the optimal ridge penalty can be negative and (ii) regularisation can suppress
“multiple descents” in the risk curve. He will then discuss practical implications such as
the implicit bias of first- and second-order optimisers in neural network training.

Next, we will go beyond linear models and characterize the benefit of gradient-based
representation (feature) learning in neural networks. By studying the precise
performance of kernel ridge regression on the trained features in a two-layer neural
network, his team has proven that feature learning could result in a considerable
advantage over the initial random features model, highlighting the role of learning rate
scaling in the initial phase of gradient descent.

Deterministic and random equivalents

This monograph is concerned with the situation where M is a large dimensional random matrix, the
eigenvalues and eigenvectors of which need to be related to the statistical nature of the model design
of M. In the early days of random matrix theory, the main focus was on the limiting spectral measure of
M ∈ R^(n×n), that is, the characterization of a certain "limit" of the spectral measure µ_M of M as the
size of M increases. For this purpose, a natural approach is to study the random Stieltjes transform
m_µM(z) and to show that it admits a limit (in probability or almost surely) m(z) as n → ∞. However,
this method has strong limitations: (i) it supposes that such a limit exists, therefore restricting the study
to very regular models for M, and (ii) it only quantifies (1/n) tr Q_M (through the Stieltjes transform),
thereby discarding all subspace information about M carried in the resolvent Q_M. As a consequence, a
further study of the eigenvectors of M often requires a complete rework. To avoid these limitations,
modern random matrix theory focuses instead on the notion of deterministic equivalents, which are
deterministic matrices (thus finite dimensional objects rather than limits) having, in probability or
almost surely, asymptotically the same scalar observations as the random ones. In particular, these
scalar observations of deterministic equivalents

Evolution of signal variances and covariances in infinite-depth DNNs
The empirical success of over-parameterized neural networks has sparked a growing interest in the
theoretical understanding of these models. The large number of parameters – millions if not billions –
and the complex non-linear nature of the neural computations make this
hypothesis space highly non-trivial. However, in certain situations, increasing the number of parameters
has the effect of ‘placing’ the network in some ‘average’ regime that simplifies the theoretical analysis.
This is the case with the infinite-width asymptotics of random neural networks.

Infinite-width-then-infinite-depth limit: in this case, the width is taken to infinity first, then the depth is
taken to infinity. This is the infinite-depth limit of infinite-width neural networks. This limit was
notably used to derive the Edge of Chaos initialization scheme.

The joint infinite-width-and-depth limit: in this case, the depth-to-width ratio is fixed, and therefore, the
width and depth are jointly taken to infinity at the same time. Only a few works study the joint
width-depth limit.

Infinite-depth limit of finite-width neural networks: in both previous limits (the infinite-width-then-infinite-depth
limit and the joint infinite-width-and-depth limit), the width goes to infinity. Naturally, one
might ask what happens if the width is fixed and the depth goes to infinity. What is the limiting distribution of
the network output at initialization?

The empirical success of modern Deep Neural Networks (DNNs) has sparked a growing interest in the
theoretical understanding of these models. An important development in this direction was the
introduction of the Neural Tangent Kernel (NTK) framework

which provides a dual view of Gradient Descent (GD) in function space. The NTK is the dot product
kernel of Tangent Features (gradient features), given by

K(x, x′) = ∇_θ f(x) ∇_θ f(x′)^T,

where f is the network output and θ is the vector of model parameters. The NTK has been the subject of
an extensive body of literature, both in the NTK regime, where the NTK remains constant during training,
and in regimes where the kernel evolves with training (feature learning).
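As a hedged illustration (the tiny network, layer sizes, and inputs below are arbitrary choices, not taken from the works discussed here), an empirical NTK entry K(x, x′) can be computed directly from parameter gradients with automatic differentiation:

import torch
import torch.nn as nn

# Empirical NTK of a small scalar-output MLP: K(x, x') = <grad_theta f(x), grad_theta f(x')>.
torch.manual_seed(0)
net = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 1))

def param_grad(x):
    # Flattened gradient of the scalar output f(x) with respect to all parameters.
    out = net(x).squeeze()
    grads = torch.autograd.grad(out, list(net.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

x1, x2 = torch.randn(1, 5), torch.randn(1, 5)
g1, g2 = param_grad(x1), param_grad(x2)
print("K(x1, x1) =", torch.dot(g1, g1).item())
print("K(x1, x2) =", torch.dot(g1, g2).item())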

Role of hidden layers. In the context of DNNs, hidden layers act as successive embeddings that evolve in
data-dependent directions during training. However, it is still unclear how different layers contribute to
model performance. A possible approach in this direction is the study of feature learning in each layer,
i.e. how features adapt to data as we train the model. A simple way to do this is by measuring the
change in the values of pre-activations as we train the network.

For each layer l, they measured the alignment between the data labels and the layer tangent kernel
matrix K̂_l (where K_l(x, x′) = ∇_θl f(x) ∇_θl f(x′)^T, θ_l being the vector of parameters in the l-th layer).
The kernel K̂_l can be seen as the NTK of the network if only the parameters of the l-th layer are allowed
to change (the other layers are frozen, see Section 2.1 for a detailed discussion). The authors demonstrated
the existence of an alignment pattern in which the tangent features of some hidden layers are significantly
more aligned with the data labels compared to other layers.
In general, the alignment reaches its maximum for some hidden layer and tends to be minimal in the
external layers (first and last). To the best of our knowledge, no explanation has been provided for this
phenomenon in the literature. In this paper, we propose an explanation based on the theory of signal
propagation in randomly initialized neural networks. Our intuition is based on the observation that
tangent kernels can be decomposed as a Hadamard product of two quantities linked to how the signal
propagates in DNNs.

Random DNN
Deep Neural Networks (DNNs) are a fundamental tool in the modern development of
Machine Learning. Beyond the merits of the training algorithms, a great part of DNNs
success is due to the inherent properties of their layered architectures, i.e., to the
introduced architectural biases. In this tutorial, we explore recent classes of DNN
models wherein the majority of connections are randomized or more generally fixed
according to some specific heuristic.

Limiting the training algorithms to operate on a reduced set of weights implies intriguing
features. Among them, the extreme efficiency of the learning processes is undoubtedly
a striking advantage with respect to fully trained architectures. Besides, despite the
involved simplifications, randomized neural systems possess remarkable properties
both in practice, achieving state-of-the-art results in multiple domains, and theoretically,
allowing one to analyze intrinsic properties of neural architectures.

This tutorial covers all the major aspects regarding Deep Randomized Neural Networks,
from feed-forward and convolutional neural networks, to dynamically recurrent deep
neural systems for structures. The tutorial is targeted at both researchers and
practitioners, from academia and industry, who are interested in developing DNNs that can
be trained efficiently, and possibly embedded into low-power devices.

Randomization can enter the design of Deep Neural Nets in several disguises, including
training algorithms, regularization, etc. We focus on “Random-weights” neural
architectures where part of the weights are left untrained after initialization.

Because we touch on a number of different fields, we do not aim at a comprehensive
survey of the literature. Rather, we highlight general ideas and concepts by a careful
selection of papers and results, trying to convey the widest perspective. When possible,
we also highlight points that in our opinion have been under-explored in the literature,
and possible open areas of research. Finally, we consider a variety of types of data,
ranging from vectors to images and graph-based datasets.

The tutorial is structured as follows.


 Introduction: Preliminaries on Deep Learning, embedded applications,
complexity/accuracy trade-off, randomization in Neural Networks.
 Randomization in Feed-forward Neural Networks: Historical notes, Random
Projections, Intrinsic dimensionality, Random kitchen sinks, Random Vector
Functional Link.
 Deep Random-weights Neural Networks: Deep Randomized Networks for
analysis and with random Gaussian weights, Randomized autoencoders, Semi-
random features, Deep Image Prior, Direct Feedback Alignment.
 Randomization in Dynamical Recurrent Neural Networks: Motivations and
preliminaries on Recurrent Neural Networks and Neuromorphic Neural Networks,
Reservoir Computing, Stability Properties in Randomized Neural Networks,
Universality, Fundamental topics in Reservoir Computing (e.g., reservoir topology
and the edge of chaos), Applications.
 Deep Reservoir Computing: Deep Recurrent Neural Networks, Deep Echo
State Networks, Properties of Deep ESNs, Intrinsic richness in deep recurrent neural
systems, Applications, Fast and Deep Neural Networks for Graphs.

Jacobian Matrix
The Jacobian matrix and determinant are fundamental mathematical concepts that play a crucial
role in understanding the relationships between variables in machine learning models. In this blog
post, we’ll delve into these concepts, explaining what they are, how they are calculated, and their
significance in the context of machine learning. We’ll also provide practical examples and
demonstrate how to compute the Jacobian matrix and determinant using Python.

The Jacobian matrix is a mathematical construct that captures the relationships between multiple
input variables and multiple output variables in a system. It is particularly useful in scenarios
where a function or transformation has multiple inputs and multiple outputs. In the context of
machine learning, the Jacobian matrix helps us analyze how changes in the input variables affect
the output variables.

Calculating the Jacobian Matrix

To compute the Jacobian matrix, we need to calculate the partial derivatives of each output

variable with respect to each input variable. These partial derivatives capture the rate of change or

sensitivity of each output variable to changes in each input variable.


Let’s illustrate this with a simple example. Suppose we have a machine learning model that

predicts the price of a house based on its size (in square feet) and the number of bedrooms. The

model takes these two features as inputs and produces a single output, which is the predicted price

of the house.

To calculate the Jacobian matrix, we need to compute the partial derivatives of the output (price)

with respect to each input variable (size and number of bedrooms). Let’s assume we have the

following data:

House 1:

 Size: 1500 square feet

 Bedrooms: 3

 Price: $250,000

House 2:

 Size: 2000 square feet

 Bedrooms: 4

 Price: $320,000

To calculate the partial derivative with respect to size, we can calculate the change in price for a

small change in size while keeping the number of bedrooms constant. Let’s say we increase the

size by 100 square feet:


House 1 (increased size):

 Size: 1600 square feet

 Bedrooms: 3

 Price: $270,000

House 2 (increased size):

 Size: 2100 square feet

 Bedrooms: 4

 Price: $340,000

Now, we can calculate the partial derivative with respect to size for each house:

Partial derivative of price with respect to size (House 1): = (New Price — Old Price) / (New Size

— Old Size) = ($270,000 — $250,000) / (1600–1500) = $20,000 / 100 = $200 per square foot

Partial derivative of price with respect to size (House 2): = (New Price — Old Price) / (New Size

— Old Size) = ($340,000 — $320,000) / (2100–2000) = $20,000 / 100 = $200 per square foot

Similarly, we can calculate the partial derivative with respect to the number of bedrooms by

changing the number of bedrooms while keeping the size constant.


By calculating these partial derivatives for multiple data points, we can construct the Jacobian
matrix. In this case, the Jacobian matrix would be a 1x2 matrix (one output row and two input columns)
containing the partial derivatives we calculated.
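Following the post's promise of a Python demonstration, here is a minimal finite-difference sketch. The price function below is a made-up linear model (the $200-per-square-foot slope mirrors the worked example above; the bedroom coefficient is invented for illustration):

import numpy as np

# Hypothetical house-price model: price = 200 * size + 5000 * bedrooms + 20000.
# Any differentiable model (e.g. a trained neural network) could be used instead.
def price_model(x):
    size, bedrooms = x
    return np.array([200.0 * size + 5000.0 * bedrooms + 20000.0])

def jacobian(f, x, eps=1e-4):
    # Finite-difference Jacobian: J[i, j] = d f_i / d x_j.
    x = np.asarray(x, dtype=float)
    f0 = f(x)
    J = np.zeros((f0.size, x.size))
    for j in range(x.size):
        x_step = x.copy()
        x_step[j] += eps
        J[:, j] = (f(x_step) - f0) / eps
    return J

print(jacobian(price_model, [1500.0, 3.0]))   # -> [[ 200. 5000.]], a 1x2 Jacobian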

Understanding the Jacobian Matrix


The Jacobian matrix provides valuable insights into the relationships between the input and output

variables in a machine learning model. By examining the values in the Jacobian matrix, we can

understand how changes in the input variables impact the output variables. A large magnitude in a
Jacobian matrix entry indicates a strong influence, while a small magnitude indicates a weak influence.

For instance, in our example, if the partial derivative of price with respect to size has a large

magnitude, it suggests that the size of the house has a significant impact on its price prediction.

On the other hand, if the partial derivative of price with respect to the number of bedrooms has a

small magnitude, it indicates that the number of bedrooms has a relatively weak influence on the

price prediction.

Analyzing the Jacobian matrix can help us uncover important insights about feature importance,

model behavior, and optimization. It allows us to identify the most influential input variables and

make informed decisions about the model.

DNN Gaussian
A Neural Network Gaussian Process (NNGP) is a Gaussian process (GP) obtained as the limit

of a certain type of sequence of neural networks. Specifically, a wide variety of network

architectures converges to a GP in the infinitely wide limit, in the sense of distribution.

Bayesian networks are a modeling tool for assigning probabilities to events, and thereby

characterizing the uncertainty in a model's predictions. Deep learning and artificial neural

networks are approaches used in machine learning to build computational models which learn

from training examples. Bayesian neural networks merge these fields. They are a type of neural
network whose parameters and predictions are both probabilistic.[9][10] While standard neural
networks often assign high confidence even to incorrect predictions, Bayesian neural networks can
more accurately evaluate how likely their predictions are to be correct.

Computation in artificial neural networks is usually organized into sequential layers of artificial

neurons. The number of neurons in a layer is called the layer width. When we consider a

sequence of Bayesian neural networks with increasingly wide layers, they converge

in distribution to an NNGP. This large width limit is of practical interest, since networks often
improve as layers get wider, and the limiting process may give a closed-form way to evaluate networks.

NNGPs also appear in several other contexts: the NNGP describes the distribution over predictions made

by wide non-Bayesian artificial neural networks after random initialization of their parameters,

but before training; it appears as a term in neural tangent kernel prediction equations; it is used

in deep information propagation to characterize whether hyperparameters and architectures will

be trainable.[15] It is related to other large width limits of neural networks.

Every setting of a neural network's parameters corresponds to a specific function

computed by the neural network. A prior distribution over neural network parameters
therefore corresponds to a prior distribution over functions computed by the network. As neural
networks are made infinitely wide, this distribution over functions converges to a Gaussian
process for many architectures.

The notation used in this section is the same as the notation used below to derive the
correspondence between NNGPs and fully connected networks, and more details can be found
there.
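A quick empirical check of this limit (the widths, input dimension, and scalings below are arbitrary choices) is to sample many random one-hidden-layer ReLU networks and verify that the output at a fixed input looks increasingly Gaussian as the width grows:

import numpy as np

# The scalar output of a random one-hidden-layer ReLU network at a fixed input
# approaches a Gaussian as the width grows (excess kurtosis of a Gaussian is 0).
rng = np.random.default_rng(0)
x = rng.standard_normal(10)             # a fixed input in R^10
n_nets = 5000

for width in [5, 50, 500]:
    outputs = np.empty(n_nets)
    for i in range(n_nets):
        W1 = rng.standard_normal((width, x.size)) / np.sqrt(x.size)
        w2 = rng.standard_normal(width) / np.sqrt(width)
        outputs[i] = w2 @ np.maximum(W1 @ x, 0.0)
    z = (outputs - outputs.mean()) / outputs.std()
    print(f"width = {width:4d}   excess kurtosis = {np.mean(z ** 4) - 3:+.3f}")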

Orthogonal initialization
Orthogonal initialization initializes the weight matrix with an orthogonal
matrix. This method helps to avoid the vanishing and exploding
gradients problem that can occur in deep neural networks.

Here's an example of how to use orthogonal initialization in
Python:
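A minimal sketch using PyTorch's built-in initializer (the layer size and gain below are arbitrary choices):

import torch
import torch.nn as nn

# Orthogonal initialization of a linear layer's weight matrix.
layer = nn.Linear(256, 256)
nn.init.orthogonal_(layer.weight, gain=1.0)

# Check (approximate) orthogonality: W @ W.T should be close to the identity.
W = layer.weight.detach()
print(torch.allclose(W @ W.T, torch.eye(256), atol=1e-4))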

How do better weight initialization methods address the shortcomings of basic methods?
The better weight initialization methods, such as Xavier initialization,
He initialization, LeCun initialization, and orthogonal initialization,
address the shortcomings of basic methods by taking into account the
number of input and output units of each layer, and by being
designed to prevent the vanishing and exploding gradients problem.
Randomized ReLU Initialization
Randomized ReLU initialization is a method that can be used with
ReLU and its variants, such as leaky ReLU and ELU, to improve the
performance of deep neural networks. The basic idea behind
randomized ReLU initialization is to randomly set a fraction of the
ReLU units to zero during initialization. This encourages the remaining
units to learn different features, which can help prevent overfitting and
improve the generalization performance of the model.
Specifically, during initialization, a fraction of the ReLU units are
randomly set to zero with probability p, while the remaining units are
initialized with the standard ReLU initialization method, such as He
initialization. The value of p can be tuned to achieve the best
performance on the specific task. Randomized ReLU initialization has
been shown to be effective in improving the performance of deep
neural networks on a variety of tasks, including image classification
and object detection.
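A minimal sketch of the scheme as described above (this follows the text's description only; it is not a standard library routine, and the layer sizes and drop probability are arbitrary):

import numpy as np

# Randomized ReLU initialization as described above: He-initialize the layer, then
# zero out the incoming weights of a random fraction p of the units so that those
# units start inactive.
def randomized_relu_init(fan_in, fan_out, p=0.2, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    W = rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / fan_in)   # He initialization
    dropped = rng.random(fan_out) < p          # each unit dropped with probability p
    W[dropped, :] = 0.0
    return W

W = randomized_relu_init(128, 64, p=0.25, rng=np.random.default_rng(0))
print("fraction of zeroed units:", np.mean(np.all(W == 0.0, axis=1)))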

UNIT-3
DL through the Lens of Information Theory

In the case of deep learning, the most common use case for
information theory is to characterize probability distributions and to
quantify the similarity between two probability distributions. We use
concepts like KL divergence and Jensen-Shannon divergence for
these purposes.

Information theory is based on probability theory and statistics


and often concerns itself with measures of information of the
distributions associated with random variables. Important
quantities of information are entropy, a measure of information
in a single random variable, and mutual information, a measure
of information in common between two random variables.
Information theory revolves around quantifying how much
information is present in a signal. It was originally invented to
study sending messages from discrete alphabets over a noisy
channel, such as communication via radio transmission. The
basic intuition behind information theory is that learning that
an unlikely event has occurred is more informative than
learning that a likely event has occurred.
In the early 20th century, computer scientists and mathematicians around the
world were faced with a problem. How to quantify information? Let’s consider
the two sentences below:

 It is going to rain tomorrow


 It is going to rain heavily tomorrow

We as humans intuitively understand that the two sentences transmit different


amounts of information. The second one is much more informative than the
first. But, how do you quantify it? How do you express that in the language of
mathematics?

Enter Claude E. Shannon with his seminal paper “A Mathematical Theory of


Communication”[1]. Shannon introduced the qualitative and quantitative
model of communication as a statistical process. Among many other ideas
and concepts, he introduced Information Entropy, and the concept of ‘bit’-the
fundamental unit of measurement of Information.

Information Theory is quite vast, but I’ll try to summarise key bits of
information(pun not intended) in a short glossary.

 Information is considered to be a sequence of symbols to be transmitted


through a medium, called a channel.
 Entropy is the amount of uncertainty in a string of symbols given some
knowledge of the distribution of symbols.
 Bit is a unit of information, or a sequence of symbols.
 To transfer 1 bit of information means reducing the uncertainty of the
recipient by a factor of 2
 For example, in a coin toss, before we see the coin land, there are two
possible outcomes. Once the coin lands and we find that it is heads, the
possible outcomes reduce to just one. So in this situation, the information
transferred by the coin toss is 1 bit.

1. Entropy

The Shannon entropy H, in units of bits (per symbol) is given by

H = −∑_i p_i log2(p_i)

where p_i is the probability of occurrence of the i-th possible value of the
source symbol. Intuitively, the entropy H(X) of a random variable X gives us a
measure of the amount of uncertainty associated with the value of X when only
its distribution is known. Below is an example of the entropy of a Bernoulli
trial as a function of the success probability (called the binary entropy function):
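Since the plot is not reproduced here, the following small sketch evaluates the binary entropy function numerically (the probabilities are chosen arbitrarily):

import numpy as np

# Binary entropy H(p) = -p*log2(p) - (1-p)*log2(1-p) of a Bernoulli trial.
def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)     # avoid log(0)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

for p in [0.01, 0.1, 0.5, 0.9, 0.99]:
    print(f"p = {p:4.2f}   H(p) = {binary_entropy(p):.4f} bits")
# The maximum, 1 bit, is reached at p = 0.5 (an unbiased coin toss).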
We can see that this function is maximized when the two possible outcomes are
equally probable, for example, like in an unbiased coin toss. In general, for any
random variable X, the entropy H(X) is defined as

H(X) = E_X[I(x)] = −∑_{x∈𝒳} p(x) log p(x)

where 𝒳 is the set of all messages {x1, ..., xn} that X could take, and I(x) is
the self-information, which is the entropy contribution of an individual message.
As is evident from the Bernoulli case above, the entropy of a space is maximized
when all the messages in the space are equally probable.

2. Joint Entropy - This concept of entropy applies when there are two random
variables (say X and Y). The joint entropy of (X, Y) is defined as:

H(X, Y) = E_{X,Y}[−log p(x, y)] = −∑_{x,y} p(x, y) log p(x, y)

This entropy is equal to the sum of the individual entropies of X and Y when
they are independent. I am sure you might have heard about cross entropy a lot
while studying loss functions and optimization in deep learning, and I want to
emphasize that it is different from joint entropy: you should not confuse the
joint entropy of two random variables with cross entropy.

3. Conditional Entropy

The conditional entropy (also called conditional uncertainty) of a random
variable X given a random variable Y is the average conditional entropy over Y:

H(X|Y) = E_Y[H(X|y)] = −∑_y p(y) ∑_x p(x|y) log p(x|y) = −∑_{x,y} p(x, y) log p(x|y)

One basic property that the conditional entropy satisfies is given by:

H(X|Y) = H(X, Y) − H(Y)

4. Mutual Information

Using mutual information, we can quantify the amount of information that can be
obtained for one random variable using the information from the other. The mutual
information of X relative to Y is given by:

I(X; Y) = E_{X,Y}[SI(x, y)] = ∑_{x,y} p(x, y) log [p(x, y) / (p(x)p(y))]

where SI(x, y) is the Specific Mutual Information (also called the pointwise
mutual information).

A basic property of the mutual information is that

I(X; Y) = H(X) − H(X|Y)

Intuitively, this means that knowing Y, we can save an average of I(X; Y) bits in
encoding X compared to not knowing Y.

Mutual information is symmetric:

I(X; Y) = I(Y; X) = H(X) + H(Y) − H(X, Y).
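A small worked example (the joint table below is made up) that checks the identity I(X; Y) = H(X) + H(Y) − H(X, Y) numerically:

import numpy as np

# Mutual information of two discrete variables from a joint probability table.
# Rows index X, columns index Y; entries sum to 1.
p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])
p_x = p_xy.sum(axis=1)                  # marginal of X
p_y = p_xy.sum(axis=0)                  # marginal of Y

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))      # entropy in bits

I_xy = H(p_x) + H(p_y) - H(p_xy.ravel())            # I(X;Y) = H(X) + H(Y) - H(X,Y)
I_direct = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))   # direct definition
print(f"I(X;Y) = {I_xy:.4f} bits  (direct: {I_direct:.4f})")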
5- Markov Chain
A Markov property (also called the memoryless property) of a
stochastic process refers to the dependence of conditional
probability distributions of future states of the process only on
the present states and not on the past states. A process with
this property is called a Markov process. A Markov Chain is a
type of Markov process that has a discrete state space.

6- Data Processing Inequality


For any Markov chain X → Y → Z, we would have I(X; Y) ≥ I(X; Z). A deep neural
network can be viewed as a Markov chain where information is passed from one
layer to the next, and thus, when we are moving down the layers of a DNN, the
mutual information between the layer and the input can only decrease.

7- Reparametrization invariance

For two invertible functions ϕ and ψ, the mutual information is preserved:
I(X; Y) = I(ϕ(X); ψ(Y)). In the DNN context, this means that if we shuffle the
weights in one layer of a DNN, it would not affect the mutual information between
this layer and the other layers. This is bad news for anyone who is trying to use
mutual information to capture computational complexity, because it means that
creating any hard-to-invert transformation of our data would not affect the
information measures, but it would increase the computational complexity greatly.

Now that we have summarized all the relevant terms and


concepts to get familiar with information theory, let us move
on to the task of linking this to deep learning and how it might
be useful in the field.

Kullback-Leibler Divergence
Now that we have understood Entropy and Cross Entropy, we are in a
position to understand another concept with a scary name. The name is so
scary that even practitioners call it KL Divergence instead of the actual
Kullback-Leibler Divergence. This monster from the depths of information
theory is a pretty simple and easy concept to understand.

KL Divergence measures how much one distribution diverges from another. In


our example, we use the coding scheme we used for communication from
India to Australia to do the reverse communication also. In machine learning,
this can be thought of as trying to approximate a true distribution (p) with a
predicted distribution (q). By doing this we are spending more bits than
necessary for sending the message; in other words, we incur some loss because we
are approximating p with q. This loss of information is called the KL Divergence.

Two key properties of KL Divergence are worth noting:

1. KL Divergence is non-negative. i.e. It is always zero or a positive number.


2. KL Divergence is non-symmetric. i.e. KL Divergence from p to q is not
equal to KL Divergence of q to p. And because of this, it is not strictly a
distance.

The Math
Let’s take a look at how we arrive at the formula, which will enhance our
understanding of KL Divergence.

We know from our intuition that KL(P||Q) is the information lost when
approximating P with Q. What would that be?

Just to recap the analogy of our example to an ML system to help you connect
the dots better. Let’s use the situation where Australia is sending messages to
India using the same coding scheme as was devised for messages from India
to Australia.

 The boxes with balls (in India and Australia) are probability distributions
 When we prepare a machine learning model (classification or regression),
what we are doing is approximating a probability distribution and, in most
cases, outputting the expected value of that distribution.
 Devising a coding scheme based on a probability distribution is like
preparing a model to output that distribution.
 The box with balls in India is the predicted distribution, because this is the
distribution which we assume when we devise the coding scheme.
 The box with the balls in Australia is the true distribution, because this is
where the coding scheme is used.

We know the expected number of bits when we use the coding scheme in
Australia. It’s the Cross Entropy of Q w.r.t. P (where P is the predicted
distribution and Q is the true distribution).
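To connect the pieces, here is a small numerical sketch (the two distributions are made up) showing that the cross entropy equals the true distribution's entropy plus the KL divergence:

import numpy as np

# q is the true distribution, p is the predicted (approximating) distribution.
q = np.array([0.50, 0.25, 0.15, 0.10])   # true
p = np.array([0.40, 0.30, 0.20, 0.10])   # predicted

entropy_q = -np.sum(q * np.log2(q))            # H(Q)
cross_entropy = -np.sum(q * np.log2(p))        # H(Q, P): bits paid using the code built for p
kl_q_p = np.sum(q * np.log2(q / p))            # D_KL(Q || P): the extra bits

print(f"H(Q)         = {entropy_q:.4f} bits")
print(f"H(Q, P)      = {cross_entropy:.4f} bits")
print(f"D_KL(Q || P) = {kl_q_p:.4f} bits")
print(f"H(Q) + D_KL  = {entropy_q + kl_q_p:.4f} bits  (equals the cross entropy)")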

Categorical Cross Entropy Loss


In an N-way classification problem, the neural network typically has N output
nodes (with an exception for binary classification, which has just one node).
These nodes’ outputs are also called the logits. Logits are the real-valued
outputs of a Neural Network before we apply an activation. If we pass these
logits through a softmax layer, the outputs will be transformed into something
resembling the probability (statisticians will be turning in their graves now) that
the sample belongs to that particular class.

The benefit of Stochastic Relaxation

In order for us to understand the dynamics of the drift and the


diffusion phase, Tishby talks about the role of noise induced by
the mini-batches in the SGD process. He says that since there
are only a small number of relevant dimensions (the features
that actually change the label) out of a large number of total
dimensions in the input space, for example, in a face
recognition task, only a small number of features such as the
distance between the eyes, the size of the nose, etc. are
important. All the other million dimensions are not relevant.
This results in a covariance matrix of the noise which is quite
small in the directions of the relevant features and elongated in the directions
of the irrelevant features, and this property is good rather than bad. This is
because during the drift phase we quickly converge in the direction of the
relevant dimensions and stay there. But then we start doing random walks in the
remaining millions of irrelevant dimensions, and their total averaged effect
increases the entropy in the irrelevant dimensions of the problem. The point that
he tries to emphasize with this discussion is that the time it takes to increase
the entropy is exponential in the entropy, which explains why the diffusion phase
is so slow compared to the drift phase. But this phase is also what drives the
compression of the representation.

The benefit of multiple Hidden Layers

As discussed above, the time taken in the diffusion phase is


very large and it takes exponential time for every bit of
compression. What multiple hidden layers do for us is run the Markov chain using
parallel diffusion processes. Noise is added to all these layers, and hence they
start to push each other: there is a boost in the time as they all help each
other in compression. So there is an exponential improvement in running time as
we move from a sum of exponential time terms to the maximum of these terms. In
this way, the layers help each other, and each layer has to compress only from
the point where the previous layer forgot.
Implementation in Popular Deep Learning Frameworks
The way these losses are implemented in the popular Deep Learning
Frameworks like PyTorch and Tensorflow are a little confusing. This is
especially true for PyTorch.

PYTORCH

 binary_cross_entropy – This expects the logits to be passed through a


sigmoid layer before computing the loss
 binary_cross_entropy_with_logits – This expects the raw outputs or the
logits to be passed to the loss. The loss implementation applies a sigmoid
internally
 cross_entropy – This expects the logits as the input and it applies a
softmax(technically a log softmax) before calculating the entropy loss
 nll_loss – This is the plain negative log likelihood loss. It expects
log-probabilities as the input, i.e. the logits should be passed through a log
softmax layer before computing the loss
 kl_div – This expects the input to be log-probabilities (logits passed through
a log softmax); the target is expected to be probabilities by default
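A quick sanity check of these conventions (random tensors and arbitrary sizes):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 5)                 # batch of 8 samples, 5 classes
labels = torch.randint(0, 5, (8,))

ce = F.cross_entropy(logits, labels)                        # takes raw logits
nll = F.nll_loss(F.log_softmax(logits, dim=1), labels)      # takes log-probabilities
print(torch.allclose(ce, nll))                              # True: same quantity

# Binary case: bce_with_logits(logits) == bce(sigmoid(logits))
blogits = torch.randn(8)
targets = torch.rand(8)
a = F.binary_cross_entropy_with_logits(blogits, targets)
b = F.binary_cross_entropy(torch.sigmoid(blogits), targets)
print(torch.allclose(a, b))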

TENSORFLOW 2.0 / KERAS

All the Tensorflow 2.0 losses expect probabilities as the input by default, i.e.
the logits should be passed through a Sigmoid or Softmax before being fed to the
losses. But they provide a parameter from_logits which, when set to True, accepts
logits and does the conversion to probabilities inside.

 BinaryCrossentropy – Calculates the BCE loss for


a y_true and y_pred. y_true and y_pred are single dimensional tensors – a
single value for each sample.
 CategoricalCrossentropy – Calculates the Cross Entropy for
a y_true and y_pred. y_true is a one-hot representation of labels
and y_pred is a multi-dimensional tensor – # of classes values for each
sample
 SparseCategoricalCrossentropy – Calculates the Cross Entropy for
a y_true and y_pred. y_true is a single dimensional tensor-a single value
for each sample- and y_pred is a multi-dimensional tensor – # of classes
values for each sample
 KLDivergence – Calculates KL Divergence from y_pred to y_true. This is
an exception in the sense that this always expects probabilities as the
input.
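For example (the values are arbitrary), the two calls below should produce the same loss value:

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])
labels = tf.constant([0])

loss_from_logits = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss_from_probs = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

print(float(loss_from_logits(labels, logits)))
print(float(loss_from_probs(labels, tf.nn.softmax(logits))))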

MUTUAL INFORMATION ESTIMATION IN DNN

Information theory concepts are leveraged with the goal of better


understanding and improving Deep Neural Networks (DNNs). The
information plane of neural networks describes the behavior during
training of the mutual information at various depths between
input/output and hidden-layer variables. Previous analysis revealed that
most of the training epochs are spent on compressing the input, in some
networks where finiteness of the mutual information can be established.
However, the estimation of mutual information is nontrivial for high-
dimensional continuous random variables. Therefore, the computation of
the mutual information for DNNs and its visualization on the information
plane mostly focused on low-complexity fully connected networks. In
fact, even the existence of the compression phase in complex DNNs has
been questioned and viewed as an open problem. In this paper, we
present the convergence of mutual information on the information plane
for a high-dimensional VGG-16 Convolutional Neural Network (CNN) by
resorting to Mutual Information Neural Estimation (MINE), thus
confirming and extending the results obtained with low-dimensional
fully connected networks. Furthermore, we demonstrate the benefits of
regularizing a network, especially for a large number of training epochs,
by adopting mutual information estimates as additional terms in the loss
function characteristic of the network.

The Information Bottleneck in DNNs

The mutual information is a measure of the mutual dependence of two


random variables. In essence, it measures the relationship between them
and may be regarded as the reduction in uncertainty in a random
variable given the knowledge available about another one. The mutual
information between two discrete random variables X and Y is defined as

I(X; Y) = ∑_{x∈X} ∑_{y∈Y} p(x, y) log [p(x, y) / (p(x)p(y))]

where p(x, y) is the joint probability distribution and p(x) and p(y) are
the marginal probability distributions. The mutual information may also
be expressed as the Kullback–Leibler (KL) divergence between the joint
distribution and the product of the marginal distributions of the random
variables X and Y
I(X; Y) = D_KL(p(x, y) ∥ p(x)p(y))
Computing the mutual information of two random variables is in general
challenging. Exact computation can be done in the discrete setting,
where the sum in Equation (1) can be calculated exhaustively, or in the
continuous setting, provided the probability distributions are known in
advance. Several methods to estimate the mutual information have been
introduced. The most common ones are non-parametric, including
discretizing the values using binning and resorting to non-parametric
kernel-density estimators

The Information Bottleneck (IB) method was introduced to find a maximally compressed
representation of an input random variable, X, which is obtained as a
function of a relevant random variable, Ψ, such that it preserves as much
information as possible about Ψ. Let us denote by X̂ the compressed
version of the input random variable X, parameterized by θ. What the IB
method effectively does is solve the following optimization problem.
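In its standard Lagrangian form (stated here from the general IB literature, since the equation itself is not reproduced above), the problem reads: minimize over the encoder p(x̂ | x) the objective I(X; X̂) − β I(X̂; Ψ), where β > 0 trades off the compression of X against the preservation of information about Ψ.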
The goal of any supervised learning method is to efficiently capture the
relevant information about an input random variable, typically a label for
classification tasks, in the output random variable Ŷ. Let us consider a
DNN for image recognition, with input X, output Ŷ, and i-th hidden layer
denoted by h_i. The classification task is related to the interpretation of an
image that is generated from a relevant random variable Y.
In case the hidden layer h_i only processes the output of the previous
layer, h_{i−1}, the layers form a Markov chain of successive internal
representations of the input. By applying the Data Processing Inequality
(DPI), the mutual information between the input and the successive layer
representations can only decrease as we move deeper into the network.

DNNs may be analyzed through mutual information values on the


information plane. However, estimating mutual information across
layers in DNNs is a nontrivial task, as the outputs of the hidden layer
neurons are continuous-valued high-dimensional random vectors.

A further difficulty arises if the input X is a continuous random variable


and the neural network is deterministic.
Here I(X; h_i) = ∞ for commonly used activation functions. For the image
classification problem considered here, one might argue that the input
random variable is a discrete random variable, where the pixels have a
discrete distribution, and h_i is also a discrete random variable given by a
deterministic transformation of X via finite-precision arithmetic. The
training and test sets, however, have a cardinality that is typically much
lower than the alphabet size of X, thus rendering the estimation of the
mutual information very difficult.

Non-Parametric Estimation of Mutual Information


The most common methods for the estimation of mutual information are
non-parametric. As our focus is on CNNs, we consider a VGG-16 network
to evaluate the effectiveness of non-parametric estimation methods. The
loss function adopted to train the network is the cross-entropy loss
obtained from the softmax output probabilities, that is
L_CE = −∑_{m=1}^{M} y_m log p_m

where ym is the binary value corresponding to the m-th class of a one-hot


encoded vector defined by the class labels, pm is the softmax probability
output for class m, and M is the number of classes in the classification
task. In all experiments the dataset

Activation Binning

The mutual-information estimation method adopted in earlier work resorts to
binning the activations of each layer into equal-sized bins. Binning
ensures that the values are discretized, which makes it possible to calculate the
joint distribution and the marginal distributions by counting the activations
in each bin.
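A minimal sketch of this binning estimator (random activations stand in for real ones; the bin count and sizes are arbitrary choices):

import numpy as np

# Binning-based estimate of I(X; T) between the input index X and a layer's
# activations T. In practice the activations come from forward passes of the
# network on the training set; here they are random tanh values.
rng = np.random.default_rng(0)
n_samples, n_units, n_bins = 1000, 20, 30

activations = np.tanh(rng.standard_normal((n_samples, n_units)))

# Discretize each activation vector by binning, then map it to a single symbol T.
edges = np.linspace(-1, 1, n_bins + 1)
binned = np.digitize(activations, edges)
_, t_symbols = np.unique(binned, axis=0, return_inverse=True)

def entropy(symbols):
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# With one deterministic activation pattern per input, H(T|X) = 0,
# so the binned estimate of I(X; T) reduces to H(T).
print("estimated I(X; T) ≈ H(T) =", entropy(t_symbols), "bits")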

Despite the promising results reported, this method has a number of
limitations as a general method for mutual-information estimation.
Firstly, the experiments were conducted using a fully connected
network with only a few neurons, which has far fewer synaptic weights
and neurons than typical CNN architectures. Another shortcoming is the
usage of the tanh function as the non-linear activation, which
bounds the activations between −1 and 1. The ReLU activation function is
more commonly used and allows unbounded positive activations. This
limitation has been pointed out by authors who showed counterexamples
to the compression phase when using ReLU non-linearities.

We design a provably accurate estimator of I(X; T_ℓ) inspired by our recent work on
differential entropy estimation. Given a feature dataset X = {x_i}, i ∈ [n], drawn i.i.d. according
to P_X, our I(X; T_ℓ) estimator is constructed from estimators of h(T_ℓ) and h(T_ℓ | X = x),
for all x ∈ X. We propose a sampling method that reduces the estimation of each
entropy to that framework. Using entropy estimation risk bounds derived
therein, we control the error of our sample propagation (SP) estimator Î_SP of I(X;
T_ℓ). Each differential entropy is estimated and computed via a two-step process.
First, we approximate each true entropy by the differential entropy of a known
Gaussian mixture (defined only through the available resources: the obtained
samples and the noise parameter). This estimate converges to the true value at
the parametric estimation rate, uniformly in the dimension. However, since the
Gaussian mixture entropy has no closed-form expression, in the second
(computational) step we develop a Monte Carlo integration (MCI) method to
numerically evaluate it. Mean squared error (MSE) bounds on the MCI computed
value are established.

Rényi’s α-Entropy
A multivariate matrix-based Rényi α-entropy method was proposed
for application to a LeNet-5 network.
networks, as each CNN layer has several channels, which all contain
some information on the input and output random variables. However,
two distinct channels of a single layer can contain the same information
on the input and output random variables. Therefore, summing up the
mutual information estimates between each channel and the input or
output random variable only gives an upper bound on the mutual
information that has little relevance to the true mutual information
value. The experiments for LeNet-5 result in mutual information
estimates for the various layers that satisfy the DPI. However, our
experiments for VGG-16, which has up to 512 channels, did not yield
estimates that comply with the DPI.
