Deep Learning Notes
UNIT-1
Deep learning is a branch of machine learning based on artificial neural
networks. It is capable of learning complex patterns and relationships within
data, and we do not need to explicitly program every rule. It has
become increasingly popular in recent years due to advances in processing
power and the availability of large datasets. Deep learning is built on artificial
neural networks (ANNs), also known as deep neural networks (DNNs). These
neural networks are inspired by the structure and function of the human brain’s
biological neurons, and they are designed to learn from large amounts of data.
1. Deep Learning is a subfield of Machine Learning that involves the use of
neural networks to model and solve complex problems. Neural networks are
modeled after the structure and function of the human brain and consist of
layers of interconnected nodes that process and transform data.
2. The key characteristic of Deep Learning is the use of deep neural networks,
which have multiple layers of interconnected nodes. These networks can
learn complex representations of data by discovering hierarchical patterns
and features in the data. Deep Learning algorithms can automatically learn
and improve from data without the need for manual feature engineering.
3. Deep Learning has achieved significant success in various fields, including
image recognition, natural language processing, speech recognition, and
recommendation systems. Some of the popular Deep Learning architectures
include Convolutional Neural Networks (CNNs), Recurrent Neural Networks
(RNNs), and Deep Belief Networks (DBNs).
4. Training deep neural networks typically requires a large amount of data and
computational resources. However, the availability of cloud computing and
the development of specialized hardware, such as Graphics Processing
Units (GPUs), has made it easier to train deep neural networks.
In summary, Deep Learning is a subfield of Machine Learning that involves the
use of deep neural networks to model and solve complex problems. Deep
Learning has achieved significant success in various fields, and its use is
expected to continue to grow as more data and more powerful computing
resources become available.
Dataset
The dataset is the first ingredient in training a neural network — the data itself along with the
problem we are trying to solve define our end goals. For example, are we using neural networks
to perform a regression analysis to predict the value of homes in a specific suburb in 20 years? Is
our goal to perform unsupervised learning, such as dimensionality reduction? Or are we trying to
perform classification?
In the context of this book, we’re strictly focusing on image classification; however, the
combination of your dataset and the problem you are trying to solve influences your choice in
loss function, network architecture, and optimization method used to train the model. Usually,
we have little choice in our dataset (unless you’re working on a hobby project) — we are given a
dataset with some expectation on what the results from our project should be. It is then up to us
to train a machine learning model on the dataset to perform well on the given task.
Loss Function
Given our dataset and target goal, we need to define a loss function that aligns with the problem
we are trying to solve. In nearly all image classification problems using deep learning, we’ll be
using cross-entropy loss. For >2 classes we call this categorical cross-entropy. For two class
problems, we call the loss binary cross-entropy.
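As a concrete illustration, here is a small NumPy sketch (not part of the original notes) of the two losses named above; the labels and predicted probabilities are made up.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # mean over samples of -sum_k y_true[k] * log(y_pred[k]), y_true one-hot
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # mean of -[y*log(p) + (1-y)*log(1-p)] for two-class problems
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Three classes, two samples (categorical case)
y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
print(categorical_cross_entropy(y_true, y_pred))   # ~0.43

# Two-class problem (binary case)
print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.8])))
```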
Model/Architecture
Your network architecture can be considered the first actual “choice” you have to make as an
ingredient. Your dataset is likely chosen for you (or at least you’ve decided that you want to
work with a given dataset). And if you’re performing classification, you’ll in all likelihood be
using cross-entropy as your loss function.
However, your network architecture can vary dramatically, as can the optimization method you
choose to train your network with. After taking the time to explore your dataset and study its
characteristics, you should start to develop a “feel” for the network architecture you are going to use. This takes
practice as deep learning is part science, part art.
Keep in mind that the number of layers and nodes in your network architecture (along with any
type of regularization) is likely to change as you perform more and more experiments. The more
results you gather, the better equipped you are to make informed decisions on which techniques
to try next.
Optimization Method
Despite the availability of newer optimization methods, SGD is still the workhorse of deep learning
— most neural networks are trained via SGD, including the networks obtaining state-of-the-art
accuracy on challenging image datasets such as ImageNet.
When training deep learning networks, especially when you’re first getting started and learning
the ropes, SGD should be your optimizer of choice. You then need to set a proper learning rate
and regularization strength, the total number of epochs the network should be trained for, and
whether or not momentum (and if so, which value) or Nesterov acceleration should be used.
Take the time to experiment with SGD as much as you possibly can and become comfortable
with tuning the parameters.
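A minimal sketch, assuming PyTorch, of the SGD knobs mentioned above (learning rate, regularization strength, momentum, Nesterov acceleration); the model and the specific values are placeholders, not recommendations.

```python
import torch
import torch.nn as nn

# Placeholder model for a small image-classification task
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 10))

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,            # learning rate
    momentum=0.9,       # set to 0.0 to disable momentum
    nesterov=True,      # Nesterov acceleration (requires momentum > 0)
    weight_decay=5e-4,  # L2 regularization strength
)

criterion = nn.CrossEntropyLoss()
# Typical training step for a batch (x, y):
# optimizer.zero_grad(); loss = criterion(model(x), y); loss.backward(); optimizer.step()
```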
Becoming familiar with a given optimization algorithm is similar to mastering how to drive a car
— you drive your own car better than other people’s cars because you’ve spent so much time
driving it; you understand your car and its intricacies. Oftentimes, a given optimizer is chosen to
train a network on a dataset not because the optimizer itself is better, but because the driver (i.e.,
deep learning practitioner) is more familiar with the optimizer and understands the “art” behind
tuning its respective parameters.
Keep in mind that obtaining a reasonably performing neural network on even a small/medium
dataset can take 10’s to 100’s of experiments even for advanced deep learning users — don’t be
discouraged when your network isn’t performing extremely well right out of the gate. Becoming
proficient in deep learning will require an investment of your time and many experiments — but
it will be worth it once you master how these ingredients come together.
A Formal Proof of the Expressiveness of Deep Learning
Deep learning has had a profound impact on computer science in recent years, with applications to
image recognition, language processing, bioinformatics, and more. Recently, Cohen et al. provided
theoretical evidence for the superiority of deep learning over shallow learning. We formalized their
mathematical proof using Isabelle/HOL. The Isabelle development simplifies and generalizes the original
proof, while working around the limitations of the HOL type system. To support the formalization, we
developed reusable libraries of formalized mathematics, including results about the matrix rank, the
Borel measure, and multivariate polynomials as well as a library for tensor analysis.
Isabelle [33,34] is a generic proof assistant that supports many object logics. The metalogic is based on
an intuitionistic fragment of Church’s simple type theory [14]. The types are built from type variables α,
β, … and n-ary type constructors, normally written in postfix notation (e.g., α list). The infix type
constructor α ⇒ β is interpreted as the (total) function space from α to β. Function applications are
written in a curried style (e.g., f x y). Anonymous functions x ↦ y_x are written λx. y_x. The notation t :: τ
indicates that term t has type τ . Isabelle/HOL is an instance of Isabelle. Its object logic is classical higher-
order logic supplemented with rank-1 polymorphism and Haskell-style type classes. The distinction
between the metalogic and the object logic is important operationally but not semantically. Isabelle’s
architecture follows the tradition of the theorem prover LCF [21] in implementing a small inference
kernel that verifies the proofs. Trusting an Isabelle proof involves trusting this kernel, the formulation of
the main theorems, the assumed axioms, the compiler and runtime system of Standard ML, the
operating system, and the hardware. Specification mechanisms help us define important classes of types
and functions, such as inductive datatypes and recursive functions, without introducing axioms. Since
additional axioms can lead to inconsistencies, it is generally good style to use these mechanisms. Our
formalization is mostly written in Isar [43], a proof language designed to facilitate the development of
structured, human-readable proofs. Isar proofs allow us to state intermediate proof steps and to nest
proofs. This makes them more maintainable than unstructured tactic scripts, and hence more
appropriate for substantial formalizations. Isabelle locales are a convenient mechanism for structuring
large proofs. A locale fixes types, constants, and assumptions within a specified scope. For example, an
informal mathematical text stating “in this section, let A be a set of natural numbers and B a subset of A”
could be formalized by introducing a locale AB_subset that fixes the sets A and B and assumes B ⊆ A.
Mathematical Preliminaries
We provide a short introduction to tensors and the Lebesgue measure. We expect familiarity with basic
matrix and polynomial theory.
Tensors. Tensors can be understood as multidimensional arrays, with vectors and matrices as the
one- and two-dimensional cases. Each index corresponds to a mode of the tensor. For matrices, the
modes are called “row” and “column”. The number of modes is the order of the tensor. The number
of values an index can take in a particular mode is the dimension in that mode. Thus, a real-valued
tensor A ∈ ℝ^(M₁×···×M_N) of order N and dimension Mᵢ in mode i contains values
A_{d₁,...,d_N} ∈ ℝ for dᵢ ∈ {1, ..., Mᵢ}. As for vectors and matrices, addition + is defined as componentwise
addition for tensors of identical dimensions. The tensor product ⊗ : ℝ^(M₁×···×M_N) × ℝ^(M_{N+1}×···×M_{N′}) →
ℝ^(M₁×···×M_{N′}) is a binary operation that generalizes the outer vector product. For real tensors, it is
associative and distributes over addition. The canonical polyadic rank, or CP-rank, associates a natural
number with a tensor, generalizing the matrix rank. The matricization [A] of a tensor A is a matrix
obtained by rearranging A’s entries using a bijection between the tensor and matrix entries; a key
property is that the matrix rank of [A] is a lower bound on the CP-rank of A.
The Lebesgue measure is a mathematical description of the intuitive concept of length, surface, or
volume. It extends this concept from simple geometrical shapes to a large class of subsets of ℝⁿ,
including all closed and open sets, although it is impossible to design a measure that caters for all
subsets of Rn while maintaining intuitive properties. The sets to which the Lebesgue measure can assign
a volume are called measurable. The volume that is assigned to a measurable set can be a nonnegative
real number or ∞. A set of Lebesgue measure 0 is called a null set. If a property holds for all points in ℝⁿ
except for a null set, the property is said to hold almost everywhere.
The following lemma [13] about polynomials will be useful for the proof of the fundamental theorem of
network capacity.
Lemma 2 If p ≢ 0 is a polynomial in d variables, the set of points x ∈ ℝᵈ with p(x) = 0 is a Lebesgue null
set.
Types of Data
Deep learning can be applied to any data type. The data types you work with,
and the data you gather, will depend on the problem you’re trying to solve.
Use Cases
Deep learning can solve almost any problem of machine perception, including
classifying data, clustering it, or making predictions about it.
Classification: this image represents a horse; this email looks like spam; this transaction is fraudulent.
Clustering: these two sounds are similar; this document is probably what user X is looking for.
Predictions: given their web log activity, Customer A looks like they are going to stop using your service.
Deep learning is best applied to unstructured data like images, video, sound or
text. An image is just a blob of pixels, a message is just a blob of text. This data
is not organized in a typical, relational database by rows and columns. That
makes it more difficult to specify its features manually.
Common use cases for deep learning include sentiment analysis, classifying
images, predictive analytics, recommendation systems, anomaly detection and
more.
Data Attributes
For deep learning to succeed, your data needs to have certain characteristics.
Relevancy
The data you use to train your neural net must be directly relevant to your
problem; that is, it must resemble as much as possible the real-world data you
hope to process. Neural networks are born as blank slates, and they only learn
what you teach them. If you want them to solve a problem involving certain kinds
of data, like CCTV video, then you have to train them on CCTV video, or
something similar to it. The training data should resemble the real-world data that
they will classify in production.
Proper Classification
If a client wants to build a deep-learning solution that classifies data, then they
need to have a labeled dataset. That is, someone needs to apply labels to the
raw data: “This image is a flower, that image is a panda.” With time and tuning,
this training dataset can teach a neural network to classify new images it has not
seen before.
Formatting
Neural networks eat vectors of data and spit out decisions about those vectors.
All data needs to be vectorized, and the vectors should be the same length when
they enter the neural net. To get vectors of the same length, it’s helpful to have,
say, images of the same size (the same height and width). So sometimes you
need to resize the images. This is called data pre-processing.
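As a rough sketch of this pre-processing step, assuming NumPy and Pillow are available, the snippet below resizes images to a common size and flattens them into equal-length vectors; the file names are hypothetical.

```python
import numpy as np
from PIL import Image  # assuming Pillow is installed

def image_to_vector(path, size=(64, 64)):
    # Resize every image to the same height and width, then flatten it into a
    # fixed-length vector so all inputs entering the network have the same shape.
    img = Image.open(path).convert("RGB").resize(size)
    arr = np.asarray(img, dtype=np.float32) / 255.0  # scale pixels to [0, 1]
    return arr.reshape(-1)                           # 64*64*3 = 12288 values

# vectors = np.stack([image_to_vector(p) for p in ["cat.jpg", "dog.jpg"]])
# vectors.shape -> (2, 12288): same-length vectors ready for the neural net
```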
Accessibility
The data needs to be stored in a place that’s easy to work with. A local file
system, or HDFS (the Hadoop file system), or an S3 bucket on AWS, for
example. If the data is stored in many different databases that are unconnected,
you will have to build data pipelines. Building data pipelines and performing
preprocessing can account for at least half the time you spend building deep-
learning solutions.
If you have labeled data (i.e. categories A, B, C and D), it’s preferable to have an
evenly balanced dataset with 25,000 instances of each label; that is, 25,000
instances of A, 25,000 instances of B and so forth.
Curse of Dimensionality
The Curse of Dimensionality sounds like it’s something straight out of the wizarding world, but it
really is only a very common term you’ll come across in the Machine Learning and Big Data
world. This term describes how the increase in input data dimensions results in an exponential
increase in computational expense and efforts required for processing and analyzing that data.
This leads to overfitting of the model, so even though the model performs really well on
training data, it fails drastically on any real data.
Quite a few algorithms work only on tall, svelte datasets with fewer features and more
samples. Hence, to remove the curse afflicting your model, you might need to put your
data on a diet – i.e., reduce its dimensions through feature selection and feature
engineering techniques.
Dimensionality Reduction to the Rescue
What you need to understand first is that data features are usually correlated. Hence,
the higher dimensional data is dominated by a rather small number of features. If we
can find a subset of the superfeatures that can represent the information just as well as
the original dataset, we can remove the curse of dimensionality!
This is what dimensionality reduction is – a process of reducing the dimension of your
data to a few principal features.
Fewer input dimensions often mean a simpler model with fewer degrees of freedom. A model
with more degrees of freedom is more prone to overfitting. So, it is desirable to have more
generalized models trained on input data with fewer features.
Why is Dimensionality Reduction necessary?
Avoids overfitting – the fewer assumptions a model makes, the simpler it will be.
Easier computation – the fewer the dimensions, the faster the model trains.
Improved model performance – removing redundant features and noise means less misleading data
and better model accuracy.
Lower dimensional data requires less storage space.
Lower dimensional data can work with other algorithms that were unfit for larger dimensions.
How is Dimensionality Reduction done?
Several techniques can be employed for dimensionality reduction depending on the
problem and the data. These techniques are divided into two broad categories:
Feature Selection: choosing the most important features from the data
Feature Extraction: deriving a smaller set of new features that summarize the original ones (for
example, via Principal Component Analysis); a minimal sketch follows below
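A minimal dimensionality-reduction sketch, assuming scikit-learn is available, using PCA to project correlated features onto a few principal components; the synthetic data is made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA  # assuming scikit-learn is installed

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))             # 3 underlying "super" features
data = base @ rng.normal(size=(3, 50))       # 50 correlated observed features
data += 0.01 * rng.normal(size=data.shape)   # small amount of noise

pca = PCA(n_components=3)
reduced = pca.fit_transform(data)            # shape (200, 3)
print(reduced.shape, pca.explained_variance_ratio_.sum())  # close to 1.0
```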
Matrix Factorization
Matrix factorization is one of the most sought-after machine learning recommendation
models. It acts as a catalyst, enabling the system to gauge the customer’s exact
purpose of the purchase, scan numerous pages, shortlist, and rank the right product or
service, and recommend multiple options available. Once the output matches the
requirement, the lead translates into a transaction and the deal clicks.
Content-Based Filtering
This approach recommends items based on user preferences. It matches the
requirement, considering the past actions of the user, patterns detected, or any explicit
feedback provided by the user, and accordingly, makes a recommendation.
Example: If you prefer the chocolate flavor and purchase a chocolate ice cream, the
next time you raise a query, the system shall scan for options related to chocolate, and
then, recommend you to try a chocolate cake.
How does the System make recommendations?
Let us take an example. To purchase a car, in addition to the brand name, people check
for features available in the car, most common ones being safety, mileage, or aesthetic
value. Few buyers consider the automatic gearbox, while others opt for a combination of
two or more features. To understand this concept, let us consider a two-dimensional
vector with the features of safety and mileage.
1. In the above graph, on the left-hand side, we have cited individual preferences,
wherein 4 individuals have been asked to provide a rating on safety and mileage. If
the individuals like a feature, then we assign the value 1, and if they do not like that
particular feature, we assign 0. Now we can see that Persons A and C prefer safety,
Person B chooses mileage and Person D opts for both Safety and Mileage.
2. Cars are rated based on the number of features (items) they offer. A rating of
4 implies more features, and 1 implies fewer features.
• A ranks high on safety and low on mileage
• B is rated high on mileage and low on safety
• C has an average rating and offers both safety and mileage
• D is low on both mileage and safety
3. The blue colored “?” marks a sparse (missing) value, where the person either does not know
about the car, has not considered buying it, or has forgotten to rate it.
4. Let’s understand how the matrix at the center is arrived at. This matrix represents the
overall rating of all 4 cars given by the individuals. Person A has given an overall
rating of 4 to Car A, 1 to Cars B and D, and 2 to Car C. These values are arrived at
through the following calculations:
For Car A = (1 safety preference × 4 safety rating) + (0 mileage preference × 1 mileage rating) = 4
For Car B = (1 safety preference × 1 safety rating) + (0 mileage preference × 4 mileage rating) = 1
For Car C = (1 safety preference × 2 safety rating) + (0 mileage preference × 2 mileage rating) = 2
For Car D = (1 safety preference × 1 safety rating) + (0 mileage preference × 2 mileage rating) = 1
Now on the basis of the above calculations, we can predict the overall rating for each
person and all the cars.
If Person C asks the search engine to recommend available car options, then based on the
available content and his preferences, the machine will compute the table below:
In the above example, we had two matrices, that is, individual preferences and car
features, and 4 observations to enable the comparison. Now, if there are n observations in
both matrices a and b, then the overall ratings are computed with the dot product.
Dot Product
The dot product of two length-n vectors a and b is defined as:
a · b = a₁b₁ + a₂b₂ + ⋯ + aₙbₙ
The dot product is only defined for vectors of the same length, which means there
must be an equal number of observations, n, in each vector.
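A tiny NumPy sketch of this dot product, reusing Person A's preference vector and the car feature values from the worked example above.

```python
import numpy as np

# Person A's preferences: [safety, mileage]; dotting with each car's feature
# vector reproduces the overall ratings 4, 1, 2, 1 from the example.
person_a = np.array([1, 0])
cars = np.array([
    [4, 1],   # Car A: high safety, low mileage
    [1, 4],   # Car B: low safety, high mileage
    [2, 2],   # Car C: average on both
    [1, 2],   # Car D: low on both
])

print(cars @ person_a)   # [4 1 2 1]
```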
Cosine
There are alternative approaches or algorithms available to solve a problem. Cosine
is used as an approach to measure similarities. It is used to determine the nearest
user to provide recommendations. In our chocolate example above, the search
engine can make such a recommendation using cosine. It measures similarities
between two non-zero vectors. The similarity between two vectors a and b is, in fact,
the ratio between their dot product and the product of their magnitudes: cos(θ) = (a · b) / (‖a‖ ‖b‖).
By applying the definition of similarity, this will equal 1 if the two vectors are identical,
and it will be 0, if the two are orthogonal. In other words, similarity is a number ranging
from 0 to 1 and tells us the extent of similarity between the two vectors. Closer to 1 is
highly similar and closer to 0 is dissimilar.
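A short Python sketch of cosine similarity as just described; the example vectors are arbitrary.

```python
import numpy as np

def cosine_similarity(a, b):
    # ratio of the dot product to the product of the magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1, 0]), np.array([1, 0])))  # 1.0 (identical)
print(cosine_similarity(np.array([1, 0]), np.array([0, 1])))  # 0.0 (orthogonal)
print(cosine_similarity(np.array([4, 1]), np.array([2, 2])))  # ~0.86 (similar)
```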
The higher the TF-IDF score (weight), the more important the term, and vice versa.
Advantages of content-based filtering:
a) The model does not require any data about other users, since the recommendations are
specific to one user. This makes it easier to scale to a large number of users.
b) The model can capture the specific interests of a user, and can recommend niche items
that very few other users are interested in.
Disadvantages of content-based filtering:
a) Since the feature representation of the items is hand-engineered to some extent,
this technique requires a lot of domain knowledge. Therefore, the model can only be as
good as the hand-engineered features.
b) The model can only make recommendations based on the existing interests of the user.
In other words, the model has limited ability to expand on the users’ existing interests.
Collaborative Filtering
This approach uses similarities between users and items simultaneously, to provide
recommendations. It is the idea of recommending an item or making a prediction,
depending on other like-minded individuals. It could comprise a set of users, items,
opinions about an item, ratings, reviews, or purchases.
Example: Suppose Persons A and B both like the chocolate flavor and both have
tried the ice cream and the cake. Then, if Person A buys chocolate biscuits, the system will
recommend chocolate biscuits to Person B.
In collaborative filtering, we do not have the rating of individual preferences and car
preferences. We only have the overall rating, given by the individuals for each car. As
usual, the data is sparse, implying that the person either does not know about the car or
is not under the consideration list for buying the car, or has forgotten to give the rating.
The task at hand is to predict the rating that Person C might assign to Car C (the “?” marked
in yellow) based on the similarities in ratings given by other individuals. There are three
steps involved in arriving at a collaborative filtering recommendation.
Step-1
Normalization: Usually, while assigning a rating, individuals tend to give either a high
rating or a low rating across all parameters. Normalization helps balance
and even out such measures. This is done by taking the average of the available ratings and
subtracting it from each individual rating (x − x̄).
a) For Person A, x̄ = (4 + 1 + 2 + 1) / 4 = 2; for Person B, x̄ = (1 + 4 + 2) / 3 ≈ 2.3.
Similarly, we can do it for Persons C and D.
b) Then we subtract the average from each individual rating.
For Person A this gives 4 − 2, 1 − 2, 2 − 2, 1 − 2; for Person B it gives 1 − 2.3, 4 − 2.3, 2 − 2.3.
Similarly, we can do it for Persons C and D.
You get the table below for all the individuals.
If you add all the numbers in each row, they sum to zero. In effect, we have
centered each individual’s ratings at zero. Starting from zero, if the individual rating for a car is
positive, it means the person likes the car, and if it is negative, it means they do not like the car.
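A small NumPy sketch of this normalization (mean-centering) step, using the ratings of Persons A and B from the example; the missing rating is represented as NaN.

```python
import numpy as np

ratings = np.array([
    [4.0, 1.0, 2.0, 1.0],       # Person A -> mean 2.0
    [1.0, 4.0, 2.0, np.nan],    # Person B -> mean (1+4+2)/3 ≈ 2.33
])

# Subtract each person's average rating (x - x̄), skipping missing entries
row_means = np.nanmean(ratings, axis=1, keepdims=True)
centered = ratings - row_means
print(centered)
# Each row of `centered` sums to (approximately) zero over its rated entries;
# positive values mean "likes the car", negative values mean "does not".
```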
Step-2
Matrix factorization assumes that each user and item can be represented by a vector of latent features,
such as genre, style, quality, or popularity. These features capture the underlying patterns and preferences
of the users and items. For example, a user who likes action movies may have a high value for the action
feature, while a movie that is a comedy may have a high value for the comedy feature. By multiplying the
user feature vector and the item feature vector, we can obtain an estimate of the rating that the user would
give to the item. The goal of matrix factorization is to find the optimal feature vectors that minimize the
difference between the estimated ratings and the actual ratings.
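As a hedged illustration of this optimization, the NumPy sketch below fits user and item feature vectors by stochastic gradient descent on a tiny, made-up rating matrix; it is a toy sketch, not a production recommender.

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[4.0, 1.0, 2.0, 1.0],
              [1.0, 4.0, 2.0, np.nan],   # NaN marks a missing rating
              [5.0, 2.0, np.nan, 1.0]])
n_users, n_items, k = R.shape[0], R.shape[1], 2
U = 0.1 * rng.normal(size=(n_users, k))     # user feature vectors
V = 0.1 * rng.normal(size=(n_items, k))     # item feature vectors

lr, reg = 0.05, 0.01
for _ in range(2000):
    for u in range(n_users):
        for i in range(n_items):
            if np.isnan(R[u, i]):
                continue                     # only fit the observed ratings
            err = R[u, i] - U[u] @ V[i]      # difference: actual vs. estimate
            U[u] += lr * (err * V[i] - reg * U[u])   # regularized SGD update
            V[i] += lr * (err * U[u] - reg * V[i])

print(np.round(U @ V.T, 2))   # predictions, including the missing cells
```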
Matrix factorization has several advantages for recommender systems. First, it can handle sparse and
incomplete data, which is often the case for user-item ratings. Matrix factorization can fill in the missing
values and predict ratings for unseen items or users. Second, it can reduce the dimensionality and
complexity of the data, which can improve the efficiency and scalability of the recommender system.
Matrix factorization can compress a large matrix into smaller matrices that capture the essential
information and reduce the noise. Third, it can discover latent features and patterns that are not obvious or
explicit in the data. Matrix factorization can reveal hidden similarities and preferences among users and
items, which can enhance the quality and diversity of the recommendations.
Matrix factorization also has some disadvantages for recommender systems. First, it can suffer from
overfitting and underfitting, which can affect the accuracy and generalization of the recommendations.
Overfitting occurs when the feature vectors fit the data too well and capture the noise or outliers, while
underfitting occurs when the feature vectors fit the data too poorly and miss the important information. To
avoid these problems, matrix factorization needs to balance the trade-off between fitting the data and
regularizing the feature vectors. Second, it can be sensitive to the choice of parameters, such as the
number of features, the learning rate, and the regularization term. These parameters can influence the
performance and convergence of the matrix factorization algorithm. To find the optimal parameters,
matrix factorization needs to perform cross-validation or grid search, which can be time-consuming and
computationally expensive. Third, it can be limited by the linearity and independence assumptions, which
may not hold for some data or scenarios. Matrix factorization assumes that the rating is a linear
combination of the features, and that the features are independent of each other. However, in some cases,
the rating may depend on nonlinear or interactive features, or on external factors such as context, time, or
social influence. To address these limitations, matrix factorization may need to incorporate additional
information or techniques, such as nonlinear functions, feature engineering, or hybrid models.
In supervised learning, the machine is trained on a set of labeled data, which means that the input data is
paired with the desired output. The machine then learns to predict the output for new input data.
Supervised learning is often used for tasks such as classification, regression, and object detection.
In unsupervised learning, the machine is trained on a set of unlabeled data, which means that the input
data is not paired with the desired output. The machine then learns to find patterns and relationships in the
data. Unsupervised learning is often used for tasks such as clustering, dimensionality reduction, and
anomaly detection.
Supervised learning is a type of machine learning algorithm that learns from labeled data. Labeled data is
data that has been tagged with a correct answer or classification.
The Netflix Prize is likely the most famous example many data
scientists explored in detail for matrix factorization. Simply put, the
challenge was to predict a viewer’s rating of shows that they hadn’t
yet watched based on their ratings of shows that they had watched in
the past, as well as the ratings that other viewers had given a show.
This takes the form of a matrix where the rows are users, the columns
are shows, and the values in the matrix correspond to the rating of the
show. Most of the values in this matrix are missing, as users have not
rated (or seen) the majority of shows. Below is an illustrative example
of what this data tends to look like, where ratings are shown in purple
and missing values are denoted in white.
The best techniques for filling in these missing values can vary
depending on how sparse the matrix is. If there are not many missing
values, frequently the mean or median value of the observed values
can be used to fill in the missing values. If the data is categorical, as
opposed to numeric, the most frequently observed value is used to fill
in the missing values. This technique, while simple and fast, is fairly
naive, as it misses the interactions that can happen between the
variables in a data set. Most egregiously, if the data is numeric
and bimodal between two distant values, the mean imputation
method can create values that otherwise would not exist in the data
set at all. These improper imputed values increase the noise in the
data set the sparser it is, so this becomes an increasingly bad strategy
as sparsity increases.
Matrix factorization is a simple idea that tries to learn connections
between the known values in order to impute the missing values in a
smart fashion. Simply put, for each row and for each column, it
learns (k) numeric “factors” that represent the row or column. In the
example of the Netflix challenge, the factors for movies could
correspond to “is this a comedy?” or “is a big name actor/actress in
this movie?” The factors for users might correspond to “does this user
like comedies?” and “does this user like this actor/actress?” Ratings
are then determined by a dot product between these two smaller
matrices (the user factors and the movie factors) such that an
individual prediction for how user (i) will rate movie (j) is
calculated by:
rating(i, j) = ∑_{f=1}^{k} user(i, f) · movie(j, f)
This would line up the user factor “does this user like comedies?” with
the movie factor “is this a comedy?” and give a boost if both were
positive.
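As a hedged sketch of this factor-based prediction, the PyTorch snippet below implements the same dot product of learned user and movie factors with embedding tables; the sizes and ids are made up (the tutorial itself uses Apache MXNet, as mentioned below).

```python
import torch
import torch.nn as nn

class MatrixFactorization(nn.Module):
    def __init__(self, n_users, n_movies, k=16):
        super().__init__()
        self.user_factors = nn.Embedding(n_users, k)    # k factors per user
        self.movie_factors = nn.Embedding(n_movies, k)  # k factors per movie

    def forward(self, user_ids, movie_ids):
        u = self.user_factors(user_ids)          # (batch, k)
        m = self.movie_factors(movie_ids)        # (batch, k)
        return (u * m).sum(dim=1)                # rating(i, j) = sum_f u_f * m_f

model = MatrixFactorization(n_users=1000, n_movies=500)
pred = model(torch.tensor([0, 1]), torch.tensor([10, 42]))  # two predictions
# Train by minimizing nn.MSELoss() between `pred` and the observed ratings.
```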
This tutorial will first introduce both traditional and deep matrix
factorization by using them to model synthetic data. We will then move
on to a real-world example and apply these concepts to model the
MovieLens 20M data set, which is conceptually identical to the Netflix
Prize challenge. Lastly, we’ll explore how one can use recent
advances in deep learning and the flexibility of neural networks to
build even more complex models. While this may sound complicated,
fortunately Apache MXNet makes it easy!
The expressive power of Graph Neural Networks (GNNs) has been studied extensively through the
Weisfeiler-Leman (WL) graph isomorphism test. However, standard GNNs and the WL
framework are inapplicable for geometric graphs embedded in Euclidean space, such as
biomolecules, materials, and other physical systems. In this work, we propose a geometric
version of the WL test (GWL) for discriminating geometric graphs while respecting the
underlying physical symmetries: permutations, rotation, reflection, and translation. We use GWL
to characterise the expressive power of geometric GNNs that are invariant or equivariant to
physical symmetries in terms of distinguishing geometric graphs. GWL unpacks how key design
choices influence geometric GNN expressivity: (1) Invariant layers have limited expressivity as
they cannot distinguish one-hop identical geometric graphs; (2) Equivariant layers distinguish a
larger class of graphs by propagating geometric information beyond local neighbourhoods; (3)
Higher order tensors and scalarisation enable maximally powerful geometric GNNs; and (4)
GWL's discrimination-based perspective is equivalent to universal approximation.
In recent years, artificial intelligence utilizing machine learning algorithms has begun to outperform
humans at several tasks, e.g., playing board games and diagnosing skin cancer. These advances are now
being considered in safety-critical autonomous systems where software defects may cause severe harm
to humans and the environment, e.g., airborne collision avoidance systems. Several researchers have
raised concerns regarding the lack of verification methods for these kinds of systems in which machine
learning algorithms are used to train software deployed in the system. Within the avionics sector,
guidelines describe how design organizations may apply formal methods to the verification of safety-
critical software. Applying these methods to complex and safety-critical software is a non-trivial task due
to practical limitations in computing power and challenges in qualifying complex verification tools. These
challenges are often caused by the high expressiveness of the language in which the software is
implemented. Most research that applies formal methods to the verification of machine learning is so
far focused on the verification of neural networks, but there are other models that may be more
appropriate when verifiability is important, e.g., decision trees, random forests, and
gradient boosting machines. Where neural networks are subject to verification of correctness, they are
usually adopted in non-safety-critical cases, e.g., fairness with respect to individual discrimination,
where probabilistic guarantees are plausible [2]. Since our aim is to support safety arguments for digital
artefacts that may be deployed in hazardous situations, we need to rely on guarantees of absence of
misclassifications, or at least recognize when such guarantees cannot be provided. The tree-based
models provide such an opportunity since their structural simplicity makes them easy to analyze
systematically, but large (yet simple) models may still prove hard to verify due to combinatorial
explosion.
• A generalization to include more tree ensembles, e.g., gradient boosting machines, with updated
tool support (VoTE).
• A soundness proof of the associated approximation technique used for this purpose.
• An improved node selection strategy that yields speed improvements in the range of 4%–91% for trees.
Formal Verification of Neural Networks There has been extensive research on formal verification of
neural networks. Pulina and Tacchella [25] combine SMT solvers with an abstraction-refinement
technique to analyze neural networks with non-linear activation functions. They conclude that formal
verification of realistically sized networks is still an open challenge. Scheibler et al. [27] use bounded
model checking to verify a non-linear neural network controlling an inverted pendulum. They encode
the neural network and differential equations of the system as an SMT formula, and try to verify
properties without success.
Random matrix theory, at its inception, primarily dealt with the eigenvalue distribution (also
referred to as the spectral measure) of large dimensional random matrices. One of the key
technical tools to study these measures is the Stieltjes transform, often presented as the central
object of the theory [Bai and Silverstein, 2010, Pastur and Shcherbina, 2011]. But signal
processing and machine learning alike are often more interested in subspaces and eigenvectors
(which often carry the structural information of the data) than in eigenvalues. Subspace or
spectral methods, such as principal component analysis (PCA), spectral clustering, and some
semi-supervised learning techniques, are built directly upon the eigenspace spanned by the
several top eigenvectors. Consequently, beyond the Stieltjes transform, a more general
mathematical object, the resolvent of large random matrices will constitute the cornerstone of
the monograph. The resolvent of a matrix gives access to its spectral measure, to the location
of its isolated eigenvalues, to the statistical behavior of their associated eigenvectors when
random, and consequently provides an entry-door to the performance analysis of numerous
machine learning methods. This chapter introduces the fundamental objects and tools
necessary to characterize the behavior of large dimensional random matrices (the resolvent,
the Stieltjes transform method, etc.) in Section 2.1, with a particular focus on the modern and
powerful technical approach of deterministic equivalents. Section 2.2 then presents some
foundational random matrix results
Modern machine learning tasks are often high-dimensional, due to the large amount of
data, features, and trainable parameters. Mathematical tools such as random matrix
theory have been developed to precisely study simple learning models in the high-
dimensional regime, and such precise analysis can reveal interesting phenomena that
are also empirically observed in deep learning.
In this talk, Denny Wu from the University of Toronto will introduce two examples
studied by him and his research team. He will first talk about the selection of
regularisation hyperparameters in the over-parameterised regime and how the team
established a set of equations that rigorously describes the asymptotic generalisation
error of the ridge regression estimator. These equations led to some surprising findings:
(i) the optimal ridge penalty can be negative and (ii) regularisation can suppress
“multiple descents” in the risk curve. He will then discuss practical implications such as
the implicit bias of first- and second-order optimisers in neural network training.
Next, we will go beyond linear models and characterize the benefit of gradient-based
representation (feature) learning in neural networks. By studying the precise
performance of kernel ridge regression on the trained features in a two-layer neural
network, his team has proven that feature learning could result in a considerable
advantage over the initial random features model, highlighting the role of learning rate
scaling in the initial phase of gradient descent.
Deterministic and random equivalents This monograph is concerned with the situation where M is a
large dimensional random matrix, the eigenvalues and eigenvectors of which need be related to the
statistical nature of the model design of M. In the early days of random matrix theory, the main focus
was on the limiting spectral measure of M ∈ R n×n, that is the characterization of a certain “limit” to the
spectral measure µM of M as the size of M increases. For this purpose, a natural approach is to study
the random Stieltjes transform mµM(z) and to show that it admits a limit (in probability or almost
surely) m(z) as n → ∞. However, this method has strong limitations: (i) it supposes that such a limit
exists, therefore restricting the study to very regular models for M, and (ii) it only quantifies (1/n) tr Q_M
(through the Stieltjes transform), thereby discarding all subspace information about M carried in the
resolvent Q_M. As a consequence, a further study of the eigenvectors of M often requires a complete
rework. To avoid these limitations, modern random matrix theory focuses instead on the notion of
deterministic equivalents which are deterministic matrices – thus finite dimensional objects rather than
limits – having (in probability or almost surely) asymptotically the same scalar observations as the
random ones. In particular, these scalar observations of deterministic equivalents
Infinite-width-then-infinite-depth limit: in this case, the width is taken to infinity first, then the depth is
taken to infinity. This is the infinite-depth limit of infinite-width neural networks. This limit was
notably used to derive the Edge of Chaos initialization scheme.
The joint infinite-width-and-depth limit: in this case, the depth-to-width ratio is fixed, and therefore, the
width and depth are jointly taken to infinity at the same time. There are few works that study the joint
width-depth limit.
The empirical success of modern Deep Neural Networks (DNNs) has sparked a growing interest in the
theoretical understanding of these models. An important development in this direction was the
introduction of the Neural Tangent Kernel (NTK) framework
which provides a dual view of Gradient Descent (GD) in function space. The NTK is the dot product
kernel of the tangent features (gradient features), given by
K(x, x′) = ∇_θ f(x) ∇_θ f(x′)ᵀ,
where f is the network output and θ is the vector of model parameters. The NTK has been the subject of
an extensive body of literature, particularly in the NTK regime, where the NTK remains constant during training.
Role of hidden layers. In the context of DNNs, hidden layers act as successive embeddings that evolve in
data-dependent directions during training. However, it is still unclear how different layers contribute to
model performance. A possible approach in this direction is the study of feature learning in each layer,
i.e. how features adapt to data as we train the model. A simple way to do this is by measuring the
change in the values of pre-activations as we train the network;
For each layer ℓ, they measured the alignment between the layer tangent kernel matrix K̂_ℓ and the data
labels, where K_ℓ(x, x′) = ∇_{θ_ℓ} f(x) ∇_{θ_ℓ} f(x′)ᵀ and θ_ℓ is the vector of parameters in the ℓ-th layer. The
kernel K̂_ℓ can be seen as the NTK of the network if only the parameters of the ℓ-th layer are allowed to
change (other layers are frozen; see Section 2.1 for a detailed discussion). The authors demonstrated the
existence of an alignment pattern in which the tangent features of some hidden layers are significantly
more aligned with the data labels compared to other layers.
In general, the alignment reaches its maximum for some hidden layer and tends to be minimal in the
external layers (first and last). To the best of our knowledge, no explanation has been provided for this
phenomenon in the literature. In this paper, we propose an explanation based on the theory of signal
propagation in randomly initialized neural networks. Our intuition is based on the observation that
tangent kernels can be decomposed as a Hadamard product of two quantities linked to how signals
propagate in DNNs.
Random DNN
Deep Neural Networks (DNNs) are a fundamental tool in the modern development of
Machine Learning. Beyond the merits of the training algorithms, a great part of DNNs
success is due to the inherent properties of their layered architectures, i.e., to the
introduced architectural biases. In this tutorial, we explore recent classes of DNN
models wherein the majority of connections are randomized or more generally fixed
according to some specific heuristic.
Limiting the training algorithms to operate on a reduced set of weights implies intriguing
features. Among them, the extreme efficiency of the learning processes is undoubtedly
a striking advantage with respect to fully trained architectures. Besides, despite the
involved simplifications, randomized neural systems possess remarkable properties
both in practice, achieving state-of-the-art results in multiple domains, and theoretically,
allowing the analysis of intrinsic properties of neural architectures.
This tutorial covers all the major aspects regarding Deep Randomized Neural Networks,
from feed-forward and convolutional neural networks, to dynamically recurrent deep
neural systems for structures. The tutorial is targeted at both researchers and
practitioners, from academia or industry, who are interested in developing DNNs that can
be trained efficiently, and possibly embedded into low-power devices.
Randomization can enter the design of Deep Neural Nets in several disguises, including
training algorithms, regularization, etc. We focus on “Random-weights” neural
architectures where part of the weights are left untrained after initialization.
Jacobian Matrix
The Jacobian matrix and determinant are fundamental mathematical concepts that play a crucial
role in understanding the relationships between variables in machine learning models. In this blog
post, we’ll delve into these concepts, explaining what they are, how they are calculated, and their
significance in the context of machine learning. We’ll also provide practical examples and
demonstrate how to compute the Jacobian matrix and determinant using Python.
The Jacobian matrix is a mathematical construct that captures the relationships between multiple
input variables and multiple output variables in a system. It is particularly useful in scenarios
where a function or transformation has multiple inputs and multiple outputs. In the context of
machine learning, the Jacobian matrix helps us analyze how changes in the input variables affect
the output variables.
To compute the Jacobian matrix, we need to calculate the partial derivatives of each output
variable with respect to each input variable. These partial derivatives capture the rate of change of each
output with respect to each input. As a simple example, consider a regression model that
predicts the price of a house based on its size (in square feet) and the number of bedrooms. The
model takes these two features as inputs and produces a single output, which is the predicted price
of the house.
To calculate the Jacobian matrix, we need to compute the partial derivatives of the output (price)
with respect to each input variable (size and number of bedrooms). Let’s assume we have the
following data:
House 1: Size: 1,500 sq ft; Bedrooms: 3; Price: $250,000
House 2: Size: 2,000 sq ft; Bedrooms: 4; Price: $320,000
To calculate the partial derivative with respect to size, we can calculate the change in price for a
small change in size while keeping the number of bedrooms constant. Let’s say we increase the
size of each house by 100 square feet:
House 1: Size: 1,600 sq ft; Bedrooms: 3; Price: $270,000
House 2: Size: 2,100 sq ft; Bedrooms: 4; Price: $340,000
Now, we can calculate the partial derivative with respect to size for each house:
Partial derivative of price with respect to size (House 1) = (New Price − Old Price) / (New Size − Old Size)
= ($270,000 − $250,000) / (1,600 − 1,500) = $20,000 / 100 = $200 per square foot
Partial derivative of price with respect to size (House 2) = (New Price − Old Price) / (New Size − Old Size)
= ($340,000 − $320,000) / (2,100 − 2,000) = $20,000 / 100 = $200 per square foot
Similarly, we can calculate the partial derivative with respect to the number of bedrooms by holding the
size constant and measuring the change in price for a change in the number of bedrooms. Together,
these partial derivatives form the Jacobian matrix. In this case, the Jacobian matrix would be a 1×2
matrix (1 output and 2 inputs) containing the partial derivatives of the price with respect to the input
variables of the machine learning model. By examining the values in the Jacobian matrix, we can
understand how changes in the input variables impact the output variables. A large value in a
Jacobian matrix entry indicates a strong influence, while a small value indicates a weak influence.
For instance, in our example, if the partial derivative of price with respect to size has a large
magnitude, it suggests that the size of the house has a significant impact on its price prediction.
On the other hand, if the partial derivative of price with respect to the number of bedrooms has a
small magnitude, it indicates that the number of bedrooms has a relatively weak influence on the
price prediction.
Analyzing the Jacobian matrix can help us uncover important insights about feature importance,
model behavior, and optimization. It allows us to identify the most influential input variables and
focus on them when refining and optimizing the model.
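A small Python sketch of computing such a Jacobian numerically with finite differences; the price function here is a made-up linear model (including the hypothetical $10,000-per-bedroom coefficient), chosen only to reproduce the $200-per-square-foot slope from the example.

```python
import numpy as np

def price(x):                     # x = [size in sq ft, number of bedrooms]
    return np.array([200.0 * x[0] + 10_000.0 * x[1]])   # hypothetical model

def numerical_jacobian(f, x, eps=1e-4):
    x = np.asarray(x, dtype=float)
    y0 = f(x)
    J = np.zeros((y0.size, x.size))          # rows = outputs, cols = inputs
    for j in range(x.size):
        x_step = x.copy()
        x_step[j] += eps
        J[:, j] = (f(x_step) - y0) / eps     # partial derivative wrt input j
    return J

J = numerical_jacobian(price, [1500.0, 3.0])
print(J)   # approximately [[200., 10000.]] -> $200 per sq ft, $10,000 per bedroom
```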
DNN Gaussian
A Neural Network Gaussian Process (NNGP) is a Gaussian process (GP) obtained as the limit of a
Bayesian neural network as its layer widths are taken to infinity.
Bayesian networks are a modeling tool for assigning probabilities to events, and thereby
characterizing the uncertainty in a model's predictions. Deep learning and artificial neural
networks are approaches used in machine learning to build computational models which learn
from training examples. Bayesian neural networks merge these fields. They are a type of neural
network whose parameters and predictions are both probabilistic.[9][10] While standard neural
networks assign a single value to each weight, Bayesian neural networks place probability
distributions over the weights.
Computation in artificial neural networks is usually organized into sequential layers of artificial
neurons. The number of neurons in a layer is called the layer width. When we consider a
sequence of Bayesian neural networks with increasingly wide layers (see figure), they converge
in distribution to a NNGP. This large width limit is of practical interest, since the networks often
improve as layers get wider. And the process may give a closed form way to evaluate networks.
NNGPs also appear in several other contexts: an NNGP describes the distribution over predictions made
by wide non-Bayesian artificial neural networks after random initialization of their parameters,
but before training, and it appears as a term in neural tangent kernel prediction equations. Every
setting of a neural network’s parameters corresponds to a specific function
computed by the neural network. A prior distribution over neural network parameters
therefore corresponds to a prior distribution over functions computed by the network. As neural
networks are made infinitely wide, this distribution over functions converges to a Gaussian
process for many architectures.
The notation used in this section is the same as the notation used below to derive the
correspondence between NNGPs and fully connected networks, and more details can be found
there.
Orthogonal initialization
Orthogonal initialization initializes the weight matrix with an orthogonal
matrix. This method helps to avoid the vanishing and exploding
gradients problem that can occur in deep neural networks.
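A minimal sketch, assuming PyTorch, of applying orthogonal initialization to a layer's weight matrix.

```python
import torch
import torch.nn as nn

layer = nn.Linear(256, 256)
nn.init.orthogonal_(layer.weight)   # fill the weight matrix with an orthogonal matrix
nn.init.zeros_(layer.bias)

W = layer.weight.detach()
# W @ W.T is approximately the identity, which helps keep gradient norms stable
print(torch.allclose(W @ W.T, torch.eye(256), atol=1e-4))  # True
```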
UNIT-3
Deep Learning through the Lens of Information Theory
In the case of deep learning, the most common use case for
information theory is to characterize probability distributions and to
quantify the similarity between two probability distributions. We use
concepts like KL divergence and Jensen-Shannon divergence for
these purposes.
Information Theory is quite vast, but I’ll try to summarise key bits of
information (pun not intended) in a short glossary.
1. Entropy
H = −∑ᵢ pᵢ log₂(pᵢ)
H(X) = E_X[I(x)] = −∑_{x∈X} p(x) log p(x)
3. Conditional Entropy
H(X|Y) = H(X, Y) − H(Y)
4. Mutual Information
I(X;Y) = E_{X,Y}[SI(x, y)] = ∑_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]
I(X;Y) = H(X) − H(X|Y)
Intuitively, this means that knowing Y, we can save an
average of I(X;Y) bits in encoding X compared to
not knowing Y. (A small numeric check of these identities follows the glossary below.)
I(X;Y) = I(Y;X) = H(X) + H(Y) − H(X, Y)
5- Markov Chain
A Markov property (also called the memoryless property) of a
stochastic process refers to the dependence of conditional
probability distributions of future states of the process only on
the present states and not on the past states. A process with
this property is called a Markov process. A Markov Chain is a
type of Markov process that has a discrete state space.
7- Reparametrization invariance
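A small NumPy check of the entropy, conditional entropy, and mutual information identities from the glossary above, on a made-up 2×2 joint distribution; logs are base 2, so the results are in bits.

```python
import numpy as np

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])       # joint distribution p(x, y)
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

def H(p):                            # entropy of a distribution
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_xy = H(p_xy.ravel())               # joint entropy H(X, Y)
H_x_given_y = H_xy - H(p_y)          # H(X|Y) = H(X, Y) - H(Y)
mi = H(p_x) + H(p_y) - H_xy          # I(X;Y) = H(X) + H(Y) - H(X, Y)
print(H(p_x), H_x_given_y, mi)       # 1.0, ~0.72, ~0.28
```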
Kullback-Leibler Divergence
Now that we have understood Entropy and Cross Entropy, we are in a
position to understand another concept with a scary name. The name is so
scary that even practitioners call it KL Divergence instead of the actual
Kullback-Leibler Divergence. This monster from the depths of information
theory is a pretty simple and easy concept to understand.
The Math
Let’s take a look at how we arrive at the formula, which will enhance our
understanding of KL Divergence.
We know from our intuition that KL(P||Q) is the information lost when
approximating P with Q. What would that be?
Just to recap the analogy of our example to an ML system to help you connect
the dots better. Let’s use the situation where Australia is sending messages to
India using the same coding scheme as was devised for messages from India
to Australia.
The boxes with balls (in India and Australia) are probability distributions
When we prepare a machine learning model(classification or regression),
what we are doing is approximating a probability distribution and in most
cases output the expected value of the probability distribution.
Devising a coding scheme based on a probability distribution is like
preparing a model to output that distribution.
The box with balls in India is the predicted distribution, because this is the
distribution which we assume when we devise the coding scheme.
The box with the balls in Australia is the true distribution, because this is
where the coding scheme is used.
We know the Expected Value of Bits when we use the coding scheme in
Australia. It’s the Cross Entropy of Q w.r.t. P (where P is the predicted
distribution and Q is the true distribution).
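A small NumPy sketch of that relationship: the cross entropy (the expected bits when a code built from the predicted distribution is used on data from the true distribution) equals the entropy of the true distribution plus the KL divergence; the two distributions here are made up.

```python
import numpy as np

true_dist = np.array([0.5, 0.3, 0.2])   # the box where the code is used
pred_dist = np.array([0.4, 0.4, 0.2])   # the box the code was designed for

entropy = -np.sum(true_dist * np.log2(true_dist))
cross_entropy = -np.sum(true_dist * np.log2(pred_dist))
kl_divergence = np.sum(true_dist * np.log2(true_dist / pred_dist))
print(cross_entropy, entropy + kl_divergence)   # the two values match (~1.52)
```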
PYTORCH
All the TensorFlow 2.0 losses expect probabilities as input by default, i.e.
logits are expected to be passed through Sigmoid or Softmax before being fed to the losses.
But they provide a from_logits parameter which, when set to True, accepts logits
and performs the conversion to probabilities internally.
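A short sketch of the from_logits flag described above, assuming TensorFlow 2 / Keras is available; the logits and labels are arbitrary.

```python
import tensorflow as tf

logits = tf.constant([[2.0, -1.0, 0.5]])
labels = tf.constant([[1.0, 0.0, 0.0]])

# With from_logits=True the loss applies the softmax internally,
# so the model can output raw logits instead of probabilities.
loss_from_logits = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
loss_from_probs = tf.keras.losses.CategoricalCrossentropy(from_logits=False)

print(loss_from_logits(labels, logits).numpy())
print(loss_from_probs(labels, tf.nn.softmax(logits)).numpy())  # same value
```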
The mutual information of two random variables X and Y is
I(X;Y) = ∑_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ],
where p(x, y) is the joint probability distribution and p(x) and p(y) are
the marginal probability distributions. The mutual information may also
be expressed as the Kullback–Leibler (KL) divergence between the joint
distribution and the product of the marginal distributions of the random
variables X and Y:
I(X;Y) = D_KL( p(x, y) ∥ p(x) p(y) )
Computing the mutual information of two random variables is in general
challenging. Exact computation can be done in the discrete setting,
where the sum in Equation (1) can be calculated exhaustively, or in the
continuous setting, provided the probability distributions are known in
advance. Several methods to estimate the mutual information have been
introduced. The most common ones are non-parametric, including
discretizing the values using binning and resorting to non-parametric
kernel-density estimators
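As a rough illustration of the binning approach mentioned above, the NumPy sketch below discretizes two continuous variables with a 2-D histogram and plugs the empirical distribution into the discrete mutual-information sum; the bin count and sample data are arbitrary.

```python
import numpy as np

def mutual_information_binned(x, y, bins=20):
    # Empirical joint distribution from a 2-D histogram of the samples
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask]))

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = x + 0.5 * rng.normal(size=10_000)                 # correlated with x
print(mutual_information_binned(x, y))                # clearly > 0 (in nats)
print(mutual_information_binned(x, rng.normal(size=10_000)))  # close to 0
```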
Activation Binning
We design a provably accurate estimator of I(X; T_ℓ) inspired by our recent work on
differential entropy estimation. Given a feature dataset X = {x_i}, i ∈ [n], drawn i.i.d. according
to P_X, our I(X; T_ℓ) estimator is constructed from estimators of h(T_ℓ) and h(p_{T_ℓ | X=x}), ∀x ∈ X.
We propose a sampling method that reduces the estimation of each
entropy to that framework. Using entropy estimation risk bounds derived
therein, we control the error of our sample propagation (SP) estimator Î_SP of I(X;
T_ℓ). Each differential entropy is estimated and computed via a two-step process.
First, we approximate each true entropy by the differential entropy of a known
Gaussian mixture (defined only through the available resources: the obtained
samples and the noise parameter). This estimate converges to the true value at
the parametric estimation rate, uniformly in the dimension. However, since the
Gaussian mixture entropy has no closed-form expression, in the second
(computational) step we develop a Monte Carlo integration (MCI) method to
numerically evaluate it. Mean squared error (MSE) bounds on the MCI computed
value are established
Rényi’s α-Entropy
A multivariate matrix-based Rényi’s α-entropy method has been proposed
for application to a LeNet-5 network. This approach is suitable for CNN
networks, as each CNN layer has several channels, which all contain
some information on the input and output random variables. However,
two distinct channels of a single layer can contain the same information
on the input and output random variables. Therefore, summing up the
mutual information estimates between each channel and the input or
output random variable only gives an upper bound on the mutual
information that has little relevance to the true mutual information
value. The reported experiments for LeNet-5 result in mutual information
estimates for the various layers that satisfy the DPI. However, our
experiments for VGG-16, which has up to 512 channels, did not yield
estimates that comply with the DPI.