
Mod-1 Part 1


DEEP LEARNING

23DS5PCDLG
Syllabus
Sem V
Course Title: Deep Learning
Course Code: 23DS5PCDLG
Total Contact Hours: 50
L-T-P: 4-0-1
Total Credits: 5
Laboratory Plan

Sl. No. Lab Program

1 Write a program to implement XOR gates using Perceptron.


2 Design a deep NN, optimize the network with Gradient Descent, and optimize the same with Stochastic Gradient Descent (SGD).

3 Classification of MNIST Dataset using CNN.


4 Implement Region-Based CNN for object detection.
5 Implement RNN for handwriting digit recognition.
6 Implement Bidirectional RNNs for music generation.
7 Implement Bidirectional LSTM for sentiment analysis.
8 Implement Variational Autoencoders for image-denoising.
9 Implementation of a Restricted Boltzmann Machine (RBM) that demonstrates
stacking.
10 Implement Generative Adversarial Networks to generate realistic photographs.
Teaching a child to recognize objects
• Neurons as Students: Deep learning models have layers of "neurons"
similar to how the child has different senses and thought processes. These
neurons are connected and work together to process information.
• Learning from Examples: You show the child many pictures of dogs and
cats. At first, they might not know how to distinguish between the two. But
with enough examples (lots of data), the child begins to notice
patterns, like dogs usually having longer snouts or cats having pointy ears.
• Layers of Understanding: The child might first recognize simple patterns
like shapes or colors (early layers in a deep learning model). As they practice,
they move from simple features to more complex ones, identifying
specific characteristics like fur texture or the shape of eyes (deeper layers
of a model capture more detailed features).
• Training through Mistakes: Each time the child makes a mistake, you
gently correct them. Over time, their understanding improves
(backpropagation, where the model adjusts itself based on errors
during training).
• Autonomous Recognition: Eventually, the child becomes so good at
recognizing dogs and cats that they no longer need help—just like a
well-trained deep learning model can make accurate predictions
independently, even with new data it has never seen before.
• Deep learning is like teaching a child: it involves exposing the model
(child) to lots of examples, helping it learn patterns step by step, and
correcting it until it can make decisions independently.
Overview
• Introduction to Artificial Neural Networks
• From Biological to Artificial Neurons
• Biological Neurons
• Logical Computations with Neurons
• The Perceptron
• The Multilayer Perceptron and Backpropagation
• Regression and Classification MLPs
• Implementing MLPs with Keras
• Fine-tuning Neural Network Parameters
• Introduction to Deep Learning
• Challenges motivating Deep Learning
• Historical Trends in Deep Learning
• Deep Feedforward Networks
• Gradient-based Learning
• Efficient Computation


Introduction to Artificial Neural Networks (ANN)

• Key idea of artificial neural networks (ANNs):


• to look at the brain’s architecture for inspiration to build an
intelligent machine.
• ANNs are versatile, powerful, and scalable, making them ideal to
tackle large and highly complex Machine Learning tasks, such as
• classifying billions of images (e.g., Google Images),
• powering speech recognition services (e.g., Apple’s Siri),
• recommending the best videos to watch to hundreds of millions of users
every day (e.g., YouTube)
From Biological to Artificial Neurons
how artificial neural networks came to be!
Nice-to-know

• First introduced in 1943 by the neurophysiologist Warren McCulloch


and the mathematician Walter Pitts in “A Logical Calculus of Ideas
Immanent in Nervous Activity”
• In the early 1980s, interest in connectionism (the study of neural
networks) was revived, as new architectures were invented and
better training techniques were developed.
• By the 1990s, powerful Machine Learning techniques, such as
Support Vector Machines, were invented. These techniques offered
better results and stronger theoretical foundations than ANNs.
Current wave of interest in ANNs
This wave is different and will have a much more profound impact on our lives:

1. Data Availability: a huge quantity of data is available to train neural


networks, and ANNs frequently outperform other ML techniques
on very large and complex problems.
2. Increase in computing power: makes it possible to train large neural
networks in a reasonable amount of time using GPU cards.
3. The training algorithms have been improved- relatively small
tweaks have a huge positive impact.
4. ANN training algorithms were doomed because they were believed
to get stuck in local optima, but it turns out that they are usually
close to the global optimum.
5. Increased funding for ANN research, resulting in more and more
progress, and even more amazing products.
Biological Neurons
• Cells found in animal cerebral cortexes (brain)
• A cell body contains the nucleus and cell’s complex components.
• Branching extensions called dendrites
• One very long extension called the axon.
• The axon is typically much longer than the cell body.
• Near its extremity, the axon splits off into many branches called telodendria.
• At the tips of these branches are structures called synaptic terminals (synapses), which are connected
to the dendrites of other neurons.
• Biological neurons receive short electrical impulses called signals from other neurons
via these synapses.
• When a neuron receives a sufficient number of signals from other neurons within a few
milliseconds, it fires its own signals.
• Individual biological neurons are organized in a vast network of
billions of neurons, each neuron typically connected to thousands of
other neurons.
• Highly complex computations can be performed by a vast network of
fairly simple neurons
• Neurons are often organized in consecutive layers
Logical Computations with Neurons
• Artificial neuron has one or more binary (on/off) inputs and one
binary output.
• The artificial neuron activates its output when more than a certain
number of its inputs are active.
Examples: assume that a neuron is activated
when at least two of its inputs are active.
• Identity function
• If neuron A is activated, then
neuron C gets activated as well
(since it receives two input
signals from neuron A)
• If neuron A is off, then neuron C
is off as well
• Logical AND: neuron C is activated only when both neurons A and B are activated.
• Logical OR: neuron C gets activated if either neuron A or neuron B is activated (or both).
• Inhibitory connection: neuron C is activated only if neuron A is active and neuron B is off.
If neuron A is active all the time, then you get a logical NOT: neuron C is active when
neuron B is off, and vice versa.
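A minimal Python sketch (not from the slides) of these logical computations, assuming a neuron fires when the weighted sum of its input signals reaches 2:

# Assumption: a neuron fires (outputs 1) when its weighted input sum reaches the threshold of 2.
def fires(inputs, weights, threshold=2):
    return int(sum(i * w for i, w in zip(inputs, weights)) >= threshold)

A, B = 1, 0   # example states of the input neurons (1 = active, 0 = off)

identity = fires([A], [2])             # C mirrors A (A sends two signals to C)
logical_and = fires([A, B], [1, 1])    # C fires only when both A and B are active
logical_or = fires([A, B], [2, 2])     # C fires when A or B (or both) is active
a_and_not_b = fires([A, B], [2, -2])   # C fires only when A is active and B is off (inhibitory link)

print(identity, logical_and, logical_or, a_and_not_b)   # 1 0 1 1 for A=1, B=0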
The Perceptron
• The Perceptron is a simple ANN
architecture, invented in 1957
by Frank Rosenblatt.
• It is based on a slightly different artificial neuron called a threshold logic
unit (TLU), or linear threshold unit (LTU).
• Inputs and output are numbers (instead of binary on/off values) and each
input connection is associated with a weight.
• The TLU computes a weighted sum of its inputs, then applies a step
function to that sum and outputs the result.
• Most common step functions used in Perceptron:
• Heaviside step function and sign function
• A single TLU can be used for simple linear binary classification.
• It computes a linear combination of the inputs and if the result
exceeds a threshold, it outputs the positive class or else outputs the
negative class.(like a Logistic Regression classifier)
• A Perceptron is composed of a single layer of TLUs, with each TLU
connected to all the inputs.
• When all the neurons in a layer are connected to every neuron in the
previous layer (i.e., its input neurons), it is called a fully connected
layer or a dense layer.
• Each input is sent to every TLU
with special passthrough neurons
called input neurons which output
whatever input they are fed.
• The input neurons form the
input layer.
• An extra bias feature is added (x0 =
1): represented using a special type
of neuron called a bias neuron,
which outputs 1 all the time.
• A Perceptron with two inputs and three outputs is shown in the diagram.
• This Perceptron can classify instances simultaneously into three different
binary classes, which makes it a multioutput classifier.
Computing the outputs of a fully connected layer

• X represents the matrix of input features. It has one row per instance, one
column per feature.
• The weight matrix W contains all the connection weights except for the
ones from the bias neuron.
• It has one row per input neuron and one column per artificial neuron in the
layer.
• The bias vector b contains all the connection weights between the bias
neuron and the artificial neurons. It has one bias term per artificial neuron.
• The function ϕ is called the activation function: when the artificial neurons
are TLUs, it is a step function
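• Putting these pieces together, the output of a fully connected layer can be written as
hW,b(X) = ϕ(XW + b)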
How is a Perceptron trained?
• Hebb’s rule largely inspired the Perceptron training algorithm
proposed by Frank Rosenblatt.
• When a biological neuron often triggers another neuron, the
connection between these two neurons grows stronger.
• “Cells that fire together, wire together.”
• Hebbian learning: connection weight between two neurons is
increased whenever they have the same output.
• Perceptrons are trained using a variant of this rule that considers the
network’s error; it reinforces connections that help reduce the error.
• The Perceptron is fed one training instance at a time, and for each
instance it makes its predictions.
• For every output neuron that produced a wrong prediction, it
reinforces the connection weights from the inputs that would have
contributed to the correct prediction.
Perceptron learning rule (weight update)
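• In standard notation (the equation itself did not survive on the slide), the weight update rule is:
wi,j(next step) = wi,j + η (yj − ŷj) xi
where xi is the i-th input value, ŷj is the output of the j-th output neuron, yj is its target output, and η is the learning rate.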
• OR Gate and AND Gate perceptron calculations problems.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]             # petal length, petal width
y = (iris.target == 0).astype(int)   # Iris Setosa? (np.int is deprecated; use int)
per_clf = Perceptron()
per_clf.fit(X, y)
y_pred = per_clf.predict([[2, 0.5]])
Nice-to-know
• Perceptron learning algorithm strongly resembles Stochastic Gradient
Descent.
• Scikit-Learn’s Perceptron class is equivalent to using an SGDClassifier
with the following hyperparameters: loss="perceptron",
learning_rate="constant", eta0=1 (the learning rate), and
penalty=None (no regularization).
• Perceptrons do not output a class probability; rather, they make
predictions based on a hard threshold.
• The perceptron is a basic building block in neural networks, but its
inability to handle non-linear problems, lack of depth, and
convergence issues mean that it's limited in complex, real-world
applications.
• These limitations are addressed in modern neural network
architectures with Multi-Layer Perceptron
• XOR Gate

• Linear classification is not possible as data is non-linearly separable.


• To solve the XOR problem, a more complex structure is needed, such as a
multi-layer perceptron (MLP) or neural network with at least one hidden
layer. The hidden layer allows the network to learn more complex,
non-linear decision boundaries, which can handle the XOR problem.
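A minimal Keras sketch (assumptions: TensorFlow is installed; the layer size, learning rate and number of epochs are illustrative) of an MLP with one hidden layer solving XOR:

# One hidden layer gives the non-linearity needed for XOR.
import numpy as np
from tensorflow import keras

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([0, 1, 1, 0], dtype=np.float32)

model = keras.Sequential([
    keras.layers.Dense(4, activation="tanh", input_shape=(2,)),   # hidden layer
    keras.layers.Dense(1, activation="sigmoid")                   # binary output
])
model.compile(loss="binary_crossentropy", optimizer=keras.optimizers.SGD(learning_rate=0.5))
model.fit(X, y, epochs=1000, verbose=0)
print(model.predict(X).round())   # should approach [[0], [1], [1], [0]] once trained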
Multi-Layer Perceptron and Backpropagation
• A MLP is composed of
• one input layer
• one or more layers of TLUs, called hidden layers, and
• one final layer of TLUs called the output layer.
• lower layers: close to the input layer.
• upper layers: close to the outputs.
• Every layer except the output layer includes a bias neuron and is fully
connected to the next layer.
• feedforward neural network(FNN): signal flows only in one direction
(from the inputs to the outputs).
• deep neural network(DNN): When an ANN contains a deep stack of
hidden layers.
• To train MLPs- backpropagation training algorithm is used(Gradient
Descent)
• An efficient technique for computing the gradients automatically: in just two
passes through the network (one forward, one backward)
• The backpropagation algorithm can find out how each connection
weight and each bias term should be tweaked in order to reduce the
error.
Nice-to-know
• Gradient Descent is an optimization algorithm used to minimize a
function by iteratively moving towards the minimum of the function.
• Used in machine learning
to adjust the parameters of models (e.g., weights in neural networks)
to reduce the error between predictions and actual outputs.
• How it works: Finds the direction in which a function decreases the
most and follows that direction to minimize the function.
• What it requires: two things, a direction (the gradient) and a learning rate (the step size).
• Update Rule: θ=θ−α∇J(θ)
• where α is the learning rate, and ∇J(θ) is the gradient of the cost function.
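A tiny sketch of the update rule in action, on the illustrative cost J(θ) = (θ − 3)², whose gradient is 2(θ − 3):

theta = 0.0    # initial parameter value
alpha = 0.1    # learning rate
for _ in range(100):
    grad = 2 * (theta - 3)         # gradient of the cost at the current theta
    theta = theta - alpha * grad   # update rule: theta = theta - alpha * grad
print(theta)   # converges towards 3.0, the minimum of J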
Nice-to-know
• Types of Gradient Descent:
1. Batch Gradient Descent:
1. Uses the entire dataset to compute the gradient at
each step.
2. Advantages: Stable convergence.
3. Disadvantages: Slow for large datasets.
2. Stochastic Gradient Descent (SGD):
1. Uses a single data point to compute the gradient for
each step.
2. Advantages: Faster updates, useful for large datasets.
3. Disadvantages: Noisy updates can lead to
convergence to suboptimal points.
3. Mini-Batch Gradient Descent:
1. Uses a small subset (mini-batch) of the dataset to compute the gradient.
2. Advantages: Balances speed and accuracy; reduces noise while still being computationally efficient.
3. Disadvantages: Needs tuning for batch size.
• Variants (Adaptive Methods):
• Momentum: Accelerates convergence by using past gradients.
• Adam (Adaptive Moment Estimation): Adapts the learning rate for each parameter based on past gradients and their magnitudes.
• RMSProp: Scales the learning rate based on recent gradients to stabilize updates.
Algorithm flow
1. Handles one mini-batch at a time, and it goes through the full
training set multiple times. Each pass is called an epoch.
2. Each mini-batch is passed to the network’s input layer, which just
sends it to the first hidden layer.
a. computes the output of all the neurons in this layer.
b. Pass the result on to the next layer, its output is computed and passed to
the next layer, and so on until we get the output of the last layer, the
output layer.
c. This is the forward pass: it makes predictions, except all intermediate
results are preserved since they are needed for the backward pass.
3. Next, the algorithm measures the network’s output error
• it uses a loss function that compares the desired output and the actual output
of the network, and returns some measure of the error.
4. Computes how much each output connection contributed to the
error.
• by applying the chain rule, which makes this step fast and precise.
5. The algorithm then measures how much of these error
contributions came from each connection in the layer below
• using the chain rule until the algorithm reaches the input layer.
6. This reverse pass efficiently measures the error gradient across all
the connection weights in the network by propagating the error
gradient backward through the network
7. Finally, the algorithm performs a Gradient Descent step to tweak all
the connection weights in the network, using the error gradients it
just computed.
• Summary:
• For each training instance the backpropagation algorithm
a. first makes a prediction (forward pass),
b. measures the error,
c. then goes through each layer in reverse to measure the error
contribution from each connection (reverse pass), and
d. finally slightly tweaks the connection weights to reduce the error
(Gradient Descent step).
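A compact numpy sketch (not from the slides; network size, seed and learning rate are illustrative) of the forward pass, backward pass and Gradient Descent step for a tiny network:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # mini-batch of inputs
y = np.array([[0.], [1.], [1.], [0.]])                   # targets (XOR, for illustration)

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))       # hidden-layer weights and biases
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))       # output-layer weights and biases
eta = 0.5                                                # learning rate

for epoch in range(10000):
    # Forward pass: intermediate results (h) are kept for the backward pass
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)

    # Backward pass: propagate the error gradient with the chain rule
    d_out = (y_hat - y) * y_hat * (1 - y_hat)            # gradient at the output layer (squared-error loss)
    d_hid = (d_out @ W2.T) * h * (1 - h)                 # gradient at the hidden layer

    # Gradient Descent step on every weight and bias
    W2 -= eta * h.T @ d_out
    b2 -= eta * d_out.sum(axis=0, keepdims=True)
    W1 -= eta * X.T @ d_hid
    b1 -= eta * d_hid.sum(axis=0, keepdims=True)

print(y_hat.round())   # predictions should approach the targets (depends on the random initialization)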
Nice-to-know
• An activation function in a neural network determines whether a
neuron should be activated (fired) or not based on the input it
receives.
• Why is it used?
• Non-linearity: Without activation functions, the neural network would only
perform linear transformations.
• Decision making: By using different types of activation functions, the network
can learn and approximate any function, enabling it to solve both linear and
non-linear problems.
Activation functions
• Backpropagation algorithm works well with many activation functions
• Logistic function
✔ output value ranges from 0 to 1
• The hyperbolic tangent function tanh(z) = 2σ(2z) – 1
✔ S-shaped, continuous, and differentiable
✔ output value ranges from –1 to 1
✔ tends to make each layer’s output more or less centered around 0 at
the beginning of training.
✔ helps speed up convergence.
• The Rectified Linear Unit function: ReLU(z) = max(0, z)
✔ continuous but not differentiable at z = 0
✔ Derivative is 0 for z < 0
✔ fast to compute
✔ does not have a maximum output value
✔ helps reduce some issues during Gradient Descent
• Sigmoid activation function:1 / (1 + exp(–z))
✔ Outputs value between 0 and 1
✔ Used for binary classification
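A short numpy sketch of the activation functions listed above and their derivatives:

import numpy as np

def sigmoid(z):    return 1.0 / (1.0 + np.exp(-z))       # output in (0, 1)
def d_sigmoid(z):  s = sigmoid(z); return s * (1 - s)

def tanh(z):       return np.tanh(z)                     # output in (-1, 1); tanh(z) = 2*sigmoid(2z) - 1
def d_tanh(z):     return 1 - np.tanh(z) ** 2

def relu(z):       return np.maximum(0, z)               # not differentiable at z = 0
def d_relu(z):     return (z > 0).astype(float)          # derivative is 0 for z < 0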
(Figures on the slides: plots of the activation functions and their derivatives, and a summary table of activation functions.)
Nice-to-know
• Why do we need activation functions?
• When several linear transformations are chained, all you get is a
linear transformation
• f(x) = 2 x + 3 and g(x) = 5 x – 1
• f(g(x)) = 2(5 x - 1) + 3 = 10 x + 1.
• So if you don’t have some non-linearity between layers, then even a
deep stack of layers is equivalent to a single layer: you cannot solve
very complex problems with that.
Regression MLPs
• MLPs are used for regression tasks
• To predict a single value, a single output neuron (the predicted value) is
needed (e.g., the price of a house given many of its features).
• For multivariate regression (i.e., to predict multiple values at once), -
one output neuron per output dimension.
• Example: to locate the center of an object on an image.
• need to predict 2D coordinates, so you need two output neurons.
• To place bounding box around the object – two more neurons (width and
height of the object)- i.e. 4 output neurons.
When building an MLP for regression
• An activation function for the output neurons is generally not needed, so they
are free to output any range of values.
• To make the output always positive, use
• the ReLU activation function, or
• the softplus activation function in the output layer.
• To make the predictions fall within a given range of values, use
a. the logistic function; range of 0 to 1 or
b. the hyperbolic tangent; range of –1 to 1.
• The loss function is essential for training neural networks, as it serves
as the feedback mechanism that helps the model learn by minimizing
the error between predictions and actual values.
• It quantifies how well or poorly the model is performing.
• Types of loss functions during training:
a. Mean squared error (MSE): commonly used when predicting continuous values.
b. Mean absolute error (MAE): preferred if many outliers are present in the training set.
c. Huber loss: a combination of both (less sensitive to outliers than
MSE but smoother than MAE near the minimum).
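A short numpy sketch of the three regression losses (in Keras these correspond to loss="mse", loss="mae" and keras.losses.Huber()):

import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    err = y_true - y_pred
    small = np.abs(err) <= delta                     # quadratic inside the delta band, linear outside
    return np.mean(np.where(small, 0.5 * err ** 2, delta * (np.abs(err) - 0.5 * delta)))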
Classification MLPs
• MLPs can also be used for classification tasks.
• For a binary classification problem- a single output neuron using the
logistic activation function.
• MLPs can also easily handle multilabel binary classification tasks using
sigmoid activation function.
Example:
• An email classification system that predicts whether each incoming
email is ham or spam
• Also simultaneously predicts whether it is an urgent or non-urgent
email.
• For this case, two output neurons are needed
• using the logistic activation function:
• the first would output the probability that the email is spam and
• the second would output the probability that it is urgent.
• In general, dedicate one output neuron to each positive class.
• Output probabilities do not necessarily add up to one.
• This lets the model output any combination of labels: you can have
non-urgent ham, urgent ham, non-urgent spam, and perhaps even
urgent spam.
• If each instance can belong only to a single class, out of 3 or more
possible classes (e.g., classes 0 through 9 for digit image classification)
• then one output neuron per class is needed
• use the softmax activation function for the whole output layer.
• The softmax function will ensure that all the estimated probabilities
are between 0 and 1 and that they add up to one
• This is called multiclass classification.
• Categorical cross-entropy loss function is used for multiclass
classification.
Nice-to-know
• The softmax activation function is used in the output layer of a
neural network.
• It allows the model to output probabilities for each class, making it
suitable for multi-class classification tasks, where each output
represents the likelihood of a particular class.
• Ex: identification of digits
Nice-to-know
• Categorical Cross-Entropy is a loss function used in multi-class
classification tasks where the target labels are one-hot encoded (i.e.,
a single class is 1 and the rest are 0).
• It measures the difference between the predicted probability
distribution (from the softmax output) and the actual distribution
(the true labels).
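A short numpy sketch of softmax and categorical cross-entropy (the example values are illustrative):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()               # probabilities: between 0 and 1, summing to 1

def categorical_cross_entropy(y_true_one_hot, y_pred_probs, eps=1e-12):
    return -np.sum(y_true_one_hot * np.log(y_pred_probs + eps))

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)              # roughly [0.66, 0.24, 0.10]
y_true = np.array([1, 0, 0])         # one-hot: the true class is class 0
print(probs, categorical_cross_entropy(y_true, probs))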
Popular libraries for DL
1. Keras
2. TensorFlow
3. PyTorch
4. Caffe
5. Theano (an early Keras backend; no longer actively developed)
6. CNTK (Microsoft Cognitive Toolkit)
7. JAX
8. Fastai
9. MXNet
Implementing MLPs with Keras
• Keras is a high-level Deep Learning API that allows you to easily build, train,
evaluate and execute all sorts of neural networks.
• Its documentation is available at https://keras.io.
• install TensorFlow in Anaconda:
Open the Anaconda Prompt
conda create -n tensorflow_env python=3.11
conda activate tensorflow_env
conda install -c conda-forge tensorflow
pip install --ignore-installed --upgrade tensorflow
conda install spyder
• https://stackoverflow.com/questions/46568913/tensorflow-import-error-no-module-named-tensorflow
• To test your installation
• import tensorflow as tf
• from tensorflow import keras
• tf.__version__
• keras.__version__
Building an Image Classifier Using the
Sequential API
• Dataset: Fashion MNIST (images represent fashion items)
• Using Keras to Load the Dataset
• fashion_mnist = keras.datasets.fashion_mnist
• (X_train_full, y_train_full), (X_test, y_test)
=fashion_mnist.load_data()
Nice-to-know: displaying the Fashion MNIST data
• shape and data type of the training set:

• split into a training set and a test set, but there is no validation set
• scale the pixel intensities down to the 0-1 range by dividing them by
255.0
• list of class names

• For example, the first image in the training set represents a coat:
• Creating the Model Using the Sequential API

• The first line creates a Sequential model-composed of a single stack


of layers, connected sequentially. This is called the sequential API.
• Next, build the first layer and add it to the model.
• It is a Flatten layer whose role is to convert each input image into a 1D array:
if it receives input data X, it computes X.reshape(-1, 28 * 28).
• This layer does not have any parameters- does some simple preprocessing.
Since it is the first layer in the model, you should specify the input_shape: this
does not include the batch size, only the shape of the instances.
• Alternatively, you could add a keras.layers.InputLayer as the first layer, setting
input_shape=[28, 28].
• Next, a Dense hidden layer with 300 neurons- use the ReLU activation
function.
• Each Dense layer manages its own weight matrix, containing all the
connection weights between the neurons and their inputs.
• It also manages a vector of bias terms (one per neuron).
• When it receives some input data, it computes ϕ(XW + b), where ϕ is its activation function (ReLU here).
• Next,add a second Dense hidden layer with 100 neurons, also using
the ReLU activation function.
• Finally, we add a Dense output layer with 10 neurons (one per class),
using the softmax activation function.
• Specifying activation="relu" is equivalent to
activation=keras.activations.relu.
• Instead of adding the layers one by one as we just did, you can pass a
list of layers when creating the Sequential model:
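A sketch of the model just described (the code image did not survive extraction; this follows the standard Sequential API):

from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),    # converts each image to a 1D array
    keras.layers.Dense(300, activation="relu"),    # first hidden layer
    keras.layers.Dense(100, activation="relu"),    # second hidden layer
    keras.layers.Dense(10, activation="softmax")   # output layer: one neuron per class
])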
• model’s summary() method displays all the model’s layers

• the first hidden layer has 784 × 300 connection weights, plus 300 bias
terms, which adds up to 235,500 parameters
• can easily get a model’s list of layers, to fetch a layer by its index, or
you can fetch it by name:
• All the parameters of a layer can be accessed using its get_weights()
and set_weights() method
• Compiling the Model:
• call its compile() method to specify the loss function
• and the optimizer to use and extra metrics also
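A sketch of the compile() call for this classifier (continuing from the model above):

model.compile(loss="sparse_categorical_crossentropy",   # labels are sparse class indices (0-9)
              optimizer="sgd",                           # plain Stochastic Gradient Descent
              metrics=["accuracy"])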

• Using loss="sparse_categorical_crossentropy" is equivalent to


loss=keras.losses.sparse_categorical_crossentropy.
• optimizer="sgd" is equivalent to optimizer=keras.optimizers.SGD()
• metrics=["accuracy"] is equivalent to
metrics=[keras.metrics.sparse_categorical_accuracy].
• Training and Evaluating the Model
• The neural network is trained.
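A sketch of the training call, assuming a validation set is split off from the full training set (the 5,000-instance split is illustrative) and pixel values are scaled to the 0-1 range:

X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

history = model.fit(X_train, y_train, epochs=50,
                    validation_data=(X_valid, y_valid))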
• At each epoch during training, Keras displays
the number of instances processed
the mean training time per sample, the loss and accuracy (or any
other extra metrics specified), both on the training set and the
validation set.
• In this example, the training loss went down (a good sign), and the validation
accuracy reached 87.28% after 50 epochs, not too far from the
training accuracy, so there does not seem to be much overfitting
going on.
• The fit() method returns a History object containing
the training parameters (history.params),
the list of epochs it went through (history.epoch), and
a dictionary (history.history) containing the loss and extra metrics it
measured at the end of each epoch on the training set and on the
validation set.
• create a Pandas DataFrame using this dictionary and call plot()
method to get the learning curves
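A sketch of plotting the learning curves from history.history (assumes pandas and matplotlib are installed):

import pandas as pd
import matplotlib.pyplot as plt

pd.DataFrame(history.history).plot(figsize=(8, 5))   # loss and accuracy per epoch
plt.grid(True)
plt.gca().set_ylim(0, 1)                             # accuracy and loss share the 0-1 range here
plt.show()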
• The training and validation accuracy steadily increase during training
• The training and validation loss decreases which means that there is
not too much overfitting.
• The model performed better on the validation set than on the
training set at the beginning of training.
• The training set performance beats the validation performance, as is
generally the case when you train for long enough.
• The model has not quite converged yet, as the validation loss is still
going down, so should continue training
• Once you are satisfied with your model’s validation accuracy,
evaluate it on the test set to estimate the generalization error before
you deploy the model to production.
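A sketch of the evaluation step:

model.evaluate(X_test / 255.0, y_test)   # returns the test loss and accuracy (pixels scaled as for training)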

• common to get slightly lower performance on the test set than on the
validation set, because the hyperparameters are tuned on the
validation set, not the test set.
• Using the Model to Make Predictions
• use the model’s predict() method to make predictions on new
instances.
• Note:Since we don’t have actual new instances, we will just use the
first 3 instances of the test set:
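A sketch of the prediction step (predict_classes() is deprecated in recent Keras, so argmax over the predicted probabilities is used here):

import numpy as np

X_new = X_test[:3] / 255.0             # "new" instances: the first 3 test images, scaled like the training data
y_proba = model.predict(X_new)         # one row of class probabilities per instance
y_pred = np.argmax(y_proba, axis=1)    # most likely class index for each instance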
• the classifier actually classified all three images correctly
https://forms.gle/mD9qn7kwrrc63sCB7
Building a Regression MLP Using the
Sequential API
• Dataset: California housing problem to tackle it using a regression
neural network.
• use Scikit-Learn’s fetch_california_housing() function to load the data
• The output layer has a single neuron and uses no activation function,
and the loss function is the mean squared error.
• Since the dataset is quite noisy, we just use a single hidden layer with
fewer neurons than before, to avoid overfitting
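A sketch of this regression MLP (the train/validation split and StandardScaler preprocessing are the usual choices; they did not survive on the slides):

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(housing.data, housing.target)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)

model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]),   # single hidden layer
    keras.layers.Dense(1)                                                       # no activation: free output range
])
model.compile(loss="mean_squared_error", optimizer="sgd")
history = model.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid))
mse_test = model.evaluate(X_test, y_test)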
Building Complex Models Using the
Functional API
• A non-sequential neural network is
a Wide & Deep neural network.
• Wide: inputs directly connected to
the output
• enables the model to memorize
simple patterns
• Deep: a stack of layers that
processes the inputs.
• enables the model to generalize
from patterns.
• The neural network learns both deep patterns (using the deep path)
and simple rules (through the short path).
• In contrast, a regular MLP forces all the data to flow through the full
stack of layers, thus simple patterns in the data may end up being
distorted by this sequence of transformations.
• Input Layers (input_wide and
input_deep):
wide connects the input directly to the
output.
deep passes through hidden layers.
• Hidden Layers (hidden1 and
hidden2):
part of the deep path, allowing the
model to learn complex patterns.
• Concatenation:
The wide and deep paths are merged
before feeding into the output layer.
• Output Layer:
A single neuron is used for
regression.
• To send different subsets of input features through the wide and
deep paths while sharing some features between them
• use the Functional API to split the input features and route them
through the appropriate paths.
Example: to send 5 features through the wide path (features 0
to 4) and 6 features through the deep path (features 2 to 7),
with features 2, 3, and 4 going through both paths.
• Lambda Layers for Feature
Selection:
✔ use keras.layers.Lambda to extract
the desired subsets of features for
both the wide and deep paths.
• Wide Path (input_wide):
Selects features 0 to 4 (x[:, :5]).
• Deep Path (input_deep):
Selects features 2 to 7 (x[:, 2:8]).
• Concatenation:
After processing the inputs in the
wide and deep paths, concatenate
the outputs and feed them into the
final output layer.
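A Functional API sketch of this wide & deep model, using Lambda layers to split one 8-feature input (as in the California housing data) into the wide path (features 0-4) and the deep path (features 2-7):

from tensorflow import keras

input_ = keras.layers.Input(shape=[8])
wide = keras.layers.Lambda(lambda x: x[:, :5])(input_)    # features 0 to 4
deep = keras.layers.Lambda(lambda x: x[:, 2:8])(input_)   # features 2 to 7

hidden1 = keras.layers.Dense(30, activation="relu")(deep)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.Concatenate()([wide, hidden2])      # merge the wide and deep paths
output = keras.layers.Dense(1)(concat)                    # single neuron for regression

model = keras.Model(inputs=[input_], outputs=[output])
model.compile(loss="mse", optimizer="sgd")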
Handling Multiple inputs
• using TensorFlow's Functional API: can handle multiple inputs to build
networks that process different types of data independently.
• For example, a model that takes in both structured data (e.g., tabular data)
and unstructured data (e.g., images or text) as inputs, processes them
differently, and combines them for final predictions.
• Program(next slide)
• This model accepts two different inputs:
structured/tabular data (like house features).
text data (like a description).
• The model will process each input independently through separate paths
and combine their results before making the final prediction.
• Independent Processing:
The structured data goes through two
dense layers (dense1_structured,
dense2_structured).
The text data goes through an
embedding layer followed by a Flatten
operation.
• Concatenation:
The outputs from both the structured
and text paths are concatenated using
Concatenate().
• Multiple Inputs:
input_structured: accepts structured/tabular data with 5 features.
input_text: accepts text input (as a sequence of 10 integers, representing word indices).
• Embedding Layer:
For text data, use an Embedding layer to convert word indices into dense vectors of size 8.
The output of the embedding layer is then flattened to connect with the rest of the model.
• Final Output:
The combined features from both inputs are passed to the output layer, which is a single
neuron with a sigmoid activation for binary classification.
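A sketch of this two-input model (the "Program" slide did not survive extraction; layer sizes and the vocabulary size of 1000 are illustrative):

from tensorflow import keras

input_structured = keras.layers.Input(shape=[5], name="structured")
input_text = keras.layers.Input(shape=[10], name="text")

# Structured path: two dense layers
dense1_structured = keras.layers.Dense(32, activation="relu")(input_structured)
dense2_structured = keras.layers.Dense(16, activation="relu")(dense1_structured)

# Text path: embedding (word index -> dense vector of size 8), then flatten
embedded = keras.layers.Embedding(input_dim=1000, output_dim=8)(input_text)
flat_text = keras.layers.Flatten()(embedded)

# Combine both paths and produce a binary prediction
concat = keras.layers.Concatenate()([dense2_structured, flat_text])
output = keras.layers.Dense(1, activation="sigmoid")(concat)

model = keras.Model(inputs=[input_structured, input_text], outputs=[output])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])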
Handling Multiple Outputs- Auxiliary Output
for Regularization
• When handling multiple outputs in a model using TensorFlow's
Functional API, a network that makes multiple predictions based on
different objectives can be built.
• Each output can have its own loss function, and the model can learn
to optimize all objectives simultaneously.
• Example: to build a multi-output model for a housing price prediction
task. The model will have:
Output 1: A prediction for the house price (a regression task).
Output 2: A binary classification indicating whether the house is in a
high-price range.
• Multiple Outputs:
main_output: predicts the house price from the final hidden layer.
auxiliary_output: also predicts the house price, but from an earlier hidden layer (hidden1).
• Auxiliary Output for Regularization:
The auxiliary output is connected to hidden1, which helps regularize the training process.
By predicting the same target (house prices), the auxiliary output ensures that the earlier
layers learn useful representations, reducing overfitting.
• Model Training:
During training, both outputs
contribute to the total loss.
This improves gradient flow
and helps the model learn
useful representations early in
the network.

• Loss Function with Weights:
The loss for the main output is weighted more heavily (0.9) because it is the primary task.
The loss for the auxiliary output is weighted less (0.1) as it mainly serves as a regularizer.
• Predictions:
After training, the model can generate predictions from both the main and auxiliary outputs,
though typically only the main output would be used for real predictions.
• Benefits of Auxiliary Output:
Improved Gradient Flow: The auxiliary output provides additional
gradient signals to the earlier layers, which can prevent the vanishing
gradient problem.
Regularization: The auxiliary output acts as a regularizer by ensuring
that earlier layers learn useful representations, reducing the risk of
overfitting.
Faster Convergence: This approach often leads to faster convergence
during training because the network receives more feedback at
intermediate layers.
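A sketch of the multi-output model with an auxiliary output (assumes 8 input features; layer sizes are illustrative):

from tensorflow import keras

input_ = keras.layers.Input(shape=[8])
hidden1 = keras.layers.Dense(30, activation="relu")(input_)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)

main_output = keras.layers.Dense(1, name="main_output")(hidden2)          # price from the last hidden layer
auxiliary_output = keras.layers.Dense(1, name="auxiliary_output")(hidden1) # price from an earlier layer

model = keras.Model(inputs=[input_], outputs=[main_output, auxiliary_output])
model.compile(loss=["mse", "mse"],        # one loss per output
              loss_weights=[0.9, 0.1],    # main task weighted more heavily
              optimizer="sgd")

# Training passes the same target for both outputs:
# history = model.fit(X_train, (y_train, y_train), epochs=20,
#                     validation_data=(X_valid, (y_valid, y_valid)))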
Building Dynamic Models Using the
Subclassing API
• The Subclassing API in TensorFlow/Keras allows for more flexibility in
building dynamic models compared to the Sequential or Functional
APIs.
• By subclassing tf.keras.Model
• custom behavior for the forward pass can be designed
makes it easier to build models where the architecture might change
depending on the input or other conditions
(e.g., recurrent neural networks, reinforcement learning models,
etc.).
• Custom Flexibility:
By subclassing the Model class, custom
behavior can be introduced.
For example, you could dynamically change
how the model processes the input based on
the input size, batch size, or even based on
previous states.

• Subclassing the Model:


• Training:
define a class MyCustomModel that subclasses
keras.Model. • The model is compiled and trained in the same
The __init__() method initializes the layers (Dense layers in way as in the Sequential or Functional APIs,
this case), while the call() method defines the forward pass.
• call() Method: using the fit() method, and evaluated using
define the forward pass (how data flows through the evaluate().
network).
can be as dynamic as you need, allowing conditional logic,
• Prediction:
loops, etc.
• After training, you can use predict() to
Here, we pass the input data through two hidden layers
and then to the output layer. generate predictions, just as in other Keras
models.
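A sketch of the MyCustomModel class described above (the code slide did not survive extraction; layer sizes are illustrative):

from tensorflow import keras

class MyCustomModel(keras.Model):
    def __init__(self, units=30, activation="relu", **kwargs):
        super().__init__(**kwargs)                  # handles standard args such as name
        self.hidden1 = keras.layers.Dense(units, activation=activation)
        self.hidden2 = keras.layers.Dense(units, activation=activation)
        self.output_layer = keras.layers.Dense(1)   # single-neuron regression output

    def call(self, inputs):
        # The forward pass: conditional logic or loops could be added here if needed.
        x = self.hidden1(inputs)
        x = self.hidden2(x)
        return self.output_layer(x)

model = MyCustomModel()
model.compile(loss="mse", optimizer="sgd")
# model.fit(X_train, y_train, epochs=10)   # trained like any other Keras model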
• Why Use the Subclassing API?
• Flexibility: can implement complex models, including dynamic architectures where the
network topology can change depending on input or internal states.
• Customization: full control over the forward pass and layers, allowing you to implement
unique layers, connections, or operations.
• Research: ideal for experimental models, such as reinforcement learning agents or
generative models where dynamic behavior is necessary.
• Example Use Cases:
• Recurrent Neural Networks (RNNs), where the model processes data sequentially and the
behavior can change with each timestep.
• Conditional Logic: models that make decisions based on input, like branching neural
networks or attention-based models.
• Custom Losses and Metrics: define custom layers and calculations during training, which
may depend on the internal state of the model.
Saving and Restoring a Model
• Saving a trained Keras model:
• model.save("my_keras_model.h5")
• Keras will save
the model’s architecture (including every layer’s hyperparameters)
the value of all the model parameters for every layer (e.g., connection
weights and biases), using the HDF5 format.
saves the optimizer (including its hyperparameters and any state it
may have).
• Loading the model:
• model = keras.models.load_model("my_keras_model.h5")
• When training lasts several hours on large datasets:
❖ not only save your model at the end of training, but also save
checkpoints at regular intervals during training.
• But how can you tell the fit() method to save checkpoints?
• The answer is: using callbacks.
Using Callbacks
• The fit() method accepts a callbacks argument that lets you specify a list of
objects that Keras will call during training:
at the start and end of training,
at the start and end of each epoch, and
even before and after processing each batch.
• the ModelCheckpoint callback saves checkpoints of a model at
regular intervals during training, by default at the end of each epoch:
• if the validation set is used during training
• can set save_best_only=True when creating the ModelCheckpoint.
• This will only save the model when its performance on the validation
set is the best.
• need not worry about training for too long and overfitting the
training set: restore the last model saved after training, and this will
be the best model on the validation set.
• This is a simple way to implement early stopping
• Early stopping using the EarlyStopping callback.
• It will
interrupt training when it measures no progress on the validation set
for some epochs
optionally roll back to the best model.
• The number of epochs can be set to a large value since training will
stop automatically when there is no more progress.
• EarlyStopping callback will keep track of the best weights and restore
them at the end of training.
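A sketch combining both callbacks (model, X_train, X_valid, etc. are from the earlier examples):

from tensorflow import keras

checkpoint_cb = keras.callbacks.ModelCheckpoint("my_keras_model.h5",
                                                save_best_only=True)          # keep only the best model on the validation set
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10,
                                                  restore_best_weights=True)  # roll back to the best weights

history = model.fit(X_train, y_train, epochs=100,
                    validation_data=(X_valid, y_valid),
                    callbacks=[checkpoint_cb, early_stopping_cb])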

• NOTE: There are many other callbacks available in the keras.callbacks package.
Visualization Using TensorBoard
• Assignment
Fine-tuning Neural Network Parameters
• The flexibility of neural networks is also one of their main drawbacks:
there are many hyperparameters to tweak.
• Even in a simple MLP you can change
❑ the number of layers,
❑ the number of neurons per layer,
❑ the type of activation function to use in each layer,
❑ the weight initialization logic, and much more.
• An option: try many combinations of hyperparameters and see which
one works best on the validation set.
• An approach for this is to use GridSearchCV or RandomizedSearchCV
to explore the hyperparameter space.
• Wrap Keras models in objects that mimic regular Scikit-Learn
regressors.
First step is to create a function that will build and compile a
Keras model, given a set of hyperparameters:

• build_model() creates a Sequential model for univariate regression (only one output neuron),
with the given input shape and number of hidden layers and neurons, and compiles it using an
SGD optimizer configured with the given learning rate.
• An options dict is used so that the input shape is given only to the first layer (note that if
n_hidden=0, the first layer will be the output layer).
• It is good practice to provide reasonable defaults for as many hyperparameters as you can.
Next, let’s create a KerasRegressor based on this build_model()
function:

• KerasRegressor object is a thin wrapper around the Keras model built using
build_model().
• Use this object like a regular Scikit-Learn regressor:
• Train using its fit() method, then evaluate using score() method, and use it to make
predictions using predict() method.
Train hundreds of variants and see which one performs best on the validation
set.
Since there are many hyperparameters, it is preferable to use a randomized
search rather than grid search
• explore the number of hidden layers, the number of neurons and the
learning rate:
• The exploration may last many hours depending on the hardware, the
size of the dataset, the complexity of the model and the value of
n_iter and cv.
• After that, can access the best parameters found, the best score, and
the trained Keras model like this:
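A sketch of the full search (assumes an older TensorFlow where keras.wrappers.scikit_learn.KerasRegressor is available; newer setups use the scikeras package and its KerasRegressor instead; X_train, X_valid, etc. are from the housing example):

import numpy as np
from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV
from tensorflow import keras

def build_model(n_hidden=1, n_neurons=30, learning_rate=3e-3, input_shape=[8]):
    model = keras.models.Sequential()
    options = {"input_shape": input_shape}          # given only to the first layer
    for layer in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation="relu", **options))
        options = {}
    model.add(keras.layers.Dense(1, **options))     # if n_hidden=0, this is the first layer
    model.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=learning_rate))
    return model

keras_reg = keras.wrappers.scikit_learn.KerasRegressor(build_model)

param_distribs = {
    "n_hidden": [0, 1, 2, 3],
    "n_neurons": np.arange(1, 100),
    "learning_rate": reciprocal(3e-4, 3e-2),
}
rnd_search_cv = RandomizedSearchCV(keras_reg, param_distribs, n_iter=10, cv=3)
rnd_search_cv.fit(X_train, y_train, epochs=100,
                  validation_data=(X_valid, y_valid),
                  callbacks=[keras.callbacks.EarlyStopping(patience=10)])

print(rnd_search_cv.best_params_, rnd_search_cv.best_score_)
model = rnd_search_cv.best_estimator_.model   # the trained Keras model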
A few Python libraries you can use to
optimize hyperparameters:
1. Hyperopt: a popular Python library for optimizing over all sorts of
complex search spaces (including real values such as the learning rate, or
discrete values such as the number of layers).
2. Hyperas, kopt or Talos: optimizing hyperparameters for Keras model (the
first two are based on Hyperopt).
3. Scikit-Optimize (skopt): a general-purpose optimization library. The
BayesSearchCV class performs Bayesian optimization using an interface similar
to GridSearchCV.
4. Spearmint: a Bayesian optimization library.
5. Sklearn-Deap: a hyperparameter optimization library based on
evolutionary algorithms, also with a GridSearchCV-like interface.
• Guidelines for choosing
the number of hidden layers
neurons per hidden layer and
selecting good values like Learning Rate, Batch Size and Other
Hyperparameters
Number of Hidden Layers
• An MLP with one hidden layer can model even the most complex
functions provided it has enough neurons.
• Deep networks have a much higher parameter efficiency than shallow
ones: they can model complex functions using exponentially fewer
neurons than shallow nets, allowing them to reach much better
performance with the same amount of training data.
• Deep Neural Networks:
lower hidden layers model low-level structures (e.g., line segments of
various shapes and orientations),
intermediate hidden layers combine these low-level structures to
model intermediate-level structures (e.g., squares, circles), and
highest hidden layers and the output layer combine these intermediate
structures to model high-level structures (e.g., faces).
• This hierarchical architecture helps DNNs
1. converge faster to a good solution,
2. improve their ability to generalize to new datasets.
• Example
❑ already trained a model to recognize faces in pictures
❑ now want to train a new neural network to recognize hairstyles
• Then, kickstart training by reusing the lower layers of the first
network.
network will not have to learn from scratch all the low-level
structures that occur in most pictures
will only have to learn the higher-level structures (e.g., hairstyles).
• This is called transfer learning.
• For many problems you can start with just one or two hidden layers and it
will work fine
e.g., above 97% accuracy on the MNIST dataset using just one hidden layer
above 98% accuracy using two hidden layers with the same total amount of
neurons, in roughly the same amount of training time
• For more complex problems, gradually ramp up the number of hidden
layers, until you start overfitting the training set.
• Very complex tasks, such as large image classification or speech
recognition, typically require networks with dozens of layers and they need
a huge amount of training data.
• However, you will rarely have to train such networks from scratch: it
is much more common to Fine-Tuning.
• Reuse parts of a pretrained network that performs a similar task.
• Training will be a lot faster and require much less data.
Number of Neurons per Hidden Layer
• The number of neurons in the input and output layers is determined by the
type of input and output your task requires.
• Hidden layers
common practice: size them to form a pyramid,
with fewer and fewer neurons at each layer (philosophy: many low-level
features can coalesce into far fewer high-level features)
• Example, a typical neural network for MNIST may have three hidden layers,
• the first with 300 neurons, the second with 200, and the third with 100.
• practice is abandoned now: simply using the same number of neurons in all
hidden layers performs just as well in most cases, or even better
• However, depending on the dataset, it can sometimes help to make the
first hidden layer bigger than the others.
• Try increasing the number of neurons gradually until the network
starts overfitting.
• In general, better to increase the number of layers than the number
of neurons per layer.
• Finding the perfect amount of neurons is still somewhat of a dark art.
• A simple approach:
Pick a model with more layers and neurons than you actually need
then use early stopping to prevent it from overfitting (and other
regularization techniques)
• Nice-to-know:
• This is the “stretch pants” approach:
• instead of wasting time looking for pants that perfectly match your
size, just use large stretch pants that will shrink down to the right size.
Learning Rate, Batch Size and Other
Hyperparameters
• Few hyperparameters, and some tips on how to set them:
• The learning rate:
the most important hyperparameter
The optimal learning rate is about half of the maximum learning rate.
An approach for tuning the learning rate: start with a large value that
makes the training algorithm diverge, then divide this value by 3 and
try again, and repeat until the training algorithm stops diverging.
• It is sometimes useful to reduce the learning rate during training.
• Choosing a better optimizer than Mini-batch Gradient Descent (and
tuning its hyperparameters) is important.
• The batch size has a significant impact on the model’s performance
and the training time.
• the optimal batch size will be lower than 32
• A small batch size ensures that each training iteration is very fast (a
large batch size will give a more precise estimate of the gradients).
• Having a batch size greater than 10 helps take advantage of hardware
and software optimizations (in particular for matrix multiplications), which
will speed up training.
• If Batch Normalization is used, the batch size should not be too small (in
general, no less than 20).
• Choice of the activation function:
• The ReLU activation function will be a good default for all hidden
layers.
• For the output layer, it really depends on your task.
• In most cases, the number of training iterations does not actually
need to be tweaked: just use early stopping instead.
---XXX---
