Mod-1 Part 1
Syllabus
Course Title: Deep Learning
Course Code: 23DS5PCDLG
Semester: V
Total Contact Hours: 50 hours
L-T-P: 4-0-1
Total Credits: 5
• X represents the matrix of input features. It has one row per instance, one
column per feature.
• The weight matrix W contains all the connection weights except for the
ones from the bias neuron.
• It has one row per input neuron and one column per artificial neuron in the
layer.
• The bias vector b contains all the connection weights between the bias
neuron and the artificial neurons. It has one bias term per artificial neuron.
• The function ϕ is called the activation function: when the artificial neurons
are TLUs, it is a step function
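Putting these pieces together, the layer computes h_{W,b}(X) = ϕ(XW + b). A minimal NumPy sketch of this computation (the matrices below are made-up illustrative values; the activation is the step function used by TLUs):

import numpy as np

def step(z):
    return (z >= 0).astype(int)          # step activation used by TLUs

X = np.array([[1.0, 2.0],                # 2 instances, 2 input features
              [3.0, 4.0]])
W = np.array([[0.5, -1.0, 0.2],          # one row per input, one column per neuron
              [0.1,  0.4, -0.3]])
b = np.array([0.0, 0.1, -0.2])           # one bias term per neuron

outputs = step(X @ W + b)                # shape: (2 instances, 3 neurons)
print(outputs)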
How is a Perceptron trained?
• Hebb’s rule largely inspired the Perceptron training algorithm
proposed by Frank Rosenblatt.
• When a biological neuron often triggers another neuron, the
connection between these two neurons grows stronger.
• “Cells that fire together, wire together.”
• Hebbian learning: connection weight between two neurons is
increased whenever they have the same output.
• Perceptrons are trained using a variant of this rule that considers the
network’s error; it reinforces connections that help reduce the error.
• The Perceptron is fed one training instance at a time, and for each
instance it makes its predictions.
• For every output neuron that produced a wrong prediction, it
reinforces the connection weights from the inputs that would have
contributed to the correct prediction.
Perceptron learning rule (weight update)
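The rule referred to here is the standard Perceptron weight update: for input value x_i, target output y_j, prediction ŷ_j, and learning rate η,

    w_{i,j}(next step) = w_{i,j} + η (y_j − ŷ_j) x_i

where w_{i,j} is the connection weight between the i-th input neuron and the j-th output neuron.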
• Worked problems: Perceptron calculations for the OR gate and the AND gate (see the check below).
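As a quick check of the kind of calculation these problems involve (the weights below are hand-picked for illustration, not derived by the learning rule):

def tlu(x1, x2, w1, w2, b):
    return int(w1 * x1 + w2 * x2 + b >= 0)      # TLU with a step activation

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    and_out = tlu(x1, x2, w1=1, w2=1, b=-1.5)   # AND: fires only when both inputs are 1
    or_out = tlu(x1, x2, w1=1, w2=1, b=-0.5)    # OR: fires when at least one input is 1
    print((x1, x2), "AND:", and_out, "OR:", or_out)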
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]                 # petal length, petal width
y = (iris.target == 0).astype(int)       # Iris Setosa? (np.int is removed in recent NumPy; use int)
per_clf = Perceptron()
per_clf.fit(X, y)
y_pred = per_clf.predict([[2, 0.5]])
Nice-to-know
• Perceptron learning algorithm strongly resembles Stochastic Gradient
Descent.
• Scikit-Learn’s Perceptron class is equivalent to using an SGDClassifier
with the following hyperparameters: loss="perceptron",
learning_rate="constant", eta0=1 (the learning rate), and
penalty=None (no regularization).
• Perceptrons do not output a class probability; rather, they make
predictions based on a hard threshold.
• The perceptron is a basic building block in neural networks, but its
inability to handle non-linear problems, lack of depth, and
convergence issues mean that it's limited in complex, real-world
applications.
• These limitations are addressed in modern neural network
architectures such as the Multi-Layer Perceptron (MLP).
• XOR Gate: a single Perceptron cannot learn the XOR function, but an MLP with one hidden layer can (see the sketch below).
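A minimal NumPy sketch with hand-picked weights (not from the slides): the two hidden neurons compute OR and AND, and the output neuron fires only when OR is on and AND is off, i.e. XOR.

import numpy as np

def step(z):
    return (z >= 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: first neuron computes OR, second computes AND
W_hidden = np.array([[1, 1],
                     [1, 1]])
b_hidden = np.array([-0.5, -1.5])

# Output neuron: fires when OR is on and AND is off
W_out = np.array([[1], [-1]])
b_out = np.array([-0.5])

hidden = step(X @ W_hidden + b_hidden)
xor = step(hidden @ W_out + b_out)
print(xor.ravel())                       # [0 1 1 0]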
Building an Image Classifier Using the Sequential API (Fashion MNIST)
• The dataset is already split into a training set and a test set, but there is no
validation set.
• Scale the pixel intensities down to the 0–1 range by dividing them by
255.0.
• list of class names
• For example, the first image in the training set represents a coat:
• Creating the Model Using the Sequential API
• the first hidden layer has 784 × 300 connection weights, plus 300 bias
terms, which adds up to 235,500 parameters
• can easily get a model’s list of layers, to fetch a layer by its index, or
you can fetch it by name:
• All the parameters of a layer can be accessed using its get_weights()
and set_weights() method
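A sketch of the steps above, assuming the standard tf.keras Fashion MNIST example (the 300/100 hidden-layer sizes match the 235,500-parameter count quoted above):

from tensorflow import keras

# Load the dataset; it is already split into a training set and a test set
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()

# Carve out a validation set and scale the pixel intensities to the 0-1 range
X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test / 255.0

class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

# Creating the model using the Sequential API
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),    # 28x28 images -> 784 inputs
    keras.layers.Dense(300, activation="relu"),    # 784*300 + 300 = 235,500 parameters
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax")   # one output neuron per class
])

model.summary()
hidden1 = model.layers[1]                          # fetch a layer by its index...
hidden1 = model.get_layer(hidden1.name)            # ...or fetch it by name
weights, biases = hidden1.get_weights()            # access its parameters (set_weights() modifies them)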
• Compiling the Model:
• call its compile() method to specify the loss function
• and the optimizer to use and extra metrics also
• common to get slightly lower performance on the test set than on the
validation set, because the hyperparameters are tuned on the
validation set, not the test set.
• Using the Model to Make Predictions
• use the model’s predict() method to make predictions on new
instances.
• Note: Since we don’t have actual new instances, we will just use the
first 3 instances of the test set:
• the classifier actually classified all three images correctly
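Continuing the sketch (the model, class_names, and data splits are assumed from the previous block):

import numpy as np

# Compiling the model: specify the loss function, the optimizer, and extra metrics
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])

# Train with a validation set, then evaluate on the test set
history = model.fit(X_train, y_train, epochs=30,
                    validation_data=(X_valid, y_valid))
model.evaluate(X_test, y_test)

# Using the model to make predictions on the first 3 test instances
X_new = X_test[:3]
y_proba = model.predict(X_new)
y_pred = np.argmax(y_proba, axis=-1)               # predicted class indices
print([class_names[index] for index in y_pred])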
Building a Regression MLP Using the
Sequential API
• Dataset: California housing problem to tackle it using a regression
neural network.
• use Scikit-Learn’s fetch_california_housing() function to load the data
• The output layer has a single neuron and uses no activation function,
and the loss function is the mean squared error.
• Since the dataset is quite noisy, we just use a single hidden layer with
fewer neurons than before, to avoid overfitting
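A minimal sketch of this regression MLP, assuming a single 30-neuron hidden layer (the exact layer size is illustrative):

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(housing.data, housing.target)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)

model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]),
    keras.layers.Dense(1)                        # single output neuron, no activation
])
model.compile(loss="mean_squared_error", optimizer="sgd")
history = model.fit(X_train, y_train, epochs=20,
                    validation_data=(X_valid, y_valid))
mse_test = model.evaluate(X_test, y_test)
y_pred = model.predict(X_test[:3])               # predictions for a few instances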
Building Complex Models Using the
Functional API
• One example of a non-sequential neural network is the Wide & Deep
neural network.
• Wide path: the inputs are connected directly to the output; this
enables the model to memorize simple patterns.
• Deep path: a stack of layers processes the inputs; this enables the
model to generalize from patterns.
• The neural network learns both deep patterns (using the deep path)
and simple rules (through the short path).
• In contrast, a regular MLP forces all the data to flow through the full
stack of layers, thus simple patterns in the data may end up being
distorted by this sequence of transformations.
• Input Layers (input_wide and input_deep):
the wide input connects directly to the output;
the deep input passes through the hidden layers.
• Hidden Layers (hidden1 and hidden2):
part of the deep path, allowing the model to learn complex patterns.
• Concatenation:
the wide and deep paths are merged before feeding into the output layer.
• Output Layer:
a single neuron is used for regression.
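A minimal Functional API sketch of this Wide & Deep architecture (the input shapes and hidden-layer sizes are illustrative):

from tensorflow import keras

input_wide = keras.layers.Input(shape=[5], name="input_wide")   # fed straight to the output
input_deep = keras.layers.Input(shape=[6], name="input_deep")   # fed through the hidden layers
hidden1 = keras.layers.Dense(30, activation="relu")(input_deep)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.Concatenate()([input_wide, hidden2])      # merge the wide and deep paths
output = keras.layers.Dense(1)(concat)                          # single neuron for regression
model = keras.models.Model(inputs=[input_wide, input_deep], outputs=[output])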
• To send different subsets of input features through the wide and
deep paths while sharing some features between them
• use the Functional API to split the input features and route them
through the appropriate paths.
Example: to send 5 features through the wide path (features 0
to 4) and 6 features through the deep path (features 2 to 7),
with features 2, 3, and 4 going through both paths.
• Lambda Layers for Feature
Selection:
✔ use keras.layers.Lambda to extract
the desired subsets of features for
both the wide and deep paths.
• Wide Path (input_wide):
Selects features 0 to 4 (x[:, :5]).
• Deep Path (input_deep):
Selects features 2 to 7 (x[:, 2:8]).
• Concatenation:
After processing the inputs in the
wide and deep paths, concatenate
the outputs and feed them into the
final output layer.
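A sketch of this feature-routing setup, using Lambda layers as described above (the hidden-layer sizes are illustrative):

from tensorflow import keras

input_ = keras.layers.Input(shape=[8])                          # all 8 features arrive in one input
input_wide = keras.layers.Lambda(lambda x: x[:, :5])(input_)    # wide path: features 0 to 4
input_deep = keras.layers.Lambda(lambda x: x[:, 2:8])(input_)   # deep path: features 2 to 7
hidden1 = keras.layers.Dense(30, activation="relu")(input_deep)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.Concatenate()([input_wide, hidden2])      # merge both paths
output = keras.layers.Dense(1)(concat)                          # single output neuron for regression
model = keras.models.Model(inputs=[input_], outputs=[output])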
Handling Multiple Inputs
• using TensorFlow's Functional API: can handle multiple inputs to build
networks that process different types of data independently.
• For example, a model that takes in both structured data (e.g., tabular data)
and unstructured data (e.g., images or text) as inputs, processes them
differently, and combines them for final predictions.
• Program(next slide)
• This model accepts two different inputs:
structured/tabular data (like house features).
text data (like a description).
• The model will process each input independently through separate paths
and combine their results before making the final prediction.
• Independent Processing:
The structured data goes through two
dense layers (dense1_structured,
dense2_structured).
The text data goes through an
embedding layer followed by a Flatten
operation.
• Concatenation:
The outputs from both the structured
and text paths are concatenated using
Concatenate().
• Multiple Inputs:
input_structured: accepts structured/tabular data with 5 features.
input_text: accepts text input (a sequence of 10 integers, representing word indices).
• Final Output:
The combined features from both inputs are passed to the output layer, which is
a single neuron with a sigmoid activation for binary classification.
• Embedding Layer:
For text data, use an Embedding layer to convert
word indices into dense vectors of size 8.
The output of the embedding layer is then flattened
to connect with the rest of the model.
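A sketch of such a two-input model; the layer names follow the slides, while the dense-layer sizes and the vocabulary size are assumptions:

from tensorflow import keras

# Structured/tabular input: 5 features
input_structured = keras.layers.Input(shape=[5], name="input_structured")
dense1_structured = keras.layers.Dense(32, activation="relu")(input_structured)
dense2_structured = keras.layers.Dense(16, activation="relu")(dense1_structured)

# Text input: a sequence of 10 word indices
input_text = keras.layers.Input(shape=[10], name="input_text")
embedding = keras.layers.Embedding(input_dim=1000, output_dim=8)(input_text)   # vocabulary size assumed
flattened_text = keras.layers.Flatten()(embedding)

# Concatenate both paths and make the final binary prediction
concat = keras.layers.Concatenate()([dense2_structured, flattened_text])
output = keras.layers.Dense(1, activation="sigmoid")(concat)

model = keras.models.Model(inputs=[input_structured, input_text], outputs=[output])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
# Training passes both inputs: model.fit([X_structured, X_text], y, epochs=...)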
Handling Multiple Outputs: Auxiliary Output for Regularization
• When handling multiple outputs in a model using TensorFlow's
Functional API, a network that makes multiple predictions based on
different objectives can be built.
• Each output can have its own loss function, and the model can learn
to optimize all objectives simultaneously.
• Example: to build a multi-output model for a housing price prediction
task. The model will have:
Output 1: A prediction for the house price (a regression task).
Output 2: A binary classification indicating whether the house is in a
high-price range.
• Auxiliary Output for Regularization:
The auxiliary output is connected to hidden1, which helps regularize the
training process.
By predicting the same target (house prices), the auxiliary output ensures
that the earlier layers learn useful representations, reducing overfitting.
• Multiple Outputs:
main_output: predicts the house price from the final hidden layer.
auxiliary_output: also predicts the house price, but from an earlier hidden
layer (hidden1).
• Model Training:
During training, both outputs
contribute to the total loss.
This improves gradient flow
and helps the model learn
useful representations early in
the network.
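A sketch of the auxiliary-output setup described above (the input shape, hidden-layer sizes, and loss weights are illustrative):

from tensorflow import keras

input_ = keras.layers.Input(shape=[8])                        # e.g. the 8 housing features
hidden1 = keras.layers.Dense(30, activation="relu")(input_)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)

main_output = keras.layers.Dense(1, name="main_output")(hidden2)            # price from the final hidden layer
auxiliary_output = keras.layers.Dense(1, name="auxiliary_output")(hidden1)  # price from hidden1

model = keras.models.Model(inputs=[input_], outputs=[main_output, auxiliary_output])

# Each output gets its own loss; the auxiliary loss is weighted lower since it only regularizes training
model.compile(loss=["mse", "mse"], loss_weights=[0.9, 0.1], optimizer="sgd")
# model.fit(X_train, [y_train, y_train], epochs=20, ...)      # both outputs predict the same target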
Fine-Tuning Neural Network Hyperparameters
• build_model() creates a Sequential model for univariate regression (only one
output neuron), with the given input shape and number of hidden layers and
neurons, and compiles it using an SGD optimizer configured with the given
learning rate.
• The options dict is used to ensure that the input shape is passed only to the
first layer (note that if n_hidden=0, the first layer will be the output layer).
• It is good practice to provide reasonable defaults for as many
hyperparameters as you can.
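A sketch of such a build_model() function (the default values are illustrative):

from tensorflow import keras

def build_model(n_hidden=1, n_neurons=30, learning_rate=3e-3, input_shape=[8]):
    model = keras.models.Sequential()
    options = {"input_shape": input_shape}
    for layer in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation="relu", **options))
        options = {}                                  # only the first layer receives the input shape
    model.add(keras.layers.Dense(1, **options))       # single output neuron (univariate regression)
    optimizer = keras.optimizers.SGD(learning_rate=learning_rate)
    model.compile(loss="mse", optimizer=optimizer)
    return model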
Next, let’s create a KerasRegressor based on this build_model()
function:
• KerasRegressor object is a thin wrapper around the Keras model built using
build_model().
• Use this object like a regular Scikit-Learn regressor:
• Train using its fit() method, then evaluate using score() method, and use it to make
predictions using predict() method.
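A usage sketch, assuming the build_model() function and the California housing splits from the earlier sketches (older tf.keras versions ship this wrapper; newer setups use scikeras.wrappers.KerasRegressor instead):

from tensorflow import keras

keras_reg = keras.wrappers.scikit_learn.KerasRegressor(build_model)

# Use it like a regular Scikit-Learn regressor
keras_reg.fit(X_train, y_train, epochs=100,
              validation_data=(X_valid, y_valid),
              callbacks=[keras.callbacks.EarlyStopping(patience=10)])
mse_test = keras_reg.score(X_test, y_test)
y_pred = keras_reg.predict(X_test[:3])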
Train hundreds of variants and see which one performs best on the validation
set.
Since there are many hyperparameters, it is preferable to use a randomized
search rather than grid search
• explore the number of hidden layers, the number of neurons and the
learning rate:
• The exploration may last many hours depending on the hardware, the
size of the dataset, the complexity of the model and the value of
n_iter and cv.
• After that, can access the best parameters found, the best score, and
the trained Keras model like this:
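A sketch of such a randomized search, assuming keras_reg and the data splits from the previous sketch:

import numpy as np
from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV
from tensorflow import keras

param_distribs = {
    "n_hidden": [0, 1, 2, 3],
    "n_neurons": np.arange(1, 100),
    "learning_rate": reciprocal(3e-4, 3e-2),
}

rnd_search_cv = RandomizedSearchCV(keras_reg, param_distribs, n_iter=10, cv=3)
rnd_search_cv.fit(X_train, y_train, epochs=100,
                  validation_data=(X_valid, y_valid),
                  callbacks=[keras.callbacks.EarlyStopping(patience=10)])

print(rnd_search_cv.best_params_)               # best hyperparameters found
print(rnd_search_cv.best_score_)                # best cross-validated score
model = rnd_search_cv.best_estimator_.model     # the trained Keras model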
A few Python libraries you can use to
optimize hyperparameters:
1. Hyperopt: a popular Python library for optimizing over all sorts of
complex search spaces (including real values such as the learning rate, or
discrete values such as the number of layers).
2. Hyperas, kopt or Talos: for optimizing hyperparameters of Keras models (the
first two are based on Hyperopt).
3. Scikit-Optimize (skopt): a general-purpose optimization library. Its
BayesSearchCV class performs Bayesian optimization using an interface similar
to GridSearchCV.
4. Spearmint: a Bayesian optimization library.
5. Sklearn-Deap: a hyperparameter optimization library based on
evolutionary algorithms, also with a GridSearchCV-like interface.
• Guidelines for choosing:
the number of hidden layers,
the number of neurons per hidden layer, and
good values for the learning rate, batch size, and other hyperparameters.
Number of Hidden Layers
• An MLP with one hidden layer can model even the most complex
functions provided it has enough neurons.
• Deep networks have a much higher parameter efficiency than shallow
ones: they can model complex functions using exponentially fewer
neurons than shallow nets, allowing them to reach much better
performance with the same amount of training data.
• Deep Neural Networks:
lower hidden layers model low-level structures (e.g., line segments of
various shapes and orientations),
intermediate hidden layers combine these low-level structures to
model intermediate-level structures (e.g., squares, circles), and
highest hidden layers and the output layer combine these
intermediate structures to model high-level structures (e.g., faces).
• This hierarchical architecture helps DNNs
1. converge faster to a good solution,
2. improve their ability to generalize to new datasets.
• Example
❑ already trained a model to recognize faces in pictures
❑ now want to train a new neural network to recognize hairstyles
• Then, kickstart training by reusing the lower layers of the first
network.
The new network will not have to learn from scratch all the low-level
structures that occur in most pictures; it will only have to learn the
higher-level structures (e.g., hairstyles).
• This is called transfer learning.
• For many problems you can start with just one or two hidden layers and it
will work fine
e.g., above 97% accuracy on the MNIST dataset using just one hidden layer
above 98% accuracy using two hidden layers with the same total amount of
neurons, in roughly the same amount of training time
• For more complex problems, gradually ramp up the number of hidden
layers, until you start overfitting the training set.
• Very complex tasks, such as large image classification or speech
recognition, typically require networks with dozens of layers and they need
a huge amount of training data.
• However, you will rarely have to train such networks from scratch: it
is much more common to fine-tune an existing network.
• Reuse parts of a pretrained network that performs a similar task.
• Training will be a lot faster and require much less data.
Number of Neurons per Hidden Layer
• The number of neurons in the input and output layers is determined by the
type of input and output your task requires.
• Hidden layers:
common practice: size them to form a pyramid, with fewer and fewer
neurons at each layer (philosophy: many low-level features can coalesce
into far fewer high-level features).
• Example, a typical neural network for MNIST may have three hidden layers,
• the first with 300 neurons, the second with 200, and the third with 100.
• practice is abandoned now: simply using the same number of neurons in all
hidden layers performs just as well in most cases, or even better
• However, depending on the dataset, it can sometimes help to make the
first hidden layer bigger than the others.
• Try increasing the number of neurons gradually until the network
starts overfitting.
• In general, better to increase the number of layers than the number
of neurons per layer.
• Finding the perfect amount of neurons is still somewhat of a dark art.
• A simple approach:
Pick a model with more layers and neurons than you actually need
then use early stopping to prevent it from overfitting (and other
regularization techniques)
• Nice-to-know:
• This is the “stretch pants” approach:
• instead of wasting time looking for pants that perfectly match your
size, just use large stretch pants that will shrink down to the right size.
Learning Rate, Batch Size and Other
Hyperparameters
• Few hyperparameters, and some tips on how to set them:
• The learning rate:
the most important hyperparameter
The optimal learning rate is about half of the maximum learning rate
(the rate above which the training algorithm diverges).
An approach for tuning the learning rate: start with a large value that
makes the training algorithm diverge, then divide this value by 3 and
try again, and repeat until the training algorithm stops diverging.
• It is sometimes useful to reduce the learning rate during training.
• Choosing a better optimizer than Mini-batch Gradient Descent (and
tuning its hyperparameters) is important.
• The batch size has a significant impact on the model’s performance
and the training time.
• In general, the optimal batch size will be lower than 32.
• A small batch size ensures that each training iteration is very fast, while a
large batch size gives a more precise estimate of the gradients.
• Having a batch size greater than 10 helps take advantage of hardware and
software optimizations (in particular matrix multiplications), which speeds up
training.
• If you use Batch Normalization, the batch size should not be too small (no
less than 20).
• Choice of the activation function:
• The ReLU activation function will be a good default for all hidden
layers.
• For the output layer, it really depends on your task.
• In most cases, the number of training iterations does not actually
need to be tweaked: just use early stopping instead.
---XXX---