Deep Learning Module-01
➢ Deep learning, a subset of machine learning, has revolutionized various fields by enabling systems to learn and make decisions with minimal human intervention.
➢ At its core, deep learning leverages artificial neural networks with multiple layers (hence "deep") to model complex patterns in data.
➢ This introduction provides an overview of deep learning models, their architectures, applications, and significance in today's technological landscape.
❖ Deep learning involves training artificial neural networks, computational models inspired by the human brain, to recognize patterns and make decisions based on vast amounts of data.
❖ Unlike traditional machine learning, which may require feature engineering and manual intervention, deep learning models automatically discover representations and features from raw data, making them particularly effective for tasks like image and speech recognition.
Key Components of Deep Learning Models
2. Layers: A deep network is organized into an input layer, one or more hidden layers, and an output layer; stacking many hidden layers is what makes the network "deep".
4. Loss Function: Measures the difference between the model's predictions and the actual outcomes, guiding the optimization process (a short sketch follows).
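To make the loss-function idea concrete, the short NumPy sketch below computes a mean squared error between a model's predictions and the actual outcomes; the array values are invented purely for illustration.

```python
import numpy as np

def mse_loss(y_pred, y_true):
    """Mean squared error: the average squared difference
    between predictions and actual outcomes."""
    return np.mean((y_pred - y_true) ** 2)

# Invented predictions vs. actual values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

print(mse_loss(y_pred, y_true))  # 0.375
```

During training, the optimizer adjusts the model's weights to drive this value down.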
Popular Deep Learning Architectures
1. Convolutional Neural Networks (CNNs):
o Key Features: Use convolutional layers to learn spatial hierarchies of features from input images.
2. Recurrent Neural Networks (RNNs):
o Key Features: Incorporate loops to maintain information across time steps, making them well suited to sequential data such as text and speech.
o Variants: Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks address issues like vanishing gradients.
3. Transformer Models:
o Key Features: Use self-attention to model relationships across an entire sequence, forming the basis of large pre-trained language models such as BERT and GPT.
4. Generative Adversarial Networks (GANs):
o Key Features: Consist of two networks, a generator and a discriminator, that compete against each other, improving the quality of generated data over time.
5. Autoencoders:
o Key Features: Comprise an encoder that compresses the data and a decoder that reconstructs it (see the sketch after this list).
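As an illustration of the encoder/decoder structure, below is a minimal autoencoder sketch using PyTorch; the layer sizes and the 784-dimensional input (e.g., a flattened 28x28 image) are assumptions made for the example, not part of the notes.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Encoder compresses the input to a small code; decoder reconstructs it."""
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, code_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        code = self.encoder(x)      # compressed representation
        return self.decoder(code)   # reconstruction of the input

x = torch.rand(16, 784)                      # a batch of dummy inputs
model = Autoencoder()
loss = nn.functional.mse_loss(model(x), x)   # reconstruction error to minimize
```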
Deep learning models have a wide array of applications across various industries, including computer vision, speech recognition, and natural language processing.
Advantages of Deep Learning
• Automatic Feature Extraction: Eliminates the need for manual feature engineering, allowing models to learn directly from raw data.
• Scalability: Can handle large volumes of data and complex models with millions of
parameters.
• Versatility: Applicable to diverse domains and tasks, from vision and speech to text and
beyond.
Challenges of Deep Learning
• Data Requirements: Deep learning models typically require vast amounts of labeled data, which can be costly and time-consuming to obtain.
• Interpretability: Deep networks are often considered "black boxes," making it difficult to
understand how decisions are made.
• Overfitting: Models can become too tailored to training data, reducing their ability to
generalize to new, unseen data.
Historical Trends in Deep Learning
Deep learning, a branch of machine learning, has experienced tremendous growth and
transformation over the decades.
While its core principles date back to the mid-20th century, it has undergone several stages of
advancement due to technological innovations, better algorithms, and increased computational
power. Below is a timeline highlighting key historical trends in deep learning:
1. Early Foundations (1940s–1960s)
The foundation for deep learning lies in early research on neural networks and the imitation of
human cognition in machines. Several key milestones shaped the beginnings of the field:
• 1943: McCulloch and Pitts: The concept of a neuron as a binary classifier was introduced by Warren McCulloch and Walter Pitts. They proposed a mathematical model of a neuron that laid the groundwork for later neural network research.
• 1958: Perceptron by Frank Rosenblatt: The perceptron was a simple neural network
designed to perform binary classification tasks. It could learn by adjusting weights based
on input-output relationships, similar to modern deep learning models. However, its
limitations in handling non-linearly separable data, such as the XOR problem, restricted its
capabilities.
• 1960s: Backpropagation Concept Introduced: Although it wasn't widely used until much
later, the concept of backpropagation—the algorithm for training multilayer neural
networks—was introduced by multiple researchers, including Bryson and Ho.
After initial interest, neural networks entered a period of decline, often called the "AI winter."
There was disappointment in the limitations of single-layer perceptrons, and other machine
learning methods, such as support vector machines (SVMs) and decision trees, gained traction.
• 1970s: The limitations of early neural networks, like the perceptron, led to reduced funding
and enthusiasm for the approach.
• 1980s: Interest was revived through theoretical work, and the foundations for several deep learning breakthroughs were laid during this period, though they wouldn't be fully realized for decades.
• 1986: Backpropagation Popularized: Rumelhart, Hinton, and Williams showed that backpropagation enables effective training of multi-layer perceptrons, which overcame the limitations of single-layer models.
This development reignited interest in neural networks and laid the groundwork for future
deep learning models.
• 1989: Convolutional Neural Networks (CNNs) Introduced: Yann LeCun developed the
first CNN, LeNet, designed for image classification tasks. LeNet was able to recognize
handwritten digits and was used by banks to process checks, marking one of the earliest practical applications of deep learning.
• 1990s: Recurrent Neural Networks (RNNs): Researchers like Jürgen Schmidhuber and
Sepp Hochreiter developed Long Short-Term Memory (LSTM) networks in 1997, solving
the problem of vanishing gradients in standard RNNs and allowing neural networks to
better handle sequential data.
• 2006: Deep Belief Networks (DBNs): Geoffrey Hinton and his team proposed the idea of
using deep belief networks, a type of unsupervised deep neural network. This marked the
beginning of modern deep learning, where the goal was to train deeper neural networks
that could learn complex representations.
• 2007–2009: GPU Acceleration: The adoption of Graphics Processing Units (GPUs) for
deep learning computations drastically improved the ability to train deeper networks faster.
This technological breakthrough allowed for more practical training of neural networks
with multiple layers.
The 2010s are often referred to as the "Golden Age" of deep learning. With the combination of
better hardware (especially GPUs), large datasets, and advanced algorithms, deep learning
achieved state-of-the-art performance across various domains.
• 2012: AlexNet and ImageNet Competition: A deep CNN called AlexNet, developed by
Alex Krizhevsky and Geoffrey Hinton, won the ImageNet Large Scale Visual Recognition
Challenge by a large margin. This victory demonstrated the power of deep learning in
image recognition and spurred widespread interest in the field.
• 2014: Generative Adversarial Networks (GANs): Introduced by Ian Goodfellow and colleagues, GANs consist of two networks, a generator and a discriminator, that compete against each other, enabling the creation of highly realistic synthetic data.
• 2018–2019: Transfer Learning and Pre-trained Models: Large pre-trained models like
BERT (from Google) and GPT-2 (from OpenAI) demonstrated the power of transfer
learning, where a model pre-trained on massive datasets can be fine-tuned for specific tasks
with smaller datasets, drastically reducing training time and improving performance.
The 2020s have seen deep learning evolve further, with a focus on more efficient models, ethical
AI practices, and novel applications.
• Large Language Models: Transformer-based models trained on massive text corpora have achieved impressive results in content generation, summarization, and conversational AI.
• Deep Reinforcement Learning: Deep learning has been integrated with reinforcement
learning to create AI agents capable of mastering complex environments. Breakthroughs
like AlphaGo and AlphaZero (developed by DeepMind) demonstrate the potential of AI
in learning strategies through trial and error in dynamic environments.
• Ethics and Interpretability: As deep learning models are increasingly deployed in real-
world applications, attention has shifted toward ensuring fairness, reducing biases, and
improving the interpretability of these "black box" models.
• Resource Efficiency: There has been a growing interest in optimizing deep learning
models to make them more resource-efficient, addressing concerns about the
environmental impact of training massive models. Techniques like pruning, quantization,
and distillation aim to reduce the computational and energy demands of deep learning
models.
Machine learning allows computers to learn from data to improve their performance on certain
tasks. The main components of machine learning are the task (T), the performance measure (P),
and the experience (E). These three elements form the basis of any machine learning algorithm.
1. The Task (T)
The task in machine learning is the problem that we want the system to solve. It could be
recognizing images, predicting numbers, translating languages, or even detecting fraud. The task
doesn’t include learning itself but refers to the goal or action we want the machine to perform.
• Classification: The algorithm assigns an input (like an image) into one of several
categories. For example, identifying whether an image is of a cat or a dog is a classification
task.
• Regression: The algorithm predicts a continuous value, like forecasting house prices or
stock market trends.
• Transcription: The algorithm converts unstructured data into a structured format, such as
tu
recognizing text in images (optical character recognition) or converting speech into text.
• Machine Translation: Translating text from one language to another, like English to
French.
• Anomaly Detection: The algorithm flags unusual or atypical examples, such as fraudulent credit card transactions.
• Structured Output: Tasks where the output involves multiple values that are connected,
such as generating captions for images.
• Synthesis and Sampling: The algorithm creates new data that is similar to the training
data, like generating realistic images or audio.
• Imputation of Missing Values: Predicting missing data points based on the available
information.
• Denoising: Cleaning up corrupted data by predicting what the original data was before it
got corrupted.
• Density Estimation: Learning the probability distribution that explains how data points
are spread out in the dataset.
2. The Performance Measure (P)
The performance measure tells us how well the machine learning algorithm is doing. It helps us
compare the system’s predictions with the actual results. Different tasks require different
performance measures.
For example, in classification tasks, the performance measure might be accuracy, which tells us
how many predictions were correct. Alternatively, we can measure the error rate, which counts
how many predictions were wrong. In some cases, we may want a more detailed performance
measure, such as giving partial credit for partially correct answers.
For tasks that don’t involve predicting categories (like density estimation), accuracy isn’t useful,
so we use other performance measures, like log-probability.
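To make the accuracy and error-rate measures described above concrete, here is a minimal NumPy sketch; the label arrays are invented for illustration.

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])     # actual classes (invented)
y_pred = np.array([0, 1, 0, 0, 1])     # a model's predictions (invented)

accuracy = np.mean(y_pred == y_true)   # fraction of correct predictions
error_rate = 1.0 - accuracy            # fraction of incorrect predictions
print(accuracy, error_rate)            # 0.8 0.2
```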
3. The Experience (E)
The experience refers to the data that the algorithm learns from. There are different types of
experiences:
• Supervised Learning: The system is trained using data that includes both input features
and their corresponding outputs or labels. For example, training a model with labeled
images of cats and dogs, so it learns to classify them.
• Unsupervised Learning: The system is trained using data without labels. It tries to find
patterns or structure in the data, such as grouping similar data points together (clustering)
or estimating the data distribution (density estimation).
• Semi-Supervised Learning: Some examples in the training data have labels, but others
don’t. This is useful when getting labeled data is difficult or expensive.
• Reinforcement Learning: The system learns by interacting with an environment and receiving feedback based on its actions. This approach is used in robotics and game
playing, where the system gets rewards or penalties based on the decisions it makes.
Example: Linear Regression
To make the concept clearer, we can look at an example of a machine learning task called linear
regression, which predicts a continuous value. In linear regression, the algorithm uses the input
data (represented as a vector) to predict a value by calculating a linear combination of the input
features.
For example, if you want to predict the price of a house based on its size and location, the algorithm
might use a linear function to estimate the price. The output is calculated by multiplying the input
features by their corresponding weights and summing them up.
The weights are the parameters that the algorithm adjusts during training. The goal is to find the
weights that minimize the mean squared error (MSE), which measures how far off the
predictions are from the actual values.
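As a minimal sketch of this idea, the NumPy code below fits the weights of a toy house-price model by minimizing MSE; the feature values, prices, and the two features (size and a location score) are all invented for illustration.

```python
import numpy as np

# Invented data: each row is [size_in_sqft, location_score]
X = np.array([[1000, 3], [1500, 2], [2000, 4], [1200, 5]], dtype=float)
y = np.array([200.0, 250.0, 340.0, 230.0])       # prices in thousands (invented)

X_b = np.hstack([X, np.ones((len(X), 1))])        # append a bias (intercept) column

# Least-squares solution: the weights that minimize mean squared error
w = np.linalg.lstsq(X_b, y, rcond=None)[0]

y_pred = X_b @ w                                  # weighted sum of the input features
print(w, np.mean((y_pred - y) ** 2))              # learned weights and the MSE
```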
Supervised Learning Algorithms
Supervised learning algorithms learn to map inputs (x) to outputs (y) using a training set. These outputs (labels) are often provided by a human, but they can also be collected automatically.
1. Probabilistic Supervised Learning
Most supervised learning algorithms estimate the probability of an output y given an input x, written p(y | x). This can be done using maximum likelihood estimation, which chooses the parameters that make the observed training outputs most probable.
2. Logistic Regression
• For classification tasks (e.g., binary classification), we predict a class by squashing the output into a probability between 0 and 1 using the logistic sigmoid function σ(θᵀx).
• This technique is known as logistic regression. Despite its name, it is used for
classification, not regression.
• Linear regression allows us to compute optimal weights using a simple formula (normal
equations).
• Logistic regression does not have a closed-form solution. Instead, the optimal weights are
found by minimizing the negative log-likelihood (NLL) using gradient descent.
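Below is a minimal NumPy sketch of logistic regression trained by gradient descent on the negative log-likelihood; the tiny one-feature dataset, the learning rate, and the iteration count are illustrative choices, not values from the notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Invented 1-D data: class 1 tends to have larger feature values
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

X_b = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
theta = np.zeros(X_b.shape[1])

for _ in range(5000):                        # gradient descent on the mean NLL
    p = sigmoid(X_b @ theta)                 # predicted P(y = 1 | x)
    grad = X_b.T @ (p - y) / len(y)          # gradient of the negative log-likelihood
    theta -= 0.1 * grad

print(theta, sigmoid(X_b @ theta).round(2))  # learned weights and class probabilities
```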
4. k-Nearest Neighbors (k-NN)
• k-NN is a non-parametric algorithm used for classification or regression. It doesn’t have a
traditional training phase; instead, it stores all training data.
• At test time, it finds the k-nearest neighbors of a test point and predicts the output by
averaging their values.
• For classification, it averages over one-hot encoded vectors to get a probability distribution over classes (see the sketch after this list).
• Strength: k-NN can handle large datasets well and achieve high accuracy with enough training examples.
• Weakness: It struggles with small datasets and computational efficiency, especially with irrelevant features, as it treats all features equally.
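The sketch below is a bare-bones NumPy k-NN classifier; the two-dimensional training points and the choice k = 3 are invented for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distance to every point
    nearest = np.argsort(dists)[:k]                      # indices of the k closest points
    votes = np.bincount(y_train[nearest])                # count the neighbors' labels
    return np.argmax(votes)

# Invented 2-D training data with two well-separated classes
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.5, 1.5])))  # -> 0
print(knn_predict(X_train, y_train, np.array([6.5, 6.5])))  # -> 1
```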
5. Decision Trees
• Decision Trees divide the input space into regions based on decisions made at each node
of the tree. Internal nodes make binary decisions, and leaf nodes map each region to a
constant output.
• Weakness: Decision trees may struggle with problems where decision boundaries aren’t
axis-aligned, requiring many nodes to approximate simple boundaries.
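As an illustration, the snippet below fits a small tree with scikit-learn (used here only as a convenient implementation, not something the notes prescribe); the data, the axis-aligned boundary at x0 > 3, and the depth limit are invented for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Invented 2-D data: the class depends only on whether the first feature exceeds 3,
# i.e. an axis-aligned decision boundary that a tree can represent exactly.
X = np.array([[1, 5], [2, 1], [2, 4], [4, 2], [5, 5], [6, 1]])
y = np.array([0, 0, 0, 1, 1, 1])

tree = DecisionTreeClassifier(max_depth=2)   # internal nodes make binary threshold decisions
tree.fit(X, y)

print(tree.predict([[1.5, 3.0], [5.5, 3.0]]))  # -> [0 1]
```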
Unsupervised Learning Algorithms
Unsupervised learning algorithms deal with data that contains only features and no labeled targets.
They aim to extract meaningful patterns or structures from the data without human supervision,
and they are often used for tasks like clustering, density estimation, and learning data
representations.
1. Goals of Unsupervised Learning
The main goal in unsupervised learning is often to find the best representation of the data. A
good representation preserves the most important information about the data while simplifying it
or making it easier to work with.
2. Types of Representations
There are three common types of data representations:
• Low-Dimensional Representations: Compress the data into a smaller number of features. Reducing the dimensionality of the data helps with compression and makes it easier to find and use the key features.
• Sparse Representations: Map the data into a higher-dimensional space where most of the values are zero. This structure makes the representation more efficient and reduces redundancy.
• Independent Representations: Disentangle the sources of variation in the data so that the resulting features are statistically independent.
Sparse and independent representations make the data easier to interpret and process in machine learning algorithms.
Principal Component Analysis (PCA)
1. Goals of PCA
PCA reduces the dimensionality of the data while ensuring that the new representation's features
are decorrelated (no linear correlations between the features). It is a step toward achieving
statistical independence of the features, though PCA only removes linear relationships.
2. How PCA Works
• Linear Transformation: PCA projects the data onto new axes that capture the directions of maximum variance in the data.
• The algorithm learns an orthogonal transformation that projects an input x to a new representation z = xᵀW, where W is a matrix of principal components (the directions of maximum variance).
• The first principal component explains the most variance in the data, and each subsequent
component captures the remaining variance, while being orthogonal to the previous ones.
• PCA transforms the data such that the covariance matrix of the new representation is
diagonal, meaning the new features are uncorrelated.
• The result is a compact, decorrelated representation of the data that can be used for further
analysis while minimizing information loss.
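A minimal NumPy sketch of PCA via eigendecomposition of the covariance matrix is given below; the random data and the choice to keep two components are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))             # invented data: 200 samples, 5 features
X = X - X.mean(axis=0)                    # center the data

cov = np.cov(X, rowvar=False)             # 5 x 5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvectors are the principal components

order = np.argsort(eigvals)[::-1]         # sort by variance explained, descending
W = eigvecs[:, order[:2]]                 # keep the top-2 principal components

Z = X @ W                                 # new representation z = xᵀW
print(np.cov(Z, rowvar=False).round(3))   # covariance is (approximately) diagonal
```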
k-Means Clustering
k-Means clustering is a simple and widely used unsupervised learning algorithm. It divides a
dataset into k clusters, grouping examples that are close to each other in the feature space. Each
data point is assigned to the nearest cluster, and the algorithm iteratively refines these clusters.
1. How k-Means Works
• The algorithm begins by initializing k centroids (cluster centers), which are assigned
random values.
• Assignment Step: Each data point is assigned to the nearest centroid, forming clusters.
• Update Step: Each centroid is recalculated as the mean of the points assigned to it.
• This process repeats until the centroids no longer change significantly, signaling convergence (a minimal sketch follows this list).
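The NumPy sketch below implements the assignment and update steps just described; the toy 2-D points, k = 2, and the iteration cap are invented for illustration.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):               # converged
            break
        centroids = new_centroids
    return centroids, labels

# Invented 2-D data with two obvious groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
centroids, labels = kmeans(X, k=2)
print(labels)   # each point's cluster index; a one-hot vector h would have a 1 at that index
```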
2. One-Hot Representation
• k-means clustering provides a one-hot representation for each data point. If a point belongs to cluster i, its representation vector h has a 1 at position i and 0 everywhere else.
• This is an example of a sparse representation because only one element in the vector is
non-zero for each point.
• However, this representation is limited because it treats clusters as mutually exclusive and
doesn’t capture relationships between different clusters.
3. Limitations of k-Means
• Ill-posed Problem: There is no single, definitive way to evaluate how well the clustering
reflects real-world structures. For example, clustering based on vehicle color (red vs. gray)
is as valid as clustering based on type (car vs. truck), but each reveals different information.
• Lack of Fine-Grained Similarity: k-means provides a strict one-hot output, which doesn’t
capture nuanced similarities between examples. For instance, it can’t show that red cars are
more similar to gray cars than gray trucks.
• Distributed Representations as an Alternative: A distributed representation can encode several attributes for each data point. For example, vehicles could be described by both color and type (e.g., car or truck), allowing for more detailed comparisons. Distributed representations are more flexible and can capture complex relationships between data points, reducing the burden on the algorithm to find a single attribute for clustering.