Deep Learning: Module-01
➢ Deep learning, a subset of machine learning, has revolutionized various fields by enabling
systems to learn and make decisions with minimal human intervention.
➢ At its core, deep learning leverages artificial neural networks with multiple layers (hence
"deep") to model complex patterns in data.
➢ This introduction provides an overview of deep learning models, their architectures,
applications, and significance in today's technological landscape.
❖ Deep learning involves training artificial neural networks, computational models inspired by the human brain, to recognize patterns and make decisions based on vast amounts of data.
❖ Unlike traditional machine learning, which may require feature engineering and manual
intervention, deep learning models automatically discover representations and features
from raw data, making them particularly effective for tasks like image and speech
recognition.
2. Layers:
4. Loss Function: Measures the difference between the model's predictions and the actual
outcomes, guiding the optimization process.
2. Recurrent Neural Networks (RNNs):
o Key Features: Incorporate loops to maintain information across time steps, making them suitable for tasks where context is essential.
o Variants: Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks address issues like vanishing gradients.
3. Transformer Models:
5. Autoencoders:
o Key Features: Comprise an encoder that compresses the data and a decoder that
reconstructs it.
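To make the encoder/decoder idea concrete, here is a minimal autoencoder sketch in PyTorch. The framework choice, layer sizes, 784-dimensional input, and MSE reconstruction loss are illustrative assumptions, not something prescribed by these notes.

```python
# A minimal autoencoder sketch (illustrative sizes, not from the notes).
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compresses the input into a small latent code.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: reconstructs the input from the latent code.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)       # compressed representation
        return self.decoder(z)    # reconstruction of x

model = Autoencoder()
x = torch.randn(16, 784)              # a batch of 16 flattened inputs
loss = nn.MSELoss()(model(x), x)      # reconstruction error to minimize
```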
Deep learning models have a wide array of applications across various industries, and they offer several key advantages:
• Automatic Feature Extraction: Eliminates the need for manual feature engineering,
allowing models to learn directly from raw data.
• Scalability: Can handle large volumes of data and complex models with millions of
parameters.
• Versatility: Applicable to diverse domains and tasks, from vision and speech to text and
beyond.
At the same time, deep learning faces several challenges:
• Data Requirements: Deep learning models typically require vast amounts of labeled data, which can be costly and time-consuming to obtain.
• Interpretability: Deep networks are often considered "black boxes," making it difficult to
understand how decisions are made.
• Overfitting: Models can become too tailored to training data, reducing their ability to
generalize to new, unseen data.
Deep learning, a branch of machine learning, has experienced tremendous growth and
transformation over the decades.
While its core principles date back to the mid-20th century, it has undergone several stages of
advancement due to technological innovations, better algorithms, and increased computational
power. Below is a timeline highlighting key historical trends in deep learning:
The foundation for deep learning lies in early research on neural networks and the imitation of
human cognition in machines. Several key milestones shaped the beginnings of the field:
• 1943: McCulloch and Pitts: The concept of a neuron as a binary classifier was introduced
by Warren McCulloch and Walter Pitts. They proposed a mathematical model of a neuron
that laid the groundwork for later neural network research.
• 1958: Perceptron by Frank Rosenblatt: The perceptron was a simple neural network
designed to perform binary classification tasks. It could learn by adjusting weights based
on input-output relationships, similar to modern deep learning models. However, its
limitations in handling non-linearly separable data, such as the XOR problem, restricted its
capabilities.
• 1960s: Backpropagation Concept Introduced: Although it wasn't widely used until much
later, the concept of backpropagation—the algorithm for training multilayer neural
networks—was introduced by multiple researchers, including Bryson and Ho.
After initial interest, neural networks entered a period of decline, often called the "AI winter."
There was disappointment in the limitations of single-layer perceptrons, and other machine
learning methods, such as support vector machines (SVMs) and decision trees, gained traction.
• 1970s: The limitations of early neural networks, like the perceptron, led to reduced funding
and enthusiasm for the approach.
• 1980s: Interest was revived through theoretical work, and some breakthroughs in deep
learning principles were laid during this period, though they wouldn’t be fully realized for
decades.
• 1989: Convolutional Neural Networks (CNNs) Introduced: Yann LeCun developed the
first CNN, LeNet, designed for image classification tasks. LeNet was able to recognize
handwritten digits and was used by banks to process checks, marking one of the earliest
practical applications of deep learning.
• 1990s: Recurrent Neural Networks (RNNs): Researchers like Jürgen Schmidhuber and
Sepp Hochreiter developed Long Short-Term Memory (LSTM) networks in 1997, solving
the problem of vanishing gradients in standard RNNs and allowing neural networks to
better handle sequential data.
• 2006: Deep Belief Networks (DBNs): Geoffrey Hinton and his team proposed the idea of
using deep belief networks, a type of unsupervised deep neural network. This marked the
beginning of modern deep learning, where the goal was to train deeper neural networks
that could learn complex representations.
• 2007–2009: GPU Acceleration: The adoption of Graphics Processing Units (GPUs) for
deep learning computations drastically improved the ability to train deeper networks faster.
This technological breakthrough allowed for more practical training of neural networks
with multiple layers.
The 2010s are often referred to as the "Golden Age" of deep learning. With the combination of
better hardware (especially GPUs), large datasets, and advanced algorithms, deep learning
achieved state-of-the-art performance across various domains.
• 2012: AlexNet and ImageNet Competition: A deep CNN called AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, won the ImageNet Large Scale Visual Recognition
Challenge by a large margin. This victory demonstrated the power of deep learning in
image recognition and spurred widespread interest in the field.
• 2014: Generative Adversarial Networks (GANs): Ian Goodfellow and colleagues introduced GANs, in which a generator network and a discriminator network are trained against each other, enabling the synthesis of realistic images and other data.
• 2018–2019: Transfer Learning and Pre-trained Models: Large pre-trained models like
BERT (from Google) and GPT-2 (from OpenAI) demonstrated the power of transfer
learning, where a model pre-trained on massive datasets can be fine-tuned for specific tasks
with smaller datasets, drastically reducing training time and improving performance.
The 2020s have seen deep learning evolve further, with a focus on more efficient models, ethical
AI practices, and novel applications.
• Deep Reinforcement Learning: Deep learning has been integrated with reinforcement
learning to create AI agents capable of mastering complex environments. Breakthroughs
like AlphaGo and AlphaZero (developed by DeepMind) demonstrate the potential of AI
in learning strategies through trial and error in dynamic environments.
• Ethics and Interpretability: As deep learning models are increasingly deployed in real-
world applications, attention has shifted toward ensuring fairness, reducing biases, and
improving the interpretability of these "black box" models.
• Resource Efficiency: There has been a growing interest in optimizing deep learning
models to make them more resource-efficient, addressing concerns about the
environmental impact of training massive models. Techniques like pruning, quantization,
and distillation aim to reduce the computational and energy demands of deep learning
models.
Machine learning allows computers to learn from data to improve their performance on certain
tasks. The main components of machine learning are the task (T), the performance measure (P),
and the experience (E). These three elements form the basis of any machine learning algorithm.
The task in machine learning is the problem that we want the system to solve. It could be
recognizing images, predicting numbers, translating languages, or even detecting fraud. The task
doesn’t include learning itself but refers to the goal or action we want the machine to perform.
• Classification: The algorithm assigns an input (like an image) into one of several
categories. For example, identifying whether an image is of a cat or a dog is a classification
task.
• Regression: The algorithm predicts a continuous value, like forecasting house prices or
stock market trends.
• Transcription: The algorithm converts unstructured data into a structured format, such as
recognizing text in images (optical character recognition) or converting speech into text.
• Machine Translation: Translating text from one language to another, like English to
French.
• Structured Output: Tasks where the output involves multiple values that are connected,
such as generating captions for images.
• Synthesis and Sampling: The algorithm creates new data that is similar to the training
data, like generating realistic images or audio.
• Imputation of Missing Values: Predicting missing data points based on the available
information.
• Denoising: Cleaning up corrupted data by predicting what the original data was before it
got corrupted.
• Density Estimation: Learning the probability distribution that explains how data points
are spread out in the dataset.
The performance measure tells us how well the machine learning algorithm is doing. It helps us
compare the system’s predictions with the actual results. Different tasks require different
performance measures.
For example, in classification tasks, the performance measure might be accuracy, which tells us
how many predictions were correct. Alternatively, we can measure the error rate, which counts
how many predictions were wrong. In some cases, we may want a more detailed performance
measure, such as giving partial credit for partially correct answers.
For tasks that don’t involve predicting categories (like density estimation), accuracy isn’t useful,
so we use other performance measures, like log-probability.
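For concreteness, here is a small Python sketch of how accuracy and error rate could be computed for a classification task; the predictions and labels are made up purely for illustration.

```python
# Accuracy and error rate on a toy set of predictions (illustrative data).
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])   # actual classes
y_pred = np.array([1, 0, 0, 1, 0, 1])   # model's predicted classes

accuracy = np.mean(y_pred == y_true)    # fraction of correct predictions
error_rate = 1.0 - accuracy             # fraction of incorrect predictions
print(accuracy, error_rate)             # 0.833..., 0.166...
```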
The experience refers to the data that the algorithm learns from. There are different types of
experiences:
• Supervised Learning: The system is trained using data that includes both input features
and their corresponding outputs or labels. For example, training a model with labeled
images of cats and dogs, so it learns to classify them.
• Unsupervised Learning: The system is trained using data without labels. It tries to find
patterns or structure in the data, such as grouping similar data points together (clustering)
or estimating the data distribution (density estimation).
• Semi-Supervised Learning: Some examples in the training data have labels, but others
don’t. This is useful when getting labeled data is difficult or expensive.
To make the concept clearer, we can look at an example of a machine learning task called linear
regression, which predicts a continuous value. In linear regression, the algorithm uses the input
data (represented as a vector) to predict a value by calculating a linear combination of the input
features.
For example, if you want to predict the price of a house based on its size and location, the algorithm
might use a linear function to estimate the price. The output is calculated by multiplying the input
features by their corresponding weights and summing them up.
The weights are the parameters that the algorithm adjusts during training. The goal is to find the
weights that minimize the mean squared error (MSE), which measures how far off the
predictions are from the actual values.
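The sketch below shows this idea in Python: the weights are computed with the normal equations and the fit is evaluated with the mean squared error. A single size feature plus a bias term stands in for the richer features mentioned above, and the sizes and prices are invented toy values.

```python
# Linear regression fitted by minimizing MSE (toy data, illustrative only).
import numpy as np

X = np.array([[1.0, 50.0],      # each row: [bias term, house size]
              [1.0, 80.0],
              [1.0, 120.0]])
y = np.array([150.0, 240.0, 360.0])   # invented prices

# Normal equations: w = (X^T X)^{-1} X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ w                          # weighted sum of the input features
mse = np.mean((y_hat - y) ** 2)        # mean squared error of the fit
print(w, mse)
```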
Supervised learning algorithms learn to map inputs (x) to outputs (y) using a training set. These
outputs often require human intervention but can also be collected automatically.
Most supervised learning algorithms estimate the probability of an output y given an input x, written p(y | x). This can be done using maximum likelihood estimation, which finds the best parameters θ for the distribution.
2. Logistic Regression
• For classification tasks (e.g., binary classification), we predict a class by squashing the output into a probability between 0 and 1 using the logistic sigmoid function σ(θᵀx).
• This technique is known as logistic regression. Despite its name, it is used for
classification, not regression.
• Linear regression allows us to compute optimal weights using a simple formula (normal
equations).
• Logistic regression does not have a closed-form solution. Instead, the optimal weights are
found by minimizing the negative log-likelihood (NLL) using gradient descent.
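A minimal sketch of this procedure is shown below: the predicted probability is σ(θᵀx), and θ is updated by gradient descent on the average negative log-likelihood. The toy data, learning rate, and iteration count are arbitrary illustrative choices.

```python
# Logistic regression trained by gradient descent on the NLL (toy example).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0, -2.0], [1.0, -1.0],      # each row: [bias, feature]
              [1.0,  1.0], [1.0,  2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])           # binary labels

theta = np.zeros(2)
lr = 0.1
for _ in range(1000):
    p = sigmoid(X @ theta)            # predicted P(y = 1 | x)
    grad = X.T @ (p - y) / len(y)     # gradient of the average NLL
    theta -= lr * grad                # gradient descent step

print(theta, sigmoid(X @ theta))      # learned weights and probabilities
```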
4. k-Nearest Neighbors (k-NN)
• At test time, k-NN finds the k nearest neighbors of a test point and predicts the output by averaging their values.
• For classification, it averages over one-hot encoded vectors to get a probability distribution
over classes.
• Strength: Given enough training examples, k-NN can achieve high accuracy, since its predictions draw directly on the stored training data.
• Weakness: It generalizes poorly from small datasets, is computationally expensive at test time, and treats all features equally, so irrelevant features can dominate the distance calculation.
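The sketch below illustrates the prediction step described above: distances to every training point are computed, the k closest are selected, and their one-hot labels are averaged into a class distribution. The toy data and k = 3 are illustrative assumptions.

```python
# k-nearest neighbors classification (toy data, illustrative only).
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    # Euclidean distance from the query point to every training point.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]        # indices of the k closest points
    # Averaging one-hot labels is equivalent to counting neighbor votes.
    votes = np.bincount(y_train[nearest], minlength=y_train.max() + 1)
    return votes.argmax(), votes / k       # predicted class, class distribution

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1])))   # -> class 0
```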
5. Decision Trees
• Decision Trees divide the input space into regions based on decisions made at each node
of the tree. Internal nodes make binary decisions, and leaf nodes map each region to a
constant output.
• Weakness: Decision trees may struggle with problems where decision boundaries aren’t
axis-aligned, requiring many nodes to approximate simple boundaries.
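To make the region idea concrete, here is a tiny hand-built tree: each internal node tests one feature against a threshold (an axis-aligned decision), and each leaf returns a constant output. The thresholds and class names are invented for illustration.

```python
# A hand-built decision tree with axis-aligned splits (illustrative only).
def predict(x):
    if x[0] <= 0.5:              # internal node: split on feature 0
        return "class A"         # leaf: constant output for this region
    else:
        if x[1] <= 0.3:          # internal node: split on feature 1
            return "class B"
        else:
            return "class C"

print(predict([0.2, 0.9]))       # region x[0] <= 0.5              -> class A
print(predict([0.8, 0.1]))       # region x[0] > 0.5, x[1] <= 0.3  -> class B
```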
Unsupervised learning algorithms deal with data that contains only features and no labeled targets.
They aim to extract meaningful patterns or structures from the data without human supervision,
and they are often used for tasks like clustering, density estimation, and learning data
representations.
The main goal in unsupervised learning is often to find the best representation of the data. A
good representation preserves the most important information about the data while simplifying it
or making it easier to work with.
2. Types of Representations
• Low-Dimensional Representations: Reducing the dimensionality of the data helps with compression and makes it easier to find and use the key features.
• Sparse Representations: Map the data into a higher-dimensional space where most of the values are zero. This structure makes the representation more efficient and reduces redundancy.
• Independent Representations: Disentangle the sources of variation in the data so that the resulting features are statistically independent.
Sparse and independent representations make the data easier to interpret and process in machine learning algorithms.
1. Goals of PCA
PCA reduces the dimensionality of the data while ensuring that the new representation's features
are decorrelated (no linear correlations between the features). It is a step toward achieving
statistical independence of the features, though PCA only removes linear relationships.
• Linear Transformation: PCA projects the data onto new axes that capture the directions
of maximum variance in the data.
• The algorithm learns an orthogonal transformation that projects the input x to a new representation z = xᵀW, where W is a matrix of principal components (the directions of maximum variance).
• The first principal component explains the most variance in the data, and each subsequent
component captures the remaining variance, while being orthogonal to the previous ones.
• PCA transforms the data such that the covariance matrix of the new representation is
diagonal, meaning the new features are uncorrelated.
• The result is a compact, decorrelated representation of the data that can be used for further
analysis while minimizing information loss.
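The sketch below implements these steps with NumPy: center the data, compute the covariance matrix, take its eigenvectors as the principal components W, and project the data to obtain a decorrelated representation z = xᵀW. The 2-D toy data is generated only for illustration.

```python
# PCA via eigendecomposition of the covariance matrix (toy 2-D data).
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[3.0, 0.0], [1.0, 0.5]])
X = rng.normal(size=(200, 2)) @ A            # correlated toy data

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)       # covariance of the features

eigvals, eigvecs = np.linalg.eigh(cov)       # eigh handles symmetric matrices
order = np.argsort(eigvals)[::-1]            # sort by explained variance
W = eigvecs[:, order]                        # principal components

Z = X_centered @ W                           # decorrelated representation
print(np.round(np.cov(Z, rowvar=False), 3))  # approximately diagonal
```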
k-Means Clustering
k-Means clustering is a simple and widely used unsupervised learning algorithm. It divides a
dataset into k clusters, grouping examples that are close to each other in the feature space. Each
data point is assigned to the nearest cluster, and the algorithm iteratively refines these clusters.
1. The k-Means Algorithm
• The algorithm begins by initializing k centroids (cluster centers), which are assigned random values.
• Assignment Step: Each data point is assigned to the nearest centroid, forming clusters.
• Update Step: Each centroid is recalculated as the mean of the points assigned to it.
• This process repeats until the centroids no longer change significantly, signaling
convergence.
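Here is a compact NumPy sketch of the assignment and update steps described above; the toy data, k = 2, and the simple convergence check are illustrative choices.

```python
# k-means clustering: assignment and update steps (toy data, illustrative).
import numpy as np

def kmeans(X, k=2, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        # (Empty clusters are not handled in this toy sketch.)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):   # convergence: no change
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
print(kmeans(X, k=2))
```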
2. One-Hot Representation
• k-means clustering provides a one-hot representation for each data point. If a point belongs to cluster i, its representation vector h has a 1 at position i and 0 everywhere else.
• This is an example of a sparse representation because only one element in the vector is
non-zero for each point.
• However, this representation is limited because it treats clusters as mutually exclusive and
doesn’t capture relationships between different clusters.
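As a small illustration, the function below builds the one-hot vector h for a point assigned to a given cluster; k = 3 and the chosen cluster index are arbitrary.

```python
# One-hot (sparse) representation of a k-means cluster assignment.
import numpy as np

def one_hot(cluster_index, k):
    h = np.zeros(k)            # all-zero vector of length k
    h[cluster_index] = 1.0     # single non-zero entry at the assigned cluster
    return h

print(one_hot(1, k=3))         # -> [0. 1. 0.]
```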
3. Limitations of k-Means
• Ill-posed Problem: There is no single, definitive way to evaluate how well the clustering
reflects real-world structures. For example, clustering based on vehicle color (red vs. gray)
is as valid as clustering based on type (car vs. truck), but each reveals different information.
• Lack of Fine-Grained Similarity: k-means provides a strict one-hot output, which doesn't capture nuanced similarities between examples. For instance, it cannot express that a red car is more similar to a gray car than to a gray truck.
• Distributed representations are more flexible and can capture complex relationships
between data points, reducing the burden on the algorithm to find a single attribute for
clustering.