ML Assignment 1
Unit-1
Question 1: Explain the concept of a well-defined learning problem in machine learning. What are
the essential components that make a learning problem well-defined? Give suitable examples to
support your explanation. (12 marks)
A well-defined learning problem in machine learning (ML) is a clearly specified task where a model
learns from data to make predictions or decisions. According to Tom Mitchell, a learning problem is
well-defined if it involves a computer program learning from experience (E) with respect to a task (T)
and a performance measure (P), such that performance at T, as measured by P, improves with E.
Essential Components:
1. Task (T): The specific problem to be solved, such as classification, regression, or clustering.
The task must be precise to guide the model’s purpose.
2. Performance Measure (P): A quantitative metric to evaluate the model’s success. Common
metrics include accuracy, precision, recall, or mean squared error (MSE).
o Example: For spam detection, P could be the F1-score to balance precision and
recall.
3. Experience (E): The data or interactions used for learning, such as labeled datasets,
unlabeled data, or environmental feedback.
Examples:
o Performance Measure: F1-score to ensure both spam and non-spam emails are
correctly identified.
o Experience: A labeled dataset of emails with features like sender, subject, and
content.
o Well-defined because the task is specific, the metric is relevant, and the dataset is
sufficient.
o Task: Predict house prices based on features like size and location (regression).
o Performance Measure: Mean Absolute Error (MAE) to measure the average size of the prediction errors.
Importance:
A well-defined problem ensures focus, enables objective evaluation, and aligns the model with
practical goals. Ambiguity in any component can lead to poor model performance or misaligned
outcomes.
• Figure 1: A flowchart:
Question 2(a): Describe in detail the different types of machine learning. (12 marks)
Machine learning (ML) is divided into three main types based on the learning process and data
availability: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. Each has
unique characteristics, algorithms, and applications.
1. Supervised Learning:
• Definition: The model learns from labeled data, mapping inputs (features) to outputs
(labels).
• Subtypes:
• Algorithms: Linear Regression, Logistic Regression, Support Vector Machines (SVM), Neural
Networks.
• Example:
2. Unsupervised Learning:
• Definition: The model identifies patterns in unlabeled data without predefined outputs.
• Subtypes:
o Clustering: Grouping similar instances (e.g., customer segmentation).
• Example:
3. Reinforcement Learning:
• Example:
Comparison:
Practical Considerations:
• Supervised: Requires labeled data, which can be expensive.
Question 2(b): What considerations must be taken into account when designing a machine
learning system for a real-world application like email spam detection or medical diagnosis?
Explain with a step-by-step design approach. (12 marks)
Designing a machine learning system for real-world applications like email spam detection or
medical diagnosis requires careful consideration of data, model, evaluation, and deployment. Below
is a step-by-step design approach with considerations for spam detection as an example.
1. Problem Definition:
o Consideration: Clearly define the task (T), performance measure (P), and experience
(E).
o Spam Detection:
3. Feature Engineering:
o Spam Detection:
▪ Extract features like word counts (using TF-IDF), email length, or sender
reputation.
o Challenge: High-dimensional data may require dimensionality reduction (e.g., PCA).
4. Model Selection:
o Spam Detection:
▪ Algorithms: Naive Bayes (fast, good for text), Logistic Regression, or Neural
Networks.
▪ Naive Bayes is often preferred for its simplicity and effectiveness with text
data.
o Spam Detection:
o Consideration: Assess model performance with relevant metrics and refine based on
results.
o Spam Detection:
7. Deployment:
o Spam Detection:
o Privacy: Protect user data (e.g., anonymize emails or medical records).
o Bias: Ensure the model doesn’t discriminate (e.g., biased medical data may
misdiagnose certain groups).
• Figure 3: A flowchart:
Example Application:
• Spam Detection: A Naive Bayes model trained on TF-IDF features might reach an F1-score of around
95% and could be deployed in an email service to filter spam in real time.
• Medical Diagnosis: A Decision Tree model for diabetes prediction might reach around 90% recall and
be used in hospitals, with explanations provided for doctors.
Question 2(c): Define machine learning. How does it differ from traditional programming? Why is
machine learning gaining importance in modern technology and industry? Support your answer
with examples and applications. (12 marks)
Machine learning (ML) is a subset of artificial intelligence where systems learn from data to make
predictions or decisions without being explicitly programmed for each task. ML algorithms identify
patterns in data (experience) to improve performance on a task, evaluated by a performance
measure.
Aspect | Traditional Programming | Machine Learning
Flexibility | Rigid; requires manual rule updates. | Adapts to new data; generalizes patterns.
Scalability | Limited by rule complexity. | Scales with data, handles complex patterns.
• Traditional Programming: A spam filter might use hand-coded rules (e.g., flag emails with
specific keywords). Changes in spam tactics require manual rule updates.
• Machine Learning: A spam filter learns from labeled emails, adapting to new patterns (e.g.,
Naive Bayes classifying based on word frequencies).
3. Automation Needs: ML automates tasks that are infeasible with rule-based systems.
1. Healthcare:
2. Finance:
3. E-commerce:
4. Autonomous Systems:
Future Outlook:
ML’s ability to handle unstructured data (text, images) and its integration with IoT, 5G, and edge
computing make it indispensable for innovation in smart cities, personalized medicine, and more.
Question 3(a): Discuss the major challenges and issues faced in machine learning. (12 marks)
Machine learning (ML) faces several challenges that impact model development, performance, and
deployment. Below are the major challenges with detailed explanations.
• Issue: Overfitting occurs when a model learns noise in training data, failing to generalize.
Underfitting occurs when the model is too simple to capture patterns.
• Example: A complex neural network for spam detection may overfit to specific keywords,
misclassifying new emails.
3. Imbalanced Datasets:
• Issue: Unequal class distributions skew model predictions toward the majority class.
• Example: In fraud detection, 99% of transactions are non-fraudulent, causing the model to
ignore fraud cases.
4. Computational Resources:
• Issue: Training large models (e.g., deep neural networks) requires significant computational
power and time.
• Example: Training a language model like GPT requires GPUs, which are costly for small
organizations.
5. Interpretability:
• Issue: Complex models (e.g., neural networks) are often “black boxes,” making it hard to
explain predictions.
• Example: In healthcare, doctors need explanations for ML-based diagnoses to trust the
system.
• Solution: Use interpretable models (e.g., Decision Trees) or explainability tools (e.g., SHAP).
• Issue: Models can perpetuate biases in data, leading to unfair outcomes.
• Example: Facial recognition systems may misidentify individuals from certain ethnic groups
due to biased training data.
• Issue: Deploying models in real-time systems requires low latency and robustness.
• Example: A spam filter must process emails instantly, but a complex model may be too slow.
Practical Implications:
Addressing these challenges requires a balance of technical expertise, domain knowledge, and
ethical considerations to ensure robust, fair, and practical ML systems.
Question 3(b): What is concept learning in machine learning? Describe the concept learning task
and its importance. Explain with an example how it is framed as a search problem. (12 marks)
Concept learning is a machine learning task where the goal is to learn a function that maps input
instances to a binary output (positive or negative), representing whether an instance belongs to a
specific concept. It involves inducing a general rule (hypothesis) from labeled examples.
• Input: A set of training examples, each with attributes and a label (positive/negative).
• Output: A hypothesis (rule) that correctly classifies instances as belonging to the concept.
• Example: Learning the concept “enjoyable sports day” based on attributes like weather and
temperature.
Importance:
Framing as a Search Problem:
Concept learning is framed as a search through a hypothesis space (all possible rules) to find a
hypothesis consistent with the training data.
• Hypothesis Space: The set of all possible hypotheses, defined by attribute constraints (e.g.,
specific values or wildcards).
• Search Goal: Find a hypothesis that correctly classifies all training examples.
• Search Strategy: Algorithms like Find-S or Candidate Elimination systematically explore the
hypothesis space.
Example:
• Attributes: Sky (Sunny, Cloudy, Rainy), Temperature (Warm, Cold), Wind (Strong, Weak).
• Training Examples:
Hypothesis Space:
• A hypothesis is a tuple [Sky, Temperature, Wind], where each element is a specific value, ?,
or ⊥.
• Example Hypotheses:
Search Representation:
• The search navigates the hypothesis space lattice, moving from specific to general
hypotheses based on positive examples.
o Path: Arrows showing generalization from [Sunny, Warm, Weak] to [Sunny, Warm, ?].
Practical Considerations:
• Applications: Concept learning is used in rule-based systems, decision trees, and more.
Question 4: Explain the general-to-specific ordering of hypotheses in concept learning. How does it
help in organizing the hypothesis space? Illustrate your answer with suitable diagrams and
examples. (12 marks)
General-to-Specific Ordering:
In concept learning, the general-to-specific ordering is a partial order over the hypothesis space,
where hypotheses are ranked by their generality. A hypothesis ( h_1 ) is more general than ( h_2 )
(denoted ( h_1 \geq_g h_2 )) if ( h_1 ) covers all instances covered by ( h_2 ) and possibly more.
Conversely, ( h_2 ) is more specific than ( h_1 ).
• General Hypothesis: Covers more instances (e.g., [?, ?, ?] covers all instances).
• Specific Hypothesis: Covers fewer instances (e.g., [Sunny, Warm, Weak] covers only exact
matches).
Formal Definition:
• ( h_1 \geq_g h_2 ) if, for every instance ( x ), if ( h_2(x) = 1 ) (positive), then ( h_1(x) = 1 ).
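The ordering can be checked mechanically for conjunctive hypotheses. The sketch below is illustrative only: it assumes hypotheses are tuples whose entries are a specific value, "?" (any value), or None (standing in for ⊥), and the helper names covers and more_general_or_equal are hypothetical. The comparison is syntactic; it agrees with the definition above when every attribute has at least two possible values.

```python
# Hypotheses are tuples of attribute constraints:
#   a specific value (e.g. "Sunny"), "?" (any value), or None (stands in for ⊥).

def covers(h, x):
    """Return True if hypothesis h classifies instance x as positive."""
    return all(c == "?" or c == v for c, v in zip(h, x))

def more_general_or_equal(h1, h2):
    """Return True if h1 >=_g h2 for conjunctive hypotheses."""
    # A hypothesis containing ⊥ covers no instance, so every hypothesis is >=_g it.
    if any(c is None for c in h2):
        return True
    # Otherwise each constraint of h1 must be "?" or equal to h2's constraint.
    return all(c1 == "?" or c1 == c2 for c1, c2 in zip(h1, h2))

print(more_general_or_equal(("?", "?", "?"), ("Sunny", "Warm", "Weak")))
print(more_general_or_equal(("Sunny", "?", "?"), ("?", "Warm", "?")))
```

The second call compares two hypotheses where neither covers the other, which is why the ordering is only partial.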
The general-to-specific ordering structures the hypothesis space as a lattice, enabling efficient
search:
• Search Efficiency: Algorithms like Candidate Elimination use this ordering to systematically
generalize or specialize hypotheses, narrowing the version space.
• Version Space: The set of hypotheses between the Specific (S) and General (G) boundaries,
consistent with training data.
Example:
• Hypotheses:
• Ordering:
o ( h_1 ) covers all instances, ( h_4 ) covers only one specific instance.
Training Example:
• Search:
o Negative example → Specialize ( G = {[Sunny, ?, ?], [?, Warm, ?], [?, ?, Weak]} ).
Benefits:
• Compact Representation: Version space reduces the need to enumerate all hypotheses.
Practical Considerations:
Question 5(a): What is the Find-S algorithm? Describe how it works for concept learning using an
example. What are its limitations and in which scenarios can it fail? (12 marks)
Find-S Algorithm:
The Find-S (Find Specific) algorithm is a concept learning algorithm that finds the most specific
hypothesis consistent with positive training examples. It assumes the target concept is within the
hypothesis space and ignores negative examples.
How It Works:
1. Initialize: Start with the most specific hypothesis ( h = [⊥, ⊥, ..., ⊥] ) (no instances are
positive).
3. Ignore Negative Examples: Find-S assumes negative examples are implicitly handled by the
specific hypothesis.
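The procedure can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name find_s is hypothetical, None stands in for the all-⊥ hypothesis, and the three training rows are assumed for illustration using the Sky, Temperature, and Wind attributes.

```python
def find_s(examples):
    """examples: list of (attribute_tuple, label), with label True for positive."""
    h = None  # None stands for the all-⊥ hypothesis (covers nothing)
    for x, is_positive in examples:
        if not is_positive:
            continue  # Find-S ignores negative examples
        if h is None:
            h = list(x)  # first positive example: adopt its attribute values
        else:
            # Keep constraints that match x, generalize mismatches to "?"
            h = [c if c == v else "?" for c, v in zip(h, x)]
    return h

# Assumed training data (Sky, Temperature, Wind), for illustration only.
training = [
    (("Sunny", "Warm", "Weak"), True),
    (("Sunny", "Warm", "Strong"), True),
    (("Rainy", "Cold", "Strong"), False),
]
print(find_s(training))  # ['Sunny', 'Warm', '?']
```

The negative row never changes the result, which is the behaviour the limitations in this answer refer to.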
Example:
• Training Examples:
Execution:
1. Initialize: ( h = [⊥, ⊥, ⊥] ).
o Compare: Sky=Sunny (same), Temperature=Warm (same), Wind=Strong (differs from
Weak).
• Figure 8: A flowchart:
o Start: ( h = [⊥, ⊥, ⊥] ).
o Issue: Because negative examples are never used, Find-S cannot detect when its hypothesis has
become general enough to cover a negative example.
2. Single Hypothesis:
o Issue: Outputs only one hypothesis, missing other valid hypotheses in the version
space.
o Scenario: Multiple hypotheses may fit the data, but Find-S picks the most specific.
3. Noise Sensitivity:
Practical Considerations:
• Use Case: Suitable for simple, noise-free datasets with conjunctive concepts.
Question 5(b): Explain the List-Then-Eliminate algorithm in detail. How does it differ from the Find-
S algorithm? Describe its working using a simple concept learning problem with a hypothesis
space. (12 marks)
List-Then-Eliminate Algorithm:
The List-Then-Eliminate algorithm is a concept learning algorithm that explicitly maintains the entire
version space (all hypotheses consistent with training data). It starts with all possible hypotheses and
eliminates those inconsistent with each training example.
How It Works:
2. Process Examples:
3. Output: The remaining hypotheses (version space) consistent with all examples.
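A small sketch of the list-and-filter idea, under stated assumptions: hypotheses are conjunctive tuples of a value or "?", the all-⊥ hypothesis is left out for brevity, and the names covers and list_then_eliminate are hypothetical. The domains mirror a two-attribute Sky/Wind problem.

```python
from itertools import product

def covers(h, x):
    """Return True if hypothesis h classifies instance x as positive."""
    return all(c == "?" or c == v for c, v in zip(h, x))

def list_then_eliminate(domains, examples):
    # 1. List every conjunctive hypothesis: each attribute is a value or "?".
    version_space = list(product(*[list(d) + ["?"] for d in domains]))
    # 2. Eliminate every hypothesis that misclassifies a training example.
    for x, is_positive in examples:
        version_space = [h for h in version_space if covers(h, x) == is_positive]
    # 3. Whatever remains is the version space.
    return version_space

domains = [("Sunny", "Rainy"), ("Strong", "Weak")]   # Sky, Wind
examples = [(("Sunny", "Weak"), True)]
print(list_then_eliminate(domains, examples))
```

With two attributes the listing step produces only 9 hypotheses, but the count grows multiplicatively with each added attribute, which is the cost this algorithm pays for returning the whole version space.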
Aspect | Find-S | List-Then-Eliminate
Computational Cost | Low (processes only positive examples) | High (enumerates all hypotheses)
Example:
Dataset: Enjoy sports day (Sky: Sunny, Rainy; Wind: Strong, Weak).
• Training Examples:
1. (Sunny, Weak, Yes) → Positive
Execution:
1. Initialize: Version space = ( {h_1, h_2, h_3, h_4, h_5, h_6, h_7, ...} ).
o New version space: ( {h_2, h_7, ...} ) (e.g., [Sunny, ?], [Sunny, Weak]).
Practical Considerations:
Question 6(a): Describe the Candidate Elimination Algorithm and its working mechanism. How
does it use version spaces to narrow down the hypothesis set? Illustrate with an example showing
the General (G) and Specific (S) boundaries. (12 marks)
The Candidate Elimination Algorithm is a concept learning algorithm that maintains a version space
of all hypotheses consistent with training data, represented by two boundaries: the Specific (S)
boundary (the most specific consistent hypotheses) and the General (G) boundary (the most general
consistent hypotheses). It refines these boundaries with each example.
Working Mechanism:
1. Initialization:
2. Process Examples:
o Positive Example:
o Negative Example:
Version Space:
Example:
• Training Examples:
Execution:
1. Initialize:
o ( S_0 = [⊥, ⊥, ⊥] ).
o ( G_0 = [?, ?, ?] ).
o ( G_1 = [?, ?, ?] ) (no change).
o Specialize ( G_2 ): ( G_3 = {[Sunny, ?, ?], [?, Warm, ?], [?, ?, Weak]} ) (exclude Rainy,
Cold, Strong).
Practical Considerations:
• Advantage: Uses both positive and negative examples; like Find-S, it assumes noise-free training data.
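The boundary updates can be sketched as follows. This is a simplified illustration, not the full algorithm: it assumes conjunctive hypotheses (so S is a single hypothesis), noise-free data, and it omits the consistency checks a complete implementation would need. The function names and the two training rows are assumptions chosen so that the negative example produces the specialized G set quoted in the execution above.

```python
def covers(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def more_general_or_equal(h1, h2):
    if any(c is None for c in h2):          # None stands in for ⊥
        return True
    return all(c1 == "?" or c1 == c2 for c1, c2 in zip(h1, h2))

def candidate_elimination(domains, examples):
    n = len(domains)
    S = tuple([None] * n)                   # most specific boundary
    G = [tuple(["?"] * n)]                  # most general boundary
    for x, is_positive in examples:
        if is_positive:
            # Drop general hypotheses that reject the positive example.
            G = [g for g in G if covers(g, x)]
            # Minimally generalize S so that it covers x.
            S = tuple(v if c is None or c == v else "?" for c, v in zip(S, x))
        else:
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)         # already rejects the negative example
                    continue
                # Minimally specialize g: fix one "?" to a value that differs from x.
                for i, values in enumerate(domains):
                    if g[i] != "?":
                        continue
                    for v in values:
                        if v != x[i]:
                            candidate = g[:i] + (v,) + g[i + 1:]
                            # Keep only candidates that are still at least as general as S.
                            if more_general_or_equal(candidate, S):
                                new_G.append(candidate)
            # Remove duplicates and members that are less general than another member.
            new_G = list(dict.fromkeys(new_G))
            G = [g for g in new_G
                 if not any(h != g and more_general_or_equal(h, g) for h in new_G)]
    return S, G

domains = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Strong", "Weak")]
examples = [
    (("Sunny", "Warm", "Weak"), True),      # assumed positive example
    (("Rainy", "Cold", "Strong"), False),   # assumed negative example
]
S, G = candidate_elimination(domains, examples)
print("S =", S)
print("G =", G)
```

Every hypothesis lying between S and G in the general-to-specific ordering belongs to the version space, so the two boundaries describe it without listing its members.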
Question 6(b): What is inductive bias in machine learning? Why is it necessary for a learning
algorithm? Compare the inductive biases of Find-S and Candidate Elimination algorithms. (12
marks)
Inductive Bias:
Inductive bias is the set of assumptions a learning algorithm makes to generalize from training data
to unseen data. It constrains the hypothesis space, enabling the algorithm to select a hypothesis
even when data is insufficient.
Why Necessary:
• Generalization: Without inductive bias, an algorithm cannot choose among hypotheses that
fit the training data equally well.
• Finite Data: Training data is limited, so bias guides predictions on unseen instances.
Example:
• For a dataset with points (x, y), multiple curves (linear, polynomial) may fit the data. An
algorithm’s bias (e.g., preferring linear functions) determines the chosen model.
• Bias: Assumes the target concept is a conjunctive rule (e.g., Sky=Sunny AND
Temperature=Warm) and the most specific hypothesis consistent with positive examples is
correct.
• Implication: Ignores negative examples, assumes the concept is within the conjunctive
hypothesis space.
• Example: For sports day data, Find-S outputs [Sunny, Warm, ?], assuming only conjunctive
rules are valid.
• Bias: Assumes the target concept is within the hypothesis space and can be represented by
conjunctive rules, but considers all hypotheses consistent with both positive and negative
examples.
• Implication: Maintains a version space, allowing for more flexible hypotheses between ( S )
and ( G ).
• Example: For the same data, CEA outputs a version space (e.g., [Sunny, Warm, ?] to [Sunny,
?, ?]), capturing multiple valid hypotheses.
Comparison:
Aspect | Find-S | Candidate Elimination
Negative Examples | Ignored | Used to specialize ( G )
o Find-S: Small circle (single specific hypothesis).
Practical Considerations:
• Real-World: Both are limited by conjunctive bias; modern algorithms (e.g., neural networks)
use different biases.
Machine Learning Assignment Answer Booklet - Unit 2
Unit-2
Question 1(a): What is decision tree learning? Describe the decision tree learning algorithm in
detail. How is the best attribute selected at each node? Explain with an example using information
gain or Gini index. (12 marks)
Decision Tree Learning is a supervised machine learning method used for classification and
regression tasks. It builds a tree-like model where internal nodes represent tests on attributes,
branches represent outcomes of these tests, and leaf nodes represent class labels or continuous
values. The model makes decisions by traversing the tree from root to leaf based on input features.
The algorithm constructs the tree recursively using a top-down, greedy approach:
o Choose the attribute that best splits the data into homogeneous subsets, using a
metric like Information Gain, Gini Index, or Variance Reduction.
2. Create a Node:
4. Recurse:
o For each subset, repeat the process on the remaining attributes and data until:
▪ All instances in a subset belong to the same class (create a leaf with that
class).
5. Pruning (Optional):
o Post-process the tree to remove branches that overfit, using techniques like reduced
error pruning or cost-complexity pruning.
[
IG(A, S) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)
]
where ( S_v ) is the subset of ( S ) for value ( v ) of ( A ).
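The formula translates directly into code. The sketch below takes ( H ) to be Shannon entropy in bits; the function names and the four-row toy dataset are assumptions for illustration and are not the tennis data used in this answer.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum_c p_c * log2(p_c) over the classes present in labels."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def information_gain(values, labels):
    """IG(A, S), given attribute A's value and the class label for each instance."""
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)      # S_v for each value v of A
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Toy data, assumed for illustration.
labels = ["Yes", "Yes", "No", "No"]
outlook = ["Overcast", "Overcast", "Sunny", "Sunny"]   # separates the classes perfectly
wind = ["Weak", "Strong", "Weak", "Strong"]            # tells us nothing about the class

print(information_gain(outlook, labels))
print(information_gain(wind, labels))
```

An attribute selector would compute this value for every candidate attribute and split on the one with the largest gain.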
Example:
Dataset: Predicting whether to play tennis (attributes: Outlook, Temperature, Humidity, Wind).
• Outlook:
Tree:
Outlook
├── Sunny → No
├── Overcast → Yes
o Root: Outlook.
Practical Considerations:
Question 1(b): What is inductive bias in the context of decision tree learning? Discuss the inductive
bias of decision tree algorithms such as ID3 and C4.5. How does this bias affect learning outcomes?
(12 marks)
Inductive Bias:
Inductive bias is the set of assumptions a learning algorithm makes to generalize from training data
to unseen data. It constrains the hypothesis space, enabling the algorithm to select a hypothesis
when multiple fit the data.
In decision tree learning, the inductive bias determines how the algorithm constructs the tree and
generalizes. The key biases are:
o Decision trees prefer shorter trees with fewer nodes, assuming simpler models
generalize better (Occam’s Razor).
o Algorithms like ID3 and C4.5 select attributes that maximize information gain (or gain
ratio) at each step, assuming locally optimal splits lead to globally optimal trees.
3. Conjunctive Splits:
o Trees represent conjunctive rules (e.g., “if Outlook=Sunny AND Humidity=High, then
No”), assuming the target concept can be expressed this way.
• Bias: Prefers trees that maximize information gain at each node, favoring attributes with
many values (e.g., unique IDs). Assumes the target concept is representable as a tree and
simpler trees are better.
• Implication: May overfit by selecting attributes with high granularity, leading to complex
trees.
• Example: In the tennis dataset, ID3 selects Outlook because it maximizes information gain,
assuming this split best separates classes.
• Bias: Extends ID3 by using gain ratio (information gain divided by the split information, i.e., the
entropy of the attribute’s own value distribution) to reduce bias toward attributes with many
values. It also supports pruning, assuming pruned trees generalize better.
• Implication: More robust to overfitting than ID3, better handles noisy data.
• Example: C4.5 might avoid selecting an attribute like “Email ID” (high granularity, low
generalization) by normalizing gain.
• Generalization:
o ID3’s bias toward high-gain attributes can lead to overfitting, especially with noisy or
high-dimensional data.
o C4.5’s gain ratio and pruning improve generalization by favoring simpler, more robust
trees.
• Interpretability:
o Both produce interpretable trees, but C4.5’s pruning ensures simpler rules, aiding
user understanding.
• Performance:
o ID3 may perform well on training data but poorly on test data due to overfitting.
o C4.5 balances training and test performance, making it more practical for real-world
applications.
Example:
o ID3: May select “Loan ID” (unique per instance) due to high information gain, leading
to an overfit tree.
o C4.5: Prefers “Income” or “Credit Score” (via gain ratio), producing a generalizable
tree.
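The difference between the two criteria can be seen on a tiny table. The sketch below is illustrative: the four loan records and the function names are assumptions, and gain ratio is computed as information gain divided by the entropy of the attribute's own values (the split information).

```python
from collections import Counter
from math import log2

def entropy(items):
    n = len(items)
    return -sum((k / n) * log2(k / n) for k in Counter(items).values())

def information_gain(values, labels):
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())

def gain_ratio(values, labels):
    split_info = entropy(values)   # entropy of the attribute's own value distribution
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0

# Assumed toy loan data, for illustration only.
labels = ["Approve", "Approve", "Reject", "Reject"]
loan_id = ["L1", "L2", "L3", "L4"]         # unique per row
income = ["High", "High", "Low", "Low"]

for name, column in [("Loan ID", loan_id), ("Income", income)]:
    print(name, information_gain(column, labels), gain_ratio(column, labels))
```

Both attributes reach the maximum information gain on this table, so ID3 has no reason to prefer Income; the gain ratio penalizes Loan ID for splitting the data into four singleton branches.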
o Right: C4.5 tree (simpler, pruned).
Practical Considerations:
• ID3: Suitable for small, clean datasets but less practical for real-world noisy data.
• C4.5: Preferred for robustness, used in applications like medical diagnosis or fraud detection.
Question 2(a): Discuss the key issues in decision tree learning. Explain challenges such as
overfitting, noisy data, bias in attribute selection, and missing values. Suggest methods to address
these issues. (12 marks)
Decision trees are powerful but face several challenges that affect performance and reliability. Below
are the major issues with solutions.
1. Overfitting:
o Issue: Trees become too complex, capturing noise in training data, leading to poor
generalization.
o Example: A tree for the tennis dataset may split on specific humidity values, failing
on new data.
o Solution:
▪ Pruning: Remove branches with low predictive power (e.g., reduced error
pruning, cost-complexity pruning).
2. Noisy Data:
o Example: A mislabeled tennis example (Sunny, Warm, Weak, No) may lead to an
incorrect split.
o Solution:
o Issue: Metrics like information gain favor attributes with many values (e.g., unique
IDs), leading to poor generalization.
o Example: In a customer dataset, “Customer ID” may have high gain but no predictive
power.
o Solution:
4. Missing Values:
o Solution:
Practical Implications:
• Overfitting: Pruning and ensemble methods improve test accuracy but increase
computation.
• Noisy Data: Robust preprocessing is critical for real-world applications like medical diagnosis.
• Attribute Bias: Gain ratio and feature selection enhance model relevance.
• Missing Values: Handling strategies ensure robustness but may introduce bias if imputation
is inaccurate.
o Branches: Overfitting, Noisy Data, Attribute Bias, Missing Values (with solutions).
Example Application:
• Fraud Detection: A decision tree overfits to noisy transaction data. Switching to C4.5 with pruning
and gain ratio, and handling missing values via imputation, might raise test accuracy to around
90%.
Question 2(b): What is a perceptron in neural networks? Explain the perceptron learning algorithm
for binary classification. How does the perceptron update its weights? Illustrate with an example.
(12 marks)
Perceptron:
A perceptron is a simple neural network unit used for binary classification. It takes input features,
computes a weighted sum, applies an activation function (e.g., step function), and outputs a class
label (0 or 1). It models a linear decision boundary.
The algorithm trains the perceptron by adjusting weights to minimize classification errors:
1. Initialize:
o Weights (( w_1, w_2, ..., w_n )) and bias (( b )) to small random values or zeros.
2. Compute Output:
3. Update Weights:
4. Iterate:
• If ( t = y ): No update.
Example:
x1 x2 Label
1 1 1
2 2 1
1 0 0
Execution (η = 0.1):
o ( z = 0 \cdot 1 + 0 \cdot 1 + 0 = 0 ).
o ( y = 1 ) (correct, no update).
o ( z = 0 ).
o ( y = 1 ) (correct, no update).
o ( z = 0 ).
o ( y = 1 ) (wrong, ( t = 0 )).
o Update:
[
w_1 \gets 0 + 0.1 (0 - 1) \cdot 1 = -0.1
]
[
w_2 \gets 0 + 0.1 (0 - 1) \cdot 0 = 0
]
[
b \gets 0 + 0.1 (0 - 1) = -0.1
]
Final Weights: After convergence (e.g., ( w_1 = 0.2 ), ( w_2 = 0.2 ), ( b = -0.3 )), the perceptron
separates classes.
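The worked example can be replayed in code. This is a minimal sketch assuming a step activation that outputs 1 when ( z \geq 0 ), zero initial weights, and ( \eta = 0.1 ) as in the execution above; the function names and the epoch cap are illustrative choices.

```python
def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z >= 0 else 0   # step activation; z = 0 maps to 1, as in the example

def train_perceptron(data, eta=0.1, max_epochs=100):
    w, b = [0.0] * len(data[0][0]), 0.0
    history = []                 # weights and bias at the end of each epoch
    for _ in range(max_epochs):
        errors = 0
        for x, t in data:
            y = predict(w, b, x)
            if y != t:
                # Perceptron rule: w <- w + eta * (t - y) * x, b <- b + eta * (t - y)
                w = [wi + eta * (t - y) * xi for wi, xi in zip(w, x)]
                b += eta * (t - y)
                errors += 1
        history.append((list(w), b))
        if errors == 0:          # a full pass with no mistakes: converged
            break
    return w, b, history

data = [((1, 1), 1), ((2, 2), 1), ((1, 0), 0)]
w, b, history = train_perceptron(data)
print("after first epoch:", history[0])
print("final:", w, b)
```

The first epoch reproduces the single update shown above (only the third example is misclassified). The final weights depend on the presentation order and the tie-breaking at ( z = 0 ), so they need not equal the illustrative values quoted in the text.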
o Summation: ( z = w_1 x_1 + w_2 x_2 + b ).
Practical Considerations:
Question 3(a): Explain the concept of gradient descent and the delta rule in training neural
networks. How do these methods help in minimizing error? Provide a mathematical explanation.
(12 marks)
Gradient Descent:
Delta Rule:
The Delta Rule (or Widrow-Hoff rule) is a specific application of gradient descent for training linear
models like perceptrons or Adaline. It updates weights to minimize the mean squared error (MSE)
between predicted and actual outputs.
Mathematical Explanation:
1. Loss Function:
2. Gradient Descent:
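[
y = \sum_j w_j x_j + b
]
Gradient (for the squared-error loss ( L = \frac{1}{2} (t - y)^2 )):
[
\frac{\partial L}{\partial w_j} = -(t - y) \cdot x_j
]
Update (a step of size ( \eta ) against the gradient):
[
w_j \gets w_j - \eta \frac{\partial L}{\partial w_j} = w_j + \eta (t - y) x_j
]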
3. Delta Rule:
o The delta rule is the gradient descent update for linear models:
[
\Delta w_j = \eta (t - y) x_j
]
• Gradient Descent: Iteratively moves weights toward the loss function’s minimum by
following the steepest descent direction.
• Delta Rule: Specifically minimizes MSE for linear models, ensuring predictions align with true
outputs.
Example:
• Error: ( t - y = 3 - 0 = 3 ).
• Update:
[
w_1 \gets 0 + 0.1 \cdot 3 \cdot 1 = 0.3
]
[
w_2 \gets 0 + 0.1 \cdot 3 \cdot 1 = 0.3
]
[
b \gets 0 + 0.1 \cdot 3 = 0.3
]
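The same update can be written as a short function. In this sketch the input ( x = (1, 1) ), target ( t = 3 ), and ( \eta = 0.1 ) are inferred from the arithmetic above rather than stated in the text, and the function name is hypothetical.

```python
def delta_rule_step(w, b, x, t, eta):
    """One delta-rule update for a linear unit; returns new weights, bias, and the error."""
    y = sum(wi * xi for wi, xi in zip(w, x)) + b    # linear output, no threshold
    error = t - y
    new_w = [wi + eta * error * xi for wi, xi in zip(w, x)]
    new_b = b + eta * error
    return new_w, new_b, error

w, b = [0.0, 0.0], 0.0
x, t, eta = (1, 1), 3, 0.1      # inferred from the worked numbers above

for step in range(3):
    w, b, error = delta_rule_step(w, b, x, t, eta)
    print(step, error, w, b)
```

For this single example each step shrinks the error by the same factor, which illustrates how repeated updates move the output toward the target.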
• Figure 5: A gradient descent diagram:
o Y-axis: Loss ( L ).
Practical Considerations:
• Convergence: Proper ( \eta ) ensures convergence; too high causes divergence, too low slows
learning.
• Applications: Gradient descent is used in neural networks, logistic regression; delta rule in
perceptrons, Adaline.
Question 3(b): What is Adaline? How does it differ from a simple perceptron? Explain the working
of Adaline with a focus on the learning rule and cost function used. (12 marks)
Adaline:
Adaline (Adaptive Linear Neuron) is a single-layer neural network model for binary classification or
regression. It computes a weighted sum of inputs, applies a linear output (not a step function), and
uses gradient descent to minimize the mean squared error (MSE).
Aspect | Perceptron | Adaline
Convergence | Only for linearly separable data | More robust, even for noisy data
Working of Adaline:
1. Architecture:
2. Cost Function:
o MSE:
[
J(w, b) = \frac{1}{2n} \sum_{i=1}^n (t_i - z_i)^2
]
where ( t_i ) is the true output, ( z_i ) is the linear output.
4. Training:
Example:
x1 x2 Label
1 1 1
1 0 0
o ( z = 0 \cdot 1 + 0 \cdot 1 + 0 = 0 ).
o Error: ( t - z = 1 - 0 = 1 ).
o Update:
[
w_1 \gets 0 + 0.1 \cdot 1 \cdot 1 = 0.1
]
[
w_2 \gets 0 + 0.1 \cdot 1 \cdot 1 = 0.1
]
[
b \gets 0 + 0.1 \cdot 1 = 0.1
]
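A compact sketch of Adaline training on the two-row table above. Assumptions are labeled in the comments: updates are applied per example, training runs for 200 epochs, and the 0.5 classification threshold for 0/1 labels is a choice made here, not something the text specifies.

```python
def adaline_cost(w, b, data):
    """J(w, b) = (1 / 2n) * sum_i (t_i - z_i)^2 with linear output z_i."""
    n = len(data)
    return sum((t - (sum(wi * xi for wi, xi in zip(w, x)) + b)) ** 2
               for x, t in data) / (2 * n)

def adaline_epoch(w, b, data, eta):
    """One pass over the data with per-example delta-rule updates."""
    for x, t in data:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b   # linear output
        w = [wi + eta * (t - z) * xi for wi, xi in zip(w, x)]
        b = b + eta * (t - z)
    return w, b

def classify(w, b, x, threshold=0.5):
    # Threshold of 0.5 is an assumption suited to 0/1 labels.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z >= threshold else 0

data = [((1, 1), 1), ((1, 0), 0)]
w, b = [0.0, 0.0], 0.0
costs = [adaline_cost(w, b, data)]
for _ in range(200):                                   # epoch count chosen for illustration
    w, b = adaline_epoch(w, b, data, eta=0.1)
    costs.append(adaline_cost(w, b, data))

print("initial cost:", costs[0], "final cost:", costs[-1])
```

Unlike the perceptron, the update uses the linear output ( z ) before thresholding, so the weights keep moving until the squared error itself is small.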
o Output: Linear ( z ), threshold for classification.
Practical Considerations:
Question 4(a): What are multilayer neural networks? Explain the architecture and significance of
hidden layers. How do multilayer networks solve non-linearly separable problems? (12 marks)
Multilayer neural networks, often called multilayer perceptrons (MLPs), are neural networks with one
or more hidden layers between the input and output layers. They are used for complex tasks like
classification, regression, and pattern recognition, and can model non-linear relationships.
Architecture:
1. Input Layer: Receives features (e.g., ( x_1, x_2, ..., x_n )).
2. Hidden Layers:
o Contain neurons that transform inputs via weights, biases, and activation functions
(e.g., ReLU, sigmoid).
4. Connections: Fully connected layers where each neuron in one layer connects to every
neuron in the next.
• Feature Transformation: Hidden layers learn hierarchical features (e.g., edges in images →
shapes → objects).
• Non-Linearity: Non-linear activation functions (e.g., ReLU) enable modeling of complex, non-
linear relationships.
• Capacity: More layers/neurons increase the model’s ability to capture intricate patterns.
• Perceptron Limitation: Single-layer perceptrons fail for non-linearly separable data (e.g., XOR
problem).
• MLP Solution:
o Example: For XOR (( (0,0) \rightarrow 0 ), ( (0,1) \rightarrow 1 ), ( (1,0) \rightarrow 1
), ( (1,1) \rightarrow 0 )):
▪ A single hidden layer with two neurons can learn to separate XOR by
combining linear boundaries into a non-linear one.
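One way to see this is to set the weights by hand. In the sketch below the weights are chosen manually for illustration (they are not learned), and a step activation is assumed: one hidden neuron acts as OR, the other as AND, and the output fires for "OR but not AND".

```python
def step(z):
    return 1 if z >= 0 else 0

def xor_mlp(x1, x2):
    h_or = step(x1 + x2 - 0.5)       # fires when at least one input is 1
    h_and = step(x1 + x2 - 1.5)      # fires only when both inputs are 1
    return step(h_or - h_and - 0.5)  # fires when OR is on and AND is off

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", xor_mlp(x1, x2))
```

Each hidden neuron draws one straight line in the input plane; the output neuron combines the two half-planes into the band between them, which no single line can describe.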
Example:
• Architecture:
Practical Considerations:
Question 4(b): Explain the backpropagation algorithm step-by-step. How does it work in training
multilayer perceptrons (MLPs)? Illustrate with an example and the flow of error and weight
updates. (12 marks)
Backpropagation Algorithm:
Step-by-Step:
1. Forward Pass:
o Compute the output of each neuron from input to output layer using weights,
biases, and activation functions.
2. Backward Pass:
o Compute the gradient of the loss with respect to each weight using the chain rule.
3. Weight Update:
o Update weights:
[
w_{ij} \gets w_{ij} + \eta \delta_j x_i
]
where ( x_i ) is the input to the neuron, ( \eta ) is the learning rate.
4. Iterate:
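The steps above can be sketched for a very small network. Everything specific here is an assumption made for illustration: a 2-2-1 architecture with sigmoid activations, all weights initialized to 0.5, biases to 0, ( \eta = 0.5 ), squared-error loss, and ( \delta ) defined as the negative derivative of the loss with respect to a neuron's net input (so the update adds ( \eta \delta x )).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    W, b, v, c = params   # hidden weights, hidden biases, output weights, output bias
    h = [sigmoid(sum(W[j][i] * x[i] for i in range(len(x))) + b[j]) for j in range(len(W))]
    y = sigmoid(sum(v[j] * h[j] for j in range(len(h))) + c)
    return h, y

def loss(params, x, t):
    return 0.5 * (t - forward(params, x)[1]) ** 2

def backprop_step(params, x, t, eta):
    W, b, v, c = params
    h, y = forward(params, x)                                   # 1. forward pass
    # 2. backward pass: error terms via the chain rule
    delta_out = (t - y) * y * (1 - y)
    delta_hid = [h[j] * (1 - h[j]) * v[j] * delta_out for j in range(len(h))]
    # 3. weight update: w <- w + eta * delta * input
    new_v = [v[j] + eta * delta_out * h[j] for j in range(len(h))]
    new_c = c + eta * delta_out
    new_W = [[W[j][i] + eta * delta_hid[j] * x[i] for i in range(len(x))]
             for j in range(len(W))]
    new_b = [b[j] + eta * delta_hid[j] for j in range(len(W))]
    return (new_W, new_b, new_v, new_c)

params = ([[0.5, 0.5], [0.5, 0.5]], [0.0, 0.0], [0.5, 0.5], 0.0)   # assumed initial values
x, t = (1, 1), 0
before = loss(params, x, t)
params = backprop_step(params, x, t, eta=0.5)
after = loss(params, x, t)
print(before, after)
```

Step 4 would repeat backprop_step over all training examples for many epochs; a single step is shown so the flow of the error terms from the output back to the hidden layer stays visible.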
Example:
• Architecture:
• Data: (1, 1) → 0.
• Forward Pass:
o Loss: ( L = \frac{1}{2} (0 - y)^2 ).
• Backward Pass:
Practical Considerations:
Question 5(a): What is convergence in the context of training neural networks? Discuss the factors
that affect convergence such as learning rate, activation functions, and data normalization. (12
marks)
Convergence:
Convergence in neural network training occurs when the optimization algorithm (e.g., gradient
descent) reaches a state where the loss function stabilizes at a minimum (global or local), and further
iterations yield negligible improvements.
o Impact:
2. Activation Functions:
o Impact:
▪ ReLU: Faster convergence, but can suffer from the dying ReLU problem (neurons that output 0 for
every input and stop updating).
3. Data Normalization:
o Role: Scales input features to a standard range (e.g., [0,1] or mean=0, std=1).
o Impact:
▪ Example: Features with different scales (e.g., age vs. income) skew weight
updates.
Example:
o With normalization (pixels to [0,1]), gradients are stable, converging in fewer epochs.
o Using ReLU and Adam (( \eta = 0.001 )) typically gives fast, stable convergence.
o X-axis: Epochs.
o Y-axis: Loss.
o Curves: High ( \eta ) (diverges), Low ( \eta ) (slow), Optimal ( \eta ) (converges).
Practical Considerations:
Question 5(b): Define generalization in machine learning. How do neural networks generalize to
unseen data? Discuss techniques such as regularization, dropout, and early stopping to improve
generalization. (12 marks)
Generalization:
Generalization in machine learning is the ability of a model to perform well on unseen (test) data,
not just training data. A model generalizes if it learns the underlying patterns rather than memorizing
the training set.
• Neural networks generalize by learning features (e.g., edges in images) that are relevant
across the data distribution.
1. Regularization:
2. Dropout:
3. Early Stopping:
o Method: Monitor validation loss and stop training when it stops decreasing.
o Impact: Prevents overfitting by halting before the model memorizes training data.
o Example: Training an MLP stops after 50 epochs when validation loss plateaus,
ensuring good generalization.
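A minimal sketch of the stopping rule, assuming a patience parameter (the number of epochs to wait without improvement, a common convention that the text does not specify) and a made-up sequence of validation losses.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch with the best validation loss seen before stopping."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break   # no improvement for `patience` consecutive epochs
    return best_epoch

# Illustrative validation losses: improve, then plateau and rise.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
print(early_stopping_epoch(val_losses))  # 3
```

In a real training loop the model weights from the returned epoch would be restored, since the later epochs have already started to fit the training set more closely than the validation set.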
Example:
o With L2 regularization, dropout (50%), and early stopping: Test accuracy improves to
90%.
o X-axis: Epochs.
o Y-axis: Accuracy.
o Curves: Training (increases), Test (peaks then drops without techniques, stable with
techniques).
Practical Considerations:
• Trade-off: Regularization may reduce training accuracy but improves test performance.
• Applications: Essential for deep learning in real-world tasks like autonomous driving, NLP.
Machine Learning Assignment Answer Booklet - Unit 3
Unit-3
Question 1(a): What are the key metrics used to evaluate machine learning models? Discuss in
detail evaluation techniques like accuracy, precision, recall, F1-score, ROC curve, and AUC. Provide
examples where each metric is most appropriate. (12 marks)
Evaluating machine learning models is essential to assess their performance and suitability for
specific tasks. The key metrics include Accuracy, Precision, Recall, F1-Score, ROC Curve, and AUC,
each suited to different scenarios based on the problem’s requirements and data characteristics.
1. Accuracy:
2. Precision:
o When Appropriate: When false positives are costly (e.g., spam detection, where
marking legitimate emails as spam is undesirable).
o Example: In spam detection, precision of 0.95 means 95% of emails flagged as spam
are actually spam.
3. Recall (Sensitivity):
o When Appropriate: When false negatives are costly (e.g., medical diagnosis, where
missing a disease is critical).
o Example: In cancer detection, recall of 0.98 means 98% of cancer cases are
identified.
4. F1-Score:
o When Appropriate: Imbalanced datasets or when balancing precision and recall is
needed.
o Example: In fraud detection, an F1-score of 0.85 balances catching fraud (recall) and
avoiding false alarms (precision).
5. ROC Curve:
o Definition: A plot of True Positive Rate (Recall) vs. False Positive Rate (( FPR =
\frac{FP}{FP + TN} )) at various classification thresholds.
o Example: In disease prediction, the ROC curve shows how well the model
distinguishes diseased from healthy patients. A curve closer to the top-left indicates
better performance.
6. AUC:
o Definition: The area under the ROC curve, ranging from 0 to 1 (1 = perfect model, 0.5
= random guessing).
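To make the metrics above concrete, here is a minimal pure-Python sketch that computes them for a binary problem. The labels, predictions, and scores are made-up illustration data, and the AUC is computed with the pairwise-ranking formulation (the fraction of positive/negative pairs ranked correctly), which equals the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 (0.0 when a ratio is undefined)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
print("AUC:", auc(y_true, scores))
```

Precision, recall, and F1 depend on the chosen threshold (here already applied in `y_pred`), whereas AUC is computed from the raw scores and is threshold-independent.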
Practical Scenarios:
Practical Considerations:
• Metric Selection: Choose based on cost of errors (e.g., false positives vs. false negatives).
Question 1(b): Explain the process of model selection in machine learning. What is the role of
cross-validation in model selection? Describe different cross-validation strategies such as k-fold,
stratified k-fold, and leave-one-out. (12 marks)
Model selection is the process of choosing the best machine learning model and its hyperparameters
to achieve optimal performance on unseen data. It involves comparing different algorithms (e.g.,
SVM, Random Forest) and tuning their parameters (e.g., learning rate, tree depth).
1. Define the Problem: Specify the task (classification, regression), performance metric (e.g.,
accuracy, F1-score), and dataset.
2. Select Candidate Models: Choose a set of algorithms based on the problem (e.g., linear
models for simple data, neural networks for complex data).
3. Split Data: Divide the dataset into training, validation, and test sets so that overfitting can be detected and the final performance estimate is unbiased.
5. Hyperparameter Tuning:
o Use techniques like grid search or random search to find optimal hyperparameters.
7. Final Selection: Choose the model with the best validation performance and evaluate on the
test set.
Role of Cross-Validation:
Cross-validation (CV) is a technique to assess model performance by splitting the data into multiple
subsets, training on some subsets, and validating on others. It:
• Maximizes Data Use: Ensures all data is used for both training and validation.
Cross-Validation Strategies:
1. K-Fold Cross-Validation:
o Process: Divide the dataset into ( k ) equal folds. Train on ( k-1 ) folds, validate on the
remaining fold. Repeat ( k ) times, each fold serving as the validation set once.
Average the performance metrics.
o Example: For ( k=5 ), split 1000 samples into 5 folds of 200. Train on 800, validate on
200, repeat 5 times.
2. Stratified K-Fold Cross-Validation:
o Process: Similar to k-fold, but ensures each fold has the same class distribution as
the original dataset.
o Example: In a dataset with 80% class A and 20% class B, each fold maintains this
ratio.
3. Leave-One-Out Cross-Validation (LOOCV):
o Process: For ( n ) samples, train on ( n-1 ) samples, validate on the remaining one.
Repeat ( n ) times.
o Example: For 100 samples, train on 99, validate on 1, repeat 100 times.
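The three splitting strategies above can be sketched as index generators. This is a simplified illustration, not a library implementation: it does not shuffle the data (in practice the indices are usually shuffled first), and the function names are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for k contiguous folds of near-equal size."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0)
             for i in range(k)]
    start = 0
    for size in sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def leave_one_out_indices(n_samples):
    """LOOCV is k-fold with k equal to the number of samples."""
    return k_fold_indices(n_samples, n_samples)


def stratified_k_fold_indices(labels, k):
    """Yield (train_idx, val_idx) with class proportions kept in each fold.

    Each class's indices are dealt round-robin across the k folds.
    """
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for class_indices in by_class.values():
        for j, i in enumerate(class_indices):
            folds[j % k].append(i)
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, val


# The examples from the text: 1000 samples with k=5, and an 80/20 class split.
for train, val in k_fold_indices(1000, 5):
    print("k-fold:", len(train), "train /", len(val), "validation")

labels = ["A"] * 80 + ["B"] * 20
for train, val in stratified_k_fold_indices(labels, 5):
    n_b = sum(1 for i in val if labels[i] == "B")
    print("stratified fold:", len(val), "samples,", n_b, "of class B")
```

The model is trained and scored once per yielded pair, and the scores are averaged to give the cross-validation estimate.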
Practical Considerations:
• K-Fold: Use ( k=5 ) or ( k=10 ) for balance between computation and reliability.
• LOOCV: Suitable for small datasets but impractical for large ones.
• Applications: Model selection for tasks like spam detection, medical diagnosis.
Question 2(a): What is the bias-variance tradeoff in machine learning? How does it affect model
selection and generalization? Explain with graphical representation and real-world implications.
(12 marks)
Bias-Variance Tradeoff:
The bias-variance tradeoff is a fundamental concept in machine learning that describes the balance
between a model’s ability to fit the training data (low bias) and its ability to generalize to unseen data
(low variance). The total expected error of a model is:
[ \text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} ]
• Bias: Error due to overly simplistic models (underfitting). High-bias models fail to capture
data patterns (e.g., linear model for non-linear data).
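The decomposition Error = Bias^2 + Variance + Irreducible Error can be written out term by term for squared error. The notation here is an assumption introduced for illustration: the data follow ( y = f(x) + \varepsilon ) with noise variance ( \sigma^2 ), ( \hat{f} ) is the learned model, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible Error}}
```

The first term measures how far the average model is from the true function, the second how much the model changes from one training set to another, and the third is noise that no model can remove.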
Graphical Representation:
▪ Y-axis: Error.
▪ Curves: Bias (decreases with complexity), Variance (increases), Total Error (U-
shaped, minimum at optimal complexity).
Real-World Implications:
o High Bias: Linear regression assumes a linear relationship, so it underfits non-linear data
(high error on both training and test sets).
o High Variance: A 10-layer neural network overfits to training data quirks, performs
poorly on test data.
o Balanced: A Random Forest with tuned depth balances fit and generalization,
achieving low test error.
• Applications:
o Fraud Detection: High-bias models may miss subtle fraud patterns. Tuning for
balance improves detection rates.
Practical Considerations:
• Model Selection: Use cross-validation to find the sweet spot (e.g., optimal tree depth).
Question 2(b): What are ensemble methods in machine learning? Explain the concept of ensemble
learning and how it helps in improving the performance of weak learners. (12 marks)
Ensemble Methods:
Ensemble methods combine multiple machine learning models (base learners) to create a stronger
model with better performance than any individual learner. They leverage diversity among models to
improve accuracy, robustness, and generalization.
• Definition: Ensemble learning aggregates predictions from multiple models (e.g., decision
trees, neural networks) to make a final prediction, often via voting (classification) or
averaging (regression).
• Weak Learners: Models that perform slightly better than random guessing (e.g., shallow
decision trees).
1. Bagging:
o Train multiple models on random subsets of data (with replacement).
2. Boosting:
3. Stacking:
Example:
• Mechanism: Each tree may overfit to different noise, but averaging reduces variance,
improving performance.
Practical Considerations:
Question 3(a): Explain the Bagging (Bootstrap Aggregation) method. How does Random Forest
build on the Bagging technique? Discuss the working of Random Forest with its advantages and
limitations. (12 marks)
Bagging is an ensemble method that reduces variance by training multiple models on random
subsets of the training data and aggregating their predictions.
Bagging Process:
1. Bootstrap Sampling:
2. Train Models:
3. Aggregate Predictions:
o Regression: Averaging.
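The bootstrap and aggregation steps above can be sketched in a few lines. The base learner itself is left out; the helper names, the seed, and the example predictions are assumptions made for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Aggregate class predictions: return the most common label."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Aggregate regression predictions: return their mean."""
    return sum(predictions) / len(predictions)


rng = random.Random(0)            # fixed seed so the run is repeatable
data = list(range(10))            # stand-in for ten training examples
samples = [bootstrap_sample(data, rng) for _ in range(3)]
for s in samples:
    # Duplicates appear and some examples are left out of each sample.
    print(sorted(s))

# Hypothetical predictions from three separately trained models.
print(majority_vote(["spam", "ham", "spam"]))
print(average([200.0, 220.0, 210.0]))
```

Each model would be trained on one bootstrap sample; because the samples differ, the models make different errors, and aggregation averages those errors out.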
Random Forest:
Random Forest builds on Bagging by introducing random feature selection to increase model
diversity:
• Modifications:
o At each node, only a random subset of features (e.g., ( \sqrt{n} ) for ( n ) features) is
considered for splitting.
3. Train Trees:
▪ At each node, select a random subset of features and choose the best split.
4. Predict:
5. Output: Final prediction.
Example:
o Random Forest: 100 trees, each trained on a bootstrap sample, using ( \sqrt{10}
\approx 3 ) features per split.
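The per-split feature sampling in this example can be sketched as follows. Rounding ( \sqrt{n} ) down is an assumption (implementations differ in how they round), and the function names are hypothetical.

```python
import math
import random


def features_per_split(n_features):
    """Number of candidate features per split: floor(sqrt(n)), at least 1."""
    return max(1, int(math.sqrt(n_features)))


def random_feature_subset(n_features, rng):
    """Pick the feature indices that one tree node is allowed to consider."""
    k = features_per_split(n_features)
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(42)
print("features per split:", features_per_split(10))
for node in range(3):
    # A fresh subset is drawn at every node, so different nodes and
    # different trees split on different features.
    print("node", node, "considers features", random_feature_subset(10, rng))
```

Restricting each split to a random subset keeps strong features from dominating every tree, which lowers the correlation between trees.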
Advantages:
Limitations:
Practical Considerations:
Question 3(b): What is Boosting in machine learning? Describe the working of popular boosting
algorithms like AdaBoost and Gradient Boosting. How do they differ from bagging methods? (12
marks)
Boosting:
Boosting is an ensemble method that combines multiple weak learners (e.g., shallow trees)
sequentially, where each learner focuses on correcting the errors of its predecessors, improving
overall performance.
Working of Boosting:
1. Initialize: Assign equal weights to all training samples.
2. Train Sequentially:
3. Combine:
AdaBoost:
• Process:
▪ Normalize weights.
3. Final prediction: Weighted majority vote: ( \hat{y} = \text{sign} \left( \sum_t \alpha_t
h_t(x) \right) ).
• Example: When classifying spam emails with decision stumps, AdaBoost achieves 90% accuracy by
focusing on hard-to-classify emails.
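A minimal AdaBoost sketch with decision stumps on one-dimensional data is given below. It follows the standard formulation with labels in {-1, +1} and stump weight alpha = 0.5 * ln((1 - err) / err); the toy dataset and all function names are assumptions made for illustration.

```python
import math


def stump_predict(x, threshold, polarity):
    """A stump predicts `polarity` when x > threshold, else -polarity."""
    return polarity if x > threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    values = sorted(set(xs))
    thresholds = [values[0] - 1.0] + [(a + b) / 2.0
                                      for a, b in zip(values, values[1:])]
    best = None
    for thr in thresholds:
        for pol in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, thr, pol) != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best


def adaboost(xs, ys, rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n                    # start with equal weights
    ensemble = []
    for _ in range(rounds):
        err, thr, pol = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, pol))
        # Increase weights of misclassified samples, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(x, thr, pol))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]  # normalize weights
    return ensemble


def predict(ensemble, x):
    """Weighted majority vote: sign of the alpha-weighted stump outputs."""
    score = sum(a * stump_predict(x, thr, pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single stump can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

No single stump classifies this data correctly, but the weighted vote of three stumps does, because each round concentrates on the samples the previous stumps got wrong.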
Gradient Boosting:
• Process:
• Example: When predicting house prices, Gradient Boosting reduces MSE by fitting each new tree to
the residuals of the current ensemble.
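Fitting to residuals can be sketched with regression stumps on a single feature. For squared error the residuals are the negative gradient of the loss, which is why this counts as gradient boosting. The house sizes, prices, learning rate, and function names below are hypothetical.

```python
def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def fit_stump(xs, residuals):
    """Least-squares regression stump: returns (threshold, left, right).

    Assumes xs contains at least two distinct values.
    """
    values = sorted(set(xs))
    best = None
    for a, b in zip(values, values[1:]):
        thr = (a + b) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    return best[1], best[2], best[3]


def gradient_boost(xs, ys, rounds=20, learning_rate=0.5):
    base = sum(ys) / len(ys)            # initial prediction: the mean
    preds = [base] * len(ys)
    stumps = []
    history = [mse(ys, preds)]
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, lm, rm = fit_stump(xs, residuals)   # fit the current errors
        stumps.append((thr, lm, rm))
        preds = [p + learning_rate * (lm if x <= thr else rm)
                 for x, p in zip(xs, preds)]
        history.append(mse(ys, preds))
    return base, stumps, history


def predict(base, stumps, x, learning_rate=0.5):
    return base + sum(learning_rate * (lm if x <= thr else rm)
                      for thr, lm, rm in stumps)


# Hypothetical data: house size (square metres) and price (thousands).
sizes = [50, 60, 80, 100, 120, 150]
prices = [150, 180, 240, 310, 360, 450]
base, stumps, history = gradient_boost(sizes, prices)
print(f"MSE before boosting: {history[0]:.1f}")
print(f"MSE after boosting:  {history[-1]:.1f}")
```

The learning rate shrinks each stump's contribution; smaller values need more rounds but usually generalize better.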
Differences from Bagging:
Practical Considerations:
Question 4(a): Compare and contrast Bagging and Boosting. Highlight their approach to error
reduction, bias-variance impact, and use-cases with suitable examples. (12 marks)
Aspect | Bagging | Boosting
Computational Cost | Moderate (parallelizable) | High (sequential)
• Bagging:
• Boosting:
Bias-Variance Impact:
• Bagging:
o Reduces variance by averaging, but does not reduce bias: if the base learners underfit, the
ensemble underfits too. The variance reduction is also smaller when the trees are highly correlated.
o Example: Random Forest reduces variance in fraud detection, but shallow trees may
underfit complex patterns.
• Boosting:
o Reduces bias by iteratively fitting to errors; the weighted combination can also reduce
variance, although too many rounds may overfit noisy data.
o Example: Gradient Boosting fits complex house price patterns (low bias), maintains
low variance with proper tuning.
Use-Cases:
o Left: Bagging (parallel trees, voting).
Practical Considerations:
Question 4(b): What is rule-based learning in machine learning? How does it differ from decision
tree learning? Describe the process of learning a set of if-then rules from data. (12 marks)
Rule-Based Learning:
Rule-based learning is a machine learning approach that induces a set of if-then rules from data to
make predictions or classifications. Each rule has a condition (if) and a conclusion (then),
representing patterns in the data.
Aspect | Rule-Based Learning | Decision Tree Learning
Prediction | Rules applied independently, combined via ordering or voting | Traverse tree from root to leaf
Flexibility | Can represent disjunctive concepts | Represents conjunctive paths
Interpretability | Rules are modular, easy to understand | Tree paths can be complex
2. Rule Induction:
o Generate a rule by finding conditions (attribute-value pairs) that cover many positive
instances and few negatives.
o Example: For spam detection, induce “If subject contains ‘win’, then spam.”
3. Evaluate Rule:
4. Remove Covered Instances:
5. Repeat:
o Induce new rules for remaining instances until all are covered or a stopping criterion
is met.
6. Post-Processing:
Example:
o Rule 1: If Income > 50K AND Credit Score > 700, then Approve.
o Process: Induce Rule 1 (covers 60% approvals), remove covered instances, induce
Rule 2, etc.
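The induce, remove, repeat loop above is often called sequential covering. The sketch below is a simplified greedy version, not RIPPER or CN2: it grows each rule by adding the attribute test with the highest precision on the instances still covered. The dataset, attribute names, and function names are hypothetical.

```python
def covers(rule, row):
    """A rule is a list of (attribute, value) tests joined by AND."""
    return all(row[attr] == val for attr, val in rule)


def learn_one_rule(rows, target, label):
    """Greedily add tests until the rule covers no negative instances."""
    rule, covered = [], rows
    while any(r[target] != label for r in covered):
        candidates = {(a, r[a]) for r in covered
                      for a in r if a != target} - set(rule)
        best, best_key = None, None
        for attr, val in sorted(candidates, key=repr):
            subset = [r for r in covered if r[attr] == val]
            pos = sum(1 for r in subset if r[target] == label)
            if pos == 0:
                continue
            key = (pos / len(subset), pos)   # precision, then positives
            if best_key is None or key > best_key:
                best, best_key = (attr, val), key
        if best is None:
            break
        narrowed = [r for r in covered if r[best[0]] == best[1]]
        if len(narrowed) == len(covered):
            break                            # no test narrows the rule
        rule.append(best)
        covered = narrowed
    return rule


def sequential_covering(rows, target, label):
    remaining, rules = list(rows), []
    while any(r[target] == label for r in remaining):
        rule = learn_one_rule(remaining, target, label)
        rules.append(rule)
        if not rule:
            break        # empty rule: everything left is already positive
        remaining = [r for r in remaining if not covers(rule, r)]
    return rules


def row(income, credit, guarantor, approve):
    return {"income_gt_50k": income, "credit_gt_700": credit,
            "has_guarantor": guarantor, "approve": approve}


# Hypothetical loan data: 5 approvals and 4 rejections.
data = [
    row(True, True, False, True), row(True, True, False, True),
    row(True, True, False, True), row(True, False, False, False),
    row(False, True, False, False), row(False, False, True, True),
    row(False, True, True, True), row(False, False, False, False),
    row(True, False, True, False),
]
rules = sequential_covering(data, target="approve", label=True)
for i, rule in enumerate(rules, start=1):
    tests = " AND ".join(f"{a} = {v}" for a, v in rule)
    print(f"Rule {i}: If {tests}, then Approve")
```

On this data the first learned rule combines the income and credit tests and covers 3 of the 5 approvals; the second rule is learned only from the instances the first rule left uncovered.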
Practical Considerations:
Question 5(a): What are the advantages and limitations of rule learning approaches? In which
scenarios are rule-based systems preferable over other models like decision trees or neural
networks? (12 marks)
Advantages:
1. Interpretability:
o Rules are human-readable (e.g., “If age > 30, then high risk”), aiding transparency.
2. Modularity:
o Example: Adding a new fraud detection rule without retraining.
3. Flexibility:
4. Incremental Learning:
Limitations:
1. Overfitting:
o Example: A rule for specific spam keywords may fail on new spam types.
2. Scalability:
3. Coverage Issues:
4. Complexity in Learning:
o Medical Systems: Doctors need clear rules (e.g., “If fever > 100°F, then test for
infection”).
o Business Rules: Retail systems use rules like “If purchase > $100, then offer
discount.”
• Incremental Updates:
o Fraud Detection: New fraud patterns require adding rules without retraining.
o Expert Systems: Rules for simple domains (e.g., loan approval) are effective.
• Regulatory Compliance:
o Finance: Rules ensure transparency and compliance (e.g., “If credit score < 600,
reject loan”).
• Decision Trees: More compact but less flexible for disjunctive concepts.
• Preference: Rule-based systems are chosen when interpretability and modularity outweigh
predictive power.
Practical Considerations:
• Use Case: Best for domains requiring transparency and rule-based logic.
Question 5(b): Discuss common algorithms for rule learning such as RIPPER or CN2. How are rules
evaluated and pruned to avoid overfitting? Include the role of coverage, accuracy, and rule
ordering. (12 marks)
1. RIPPER:
o Process:
o Example: For spam detection, RIPPER might learn “If ‘win’ in subject, then spam,”
then prune to generalize.
2. CN2:
o Process:
▪ Uses a beam search to find the best rule based on a quality metric (e.g.,
accuracy).
▪ Supports ordered or unordered rule sets.
o Example: CN2 learns rules for loan approval, evaluating each for accuracy and
coverage.
Rule Evaluation:
• Other Metrics: Information gain, Laplace estimate (adjusts for small samples).
• Pruning:
o Example: Change “If age=30 AND income=50K, then approve” to “If income=50K,
then approve.”
• Methods:
Rule Ordering:
• Ordered Rules: Rules are applied sequentially; the first matching rule determines the
prediction.
o Example: “If income > 50K, approve; else if credit < 600, reject; else approve.”
• Role: Ordering affects interpretability and performance; ordered rules always give an
unambiguous prediction, but each rule must be read in light of the rules that precede it.
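An ordered rule set (a decision list) can be sketched as a list of (condition, outcome) pairs checked in sequence. The thresholds come from the example above; the function name, the dictionary keys, and the applicants are assumptions made for illustration.

```python
def predict_ordered(rules, default, applicant):
    """Return the outcome of the first rule whose condition matches."""
    for condition, outcome in rules:
        if condition(applicant):
            return outcome
    return default          # the final "else" branch


# "If income > 50K, approve; else if credit < 600, reject; else approve."
rules = [
    (lambda a: a["income"] > 50_000, "approve"),
    (lambda a: a["credit"] < 600, "reject"),
]

# Hypothetical applicants.
high_income_low_credit = {"income": 60_000, "credit": 550}
low_income_low_credit = {"income": 40_000, "credit": 550}
low_income_ok_credit = {"income": 40_000, "credit": 650}

print(predict_ordered(rules, "approve", high_income_low_credit))
print(predict_ordered(rules, "approve", low_income_low_credit))
print(predict_ordered(rules, "approve", low_income_ok_credit))

# Reversing the order changes the decision for the first applicant.
print(predict_ordered(list(reversed(rules)), "approve", high_income_low_credit))
```

The first applicant matches both rules, so the outcome depends entirely on which rule comes first; this is why ordering affects performance as well as interpretation.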
Example:
• Dataset: Loan approval.
o RIPPER:
▪ Rule: “If income > 50K AND credit > 700, then approve” (coverage: 60%,
accuracy: 95%).
▪ Prune to “If income > 50K, then approve” (coverage: 70%, accuracy: 90%).
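The coverage and accuracy figures in this example can be reproduced with a short sketch. It assumes that coverage means the fraction of all instances a rule fires on and accuracy means the fraction of covered instances that are actually approved; the 100-applicant dataset is constructed by hand to match the percentages above and is not real data.

```python
def evaluate_rule(condition, rows, label="approve"):
    """Return (coverage, accuracy) of a rule that predicts `label`."""
    covered = [r for r in rows if condition(r)]
    coverage = len(covered) / len(rows)
    if not covered:
        return coverage, 0.0
    correct = sum(1 for r in covered if r["outcome"] == label)
    return coverage, correct / len(covered)


def applicants(count, income, credit, outcome):
    return [{"income": income, "credit": credit, "outcome": outcome}] * count


# Hypothetical dataset of 100 applicants.
rows = (
    applicants(57, 80_000, 750, "approve")
    + applicants(3, 80_000, 750, "reject")
    + applicants(6, 80_000, 650, "approve")
    + applicants(4, 80_000, 650, "reject")
    + applicants(30, 30_000, 650, "reject")
)


def full_rule(r):
    return r["income"] > 50_000 and r["credit"] > 700


def pruned_rule(r):
    return r["income"] > 50_000      # credit condition removed


for name, rule in [("full", full_rule), ("pruned", pruned_rule)]:
    coverage, accuracy = evaluate_rule(rule, rows)
    print(f"{name}: coverage {coverage:.0%}, accuracy {accuracy:.0%}")
```

Dropping the credit condition makes the rule fire on more applicants but lets in more rejections, which is the coverage versus accuracy trade-off that pruning has to balance.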
Practical Considerations: