Machine Learning
Limitations
Prone to Overfitting: Decision trees can become too complex, capturing
noise instead of the underlying pattern.
Instability: A small change in the data can result in a completely different
tree.
Suboptimal Splits: The greedy algorithms used in tree building may not find
the globally optimal split.
Enhancements
To overcome these limitations, advanced techniques are often used:
Pruning: Removing unnecessary branches to simplify the tree and
reduce overfitting.
Ensemble Methods:
o Random Forest: Combines multiple decision trees to improve
accuracy and robustness.
o Gradient Boosting: Builds trees sequentially to correct errors from
previous ones.
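A minimal Python sketch of these enhancements, assuming scikit-learn is available; the synthetic dataset and parameter values are placeholders, not tuned settings:

# Sketch of pruning and ensemble methods, assuming scikit-learn;
# the synthetic data and parameter values are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pruning: ccp_alpha > 0 removes branches that add little predictive value
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

# Random Forest: many trees on bootstrapped samples, combined by voting
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Gradient Boosting: trees added sequentially, each correcting earlier errors
boosted = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

for name, model in [("Pruned tree", pruned_tree), ("Random Forest", forest), ("Gradient Boosting", boosted)]:
    print(name, model.score(X_test, y_test))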
Example
Suppose you want to classify whether someone buys a product. A decision tree
might look like this:
                Age < 30?
               /         \
             Yes          No
             /              \
      Income > 50K?      Buys: Yes
         /      \
       Yes       No
       /           \
  Buys: Yes     Buys: No
This tree shows decisions based on age and income to predict if someone will
make a purchase.
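A minimal sketch of fitting such a tree in Python, assuming scikit-learn; the small age/income dataset below is invented purely for illustration:

# Hypothetical [age, income in K] data invented for illustration only
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 60], [22, 55], [28, 30], [24, 45], [45, 40], [35, 25], [60, 35], [50, 48]]
y = [1, 1, 0, 0, 1, 1, 1, 1]   # 1 = buys, 0 = does not buy

tree = DecisionTreeClassifier(max_depth=2)   # max_depth limits complexity (a simple form of pre-pruning)
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "income_k"]))  # learned rules roughly mirror the diagram above
print(tree.predict([[27, 70]]))   # e.g. a 27-year-old earning 70K -> predicted to buy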
Q-2 Explain support vector machine.
Support Vector Machine (SVM) is a supervised machine learning algorithm
commonly used for classification and regression tasks. It works by finding the
best boundary (or hyperplane) that separates data points into different classes.
Here's a breakdown of its components and working principles:
Advantages
Works well for high-dimensional data.
Effective for both linearly and non-linearly separable data using kernels.
Robust to overfitting if parameters are chosen correctly.
Disadvantages
Computationally expensive for very large datasets.
Choice of kernel and tuning parameters like C and kernel-specific
parameters (e.g., gamma for RBF) can be complex.
Applications
Text classification (e.g., spam detection).
Image recognition.
Bioinformatics (e.g., cancer classification based on gene expression
data).
SVM’s strength lies in its ability to generalize well even in cases where the
dimensionality of the data is high relative to the number of samples.
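A minimal sketch of an RBF-kernel SVM classifier, assuming scikit-learn and its bundled breast-cancer dataset; the C and gamma values are illustrative defaults, not tuned choices:

# Sketch: SVM with an RBF kernel, assuming scikit-learn; C and gamma are not tuned.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature scaling matters for SVMs, so standardize before fitting
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))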
Advantages of K-NN
Simple to understand and implement.
No assumptions about the data distribution.
Effective for small datasets with well-separated classes.
Disadvantages of K-NN
Computationally expensive: Computing distances to every training point can
be slow, especially for large datasets.
Memory-intensive: It stores all training data.
Sensitive to irrelevant features: Including noisy or irrelevant features
can distort the distance calculation.
Performance depends heavily on choice of K and distance metric.
Applications
Pattern recognition: Face, handwriting, and speech recognition.
Recommendation systems.
Medical diagnosis: Predicting diseases based on symptoms.
Example
Imagine we want to classify a new flower based on its petal length and width.
Given a dataset of flowers with known species, K-NN will:
1. Measure distances between the new flower and all existing flowers in
the dataset.
2. Identify the K nearest flowers.
3. Assign the species that appears most frequently among the K nearest
flowers.
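A minimal sketch of these three steps in Python, assuming scikit-learn and its bundled Iris data (petal length and width are two of its features); K = 3 and the query flower are arbitrary illustrative choices:

# Sketch: K-NN on the Iris flowers, assuming scikit-learn; K = 3 is arbitrary.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X = iris.data[:, 2:4]          # petal length and petal width only
y = iris.target

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3
knn.fit(X, y)                               # K-NN simply stores the training data

new_flower = [[4.5, 1.4]]       # petal length 4.5 cm, petal width 1.4 cm
pred = knn.predict(new_flower)  # majority vote among the 3 nearest flowers
print(iris.target_names[pred[0]])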
Applications of Time-Series Forecasting
Finance: Stock price prediction, risk analysis.
Retail: Sales forecasting, inventory management.
Energy: Demand forecasting, price prediction.
Weather: Temperature, precipitation prediction.
Healthcare: Patient inflow prediction, disease spread monitoring.
Time-series forecasting combines domain expertise with statistical and
computational methods to generate actionable insights for future planning
and decision-making.
Applications of Clustering
1. Market Segmentation:
o Group customers based on purchasing behavior or
demographics.
2. Image Segmentation:
o Divide an image into meaningful regions (e.g., separating objects
from the background).
3. Anomaly Detection:
o Identify outliers in data, such as fraud detection or system
failures.
4. Document Clustering:
o Organize documents or text data into related topics.
5. Genomics and Bioinformatics:
o Group genes with similar expressions or classify DNA sequences.
Visualization of Clusters
Clustering results are often visualized in 2D or 3D using techniques like:
Scatter Plots: For simple data distributions.
Dimensionality Reduction: Techniques like PCA or t-SNE to reduce high-
dimensional data to a visualizable form.
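A minimal sketch of this workflow, assuming scikit-learn and matplotlib are available; the synthetic blob data and k = 3 are placeholders:

# Sketch: cluster synthetic data with K-Means, project to 2D with PCA, scatter-plot;
# assumes scikit-learn and matplotlib; the data and cluster count are placeholders.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, n_features=5, centers=3, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
X_2d = PCA(n_components=2).fit_transform(X)   # 5-D data projected to 2-D for plotting

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("K-Means clusters visualized after PCA")
plt.show()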
Purpose of PCA:
1. Dimensionality Reduction: Simplify datasets with many variables while
retaining most of the important information.
2. Visualization: Reduce high-dimensional data to 2D or 3D for
visualization.
3. Feature Extraction: Identify and use the most important patterns or
features in the data.
Applications of PCA:
1. Image Compression: Reducing the dimensionality of image data while
preserving important visual features.
2. Data Preprocessing: Preparing data for machine learning by removing
noise or redundant information.
3. Exploratory Data Analysis: Visualizing high-dimensional data in 2D or
3D.
4. Feature Engineering: Extracting informative features from high-
dimensional datasets.
Advantages of PCA:
Reduces computational complexity.
Removes redundant features and noise.
Facilitates visualization of complex data.
Limitations of PCA:
Can lose interpretability since principal components are linear
combinations of original variables.
Sensitive to scaling; variables with larger scales dominate if not
standardized.
Assumes linear relationships between features.
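A minimal sketch of PCA in practice, assuming scikit-learn and its bundled Iris data; standardizing first addresses the scaling sensitivity noted above:

# Sketch: standardize, then reduce the 4-D Iris data to 2 principal components;
# assumes scikit-learn. Standardizing first addresses the scaling caveat above.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # put all features on one scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                  # (150, 2)
print(pca.explained_variance_ratio_)    # share of variance kept by each component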