Unit 4: Introduction to Algorithms
Classification of Algorithms
By Learning Paradigm
1. Supervised Learning
Supervised learning algorithms learn from labeled data. They are used for tasks where the goal is
to predict an output variable from input variables.
2. Unsupervised Learning
Unsupervised learning algorithms learn from unlabeled data. They are used to find hidden patterns
or intrinsic structures in the input data.
3. Semi-Supervised Learning
Semi-supervised learning algorithms learn from a mix of labeled and unlabeled data. They are used
when labeling data is expensive or time-consuming.
4. Reinforcement Learning
Reinforcement learning algorithms learn by interacting with an environment, receiving rewards or
penalties for their actions. They are used when an agent must learn a sequence of decisions, such as in
game playing or robotics.
By Type of Task
1. Classification Algorithms
Predict a discrete class label for each input.
Examples: Logistic Regression, Decision Trees, Random Forests, Support Vector Machines
(SVM), K-Nearest Neighbors (KNN), Naive Bayes, Neural Networks.
2. Regression Algorithms
Predict a continuous output value from input variables.
Examples: Linear Regression, Ridge Regression, Lasso Regression, Support Vector Regression (SVR).
3. Clustering Algorithms
Group similar data points together without using labels.
Examples: K-Means, Hierarchical Clustering, DBSCAN, Gaussian Mixture Models.
By Model Type
1. Linear Models
Assume a linear relationship between input variables and the output variable.
2. Non-Linear Models
Capture non-linear relationships between input variables and the output variable.
Examples: Decision Trees, Neural Networks, Kernel SVM, K-Nearest Neighbors (KNN).
3. Ensemble Models
Combine the predictions of multiple base models to improve accuracy and robustness.
Examples: Random Forests, Gradient Boosting, AdaBoost.
4. Probabilistic Models
Represent uncertainty by modelling probability distributions over the data.
Examples: Naive Bayes, Gaussian Mixture Models, Bayesian Networks.
By Training Style
1. Batch Learning
The model is trained on the entire dataset at once and must be retrained from scratch when new data
arrives.
2. Online Learning
The model is updated incrementally as new data arrives, making it suitable for streaming data.
3. Transfer Learning
Pre-trained models on one task are reused and fine-tuned for a different but related task.
These classifications help in understanding the various machine learning algorithms and their
appropriate use cases. Each algorithm has its strengths and weaknesses, making it suitable for specific
types of problems.
What is clustering?
Clustering is an unsupervised machine learning technique used to group similar data points
into clusters or groups. The primary goal of clustering is to identify natural groupings within a dataset
such that data points within the same cluster are more similar to each other than to those in other
clusters. Clustering is widely used in various fields such as market research, pattern recognition, image
processing, and bioinformatics.
Clustering algorithms often rely on distance measures to determine the similarity between data points.
Common distance measures include:
Euclidean Distance: The straight-line distance between two points in Euclidean space.
Manhattan Distance: The sum of the absolute differences of the coordinates.
Cosine Similarity: Measures the cosine of the angle between two vectors, indicating their orientation
similarity.
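As an illustration of these measures, the short sketch below computes each of them for two small example
vectors using NumPy and SciPy; the vectors themselves are made up purely for demonstration.
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine
# Two example feature vectors (made-up values for illustration)
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])
# Euclidean distance: straight-line distance between the two points
print("Euclidean:", euclidean(a, b))
# Manhattan distance: sum of the absolute coordinate differences
print("Manhattan:", cityblock(a, b))
# Cosine similarity: 1 minus the cosine distance, measures orientation similarity
print("Cosine similarity:", 1 - cosine(a, b))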
2. Types of Clustering
There are several types of clustering techniques, each with its own methodology:
1. Partitioning Clustering
Divides the data into non-overlapping subsets (clusters) such that each data point belongs to exactly
one subset.
K-Means Clustering: Divides the data into k clusters by minimizing the sum of squared distances
between data points and the corresponding cluster centroids.
K-Medoids (PAM): Similar to K-Means, but uses actual data points (medoids) as cluster centers to
reduce sensitivity to outliers.
2. Hierarchical Clustering
Builds a hierarchy of clusters either by merging smaller clusters into larger ones (agglomerative) or by
splitting larger clusters into smaller ones (divisive).
Agglomerative Clustering: Starts with each data point as a single cluster and iteratively merges the
closest pairs of clusters.
Divisive Clustering: Starts with all data points in one cluster and recursively splits them into smaller
clusters.
3. Density-Based Clustering
Forms clusters based on areas of high density separated by areas of low density.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Identifies clusters as dense
regions of data points and marks points in low-density regions as noise.
OPTICS (Ordering Points To Identify the Clustering Structure): An extension of DBSCAN that handles
varying densities.
4. Model-Based Clustering
Assumes that the data is generated by a mixture of underlying probability distributions and clusters are
identified based on these distributions.
Gaussian Mixture Models (GMM): Assumes data is generated from a mixture of Gaussian distributions
with unknown parameters.
5. Grid-Based Clustering
Quantizes the data space into a finite number of cells and forms clusters based on the density of data
points in these cells.
STING (Statistical Information Grid): Divides the data space into hierarchical grid structures and
performs clustering at different levels of resolution.
Applications of Clustering
Market Segmentation: Identifying distinct customer groups based on purchasing behavior.
Image Segmentation: Grouping pixels in an image based on color or intensity to identify objects or
regions.
Document Clustering: Grouping similar documents together for information retrieval or topic
modeling.
Anomaly Detection: Identifying unusual patterns or outliers in data.
Example of K-Means Clustering in Python
Here's a simple example of using K-Means clustering with the scikit-learn library:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Sample data
data = {
    'Feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Feature2': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
}
# Create DataFrame
df = pd.DataFrame(data)
# K-Means clustering with k = 2
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans.fit(df)
# Cluster labels and cluster centers
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
# Visualize the clusters and their centroids
plt.scatter(df['Feature1'], df['Feature2'], c=labels)
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=200, c='red')
plt.xlabel('Feature1')
plt.ylabel('Feature2')
plt.title('K-Means Clustering')
plt.show()
In this example, the data points are grouped into two clusters, the cluster labels and centroids are
retrieved from the fitted model, and the results are plotted with the centroids marked. This
demonstrates how clustering can be used to find natural groupings in data and visualize the results.
Types of clustering :-
Clustering, an unsupervised machine learning technique, involves grouping data points into clusters based
on their similarities. There are several types of clustering methods, each with unique approaches and use
cases. Here are the main types:
1. Partitioning Clustering
Partitioning methods divide the data into non-overlapping subsets (clusters) where each data point belongs
to exactly one cluster.
K-Means Clustering
Description: K-Means aims to partition the data into k clusters, with each data point belonging to
the cluster with the nearest mean (centroid).
Algorithm:
1. Initialize k centroids randomly.
2. Assign each data point to the nearest centroid.
3. Update centroids by calculating the mean of all points assigned to each centroid.
4. Repeat steps 2 and 3 until convergence (centroids no longer change).
Pros: Simple and fast, works well with large datasets.
Cons: Sensitive to the initial placement of centroids, requires k to be specified, sensitive to
outliers.
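To make the four steps above concrete, here is a minimal from-scratch NumPy sketch of the K-Means loop
on random synthetic data with k = 2; it is illustrative only and omits refinements (such as multiple
restarts) that library implementations like scikit-learn provide.
import numpy as np
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))          # synthetic 2-D data for illustration
k = 2
# Step 1: initialize k centroids by picking random data points
centroids = X[rng.choice(len(X), size=k, replace=False)]
for _ in range(100):
    # Step 2: assign each point to the nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 3: update each centroid to the mean of its assigned points
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # Step 4: stop when the centroids no longer change (convergence)
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
print("Final centroids:\n", centroids)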
K-Medoids (PAM)
Description: Similar to K-Means but uses actual data points (medoids) as cluster centers to reduce
sensitivity to outliers.
Algorithm:
1. Initialize k medoids randomly.
2. Assign each data point to the nearest medoid.
3. Update medoids by minimizing the sum of dissimilarities between points and their medoids.
4. Repeat steps 2 and 3 until convergence.
Pros: More robust to outliers than K-Means.
Cons: Computationally more expensive than K-Means.
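scikit-learn itself does not ship a K-Medoids estimator; the sketch below assumes the optional
scikit-learn-extra package is installed, which provides a KMedoids class with a K-Means-like interface
(synthetic data, for illustration only).
import numpy as np
from sklearn_extra.cluster import KMedoids   # requires: pip install scikit-learn-extra
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                 # synthetic 2-D data for illustration
# Fit K-Medoids with k = 2; cluster centers are actual data points (medoids)
kmedoids = KMedoids(n_clusters=2, random_state=0).fit(X)
print("Medoids:\n", kmedoids.cluster_centers_)
print("Labels:", kmedoids.labels_)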
2. Hierarchical Clustering
Hierarchical methods build a hierarchy of clusters either by merging smaller clusters into larger ones
(agglomerative) or by splitting larger clusters into smaller ones (divisive).
Agglomerative Clustering
Description: Starts with each data point as a single cluster and iteratively merges the closest pairs
of clusters.
Algorithm:
1. Compute the distance (similarity) matrix for all data points.
2. Merge the closest pair of clusters.
3. Update the distance matrix to reflect the merged clusters.
4. Repeat steps 2 and 3 until all points are in a single cluster.
Pros: Does not require the number of clusters to be specified in advance, produces a dendrogram for
visualizing the hierarchy.
Cons: Computationally expensive for large datasets, sensitive to noise and outliers.
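As an illustration of the agglomerative approach, the sketch below applies scikit-learn's
AgglomerativeClustering to a small synthetic dataset and uses SciPy to draw the dendrogram mentioned in
the pros above; the data and parameter choices are made up for demonstration.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))                      # synthetic 2-D data for illustration
# Agglomerative clustering into 3 clusters using Ward linkage
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)
print("Cluster labels:", labels)
# Dendrogram visualizing the merge hierarchy
Z = linkage(X, method='ward')
dendrogram(Z)
plt.title('Agglomerative Clustering Dendrogram')
plt.show()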
Divisive Clustering
Description: Starts with all data points in one cluster and recursively splits them into smaller
clusters.
Algorithm:
1. Treat the entire dataset as a single cluster.
2. Split the cluster into two sub-clusters.
3. Repeat step 2 for each sub-cluster until each data point is in its own cluster.
Pros: Produces a hierarchy of clusters, useful for identifying sub-cluster structures.
Cons: Computationally expensive, sensitive to noise and outliers.
3. Density-Based Clustering
Density-based methods form clusters based on areas of high density separated by areas of low density.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Description: Identifies clusters as dense regions of data points and marks points in low-density
regions as noise.
Algorithm:
1. Initialize a random point and mark it as visited.
2. Expand the cluster by including all points within a specified radius (ε) that have a
minimum number of neighbors (MinPts).
3. Repeat steps 1 and 2 for each unvisited point.
Pros: Can find arbitrarily shaped clusters, robust to outliers, does not require the number of clusters
to be specified.
Cons: Sensitive to the choice of ε and MinPts, struggles with varying densities.
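For illustration, the following sketch runs scikit-learn's DBSCAN on two synthetic "moon"-shaped
clusters; the eps and min_samples values are example settings for this toy dataset, not recommendations.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
# Two interleaving half-moon clusters with a little noise (synthetic data)
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
# eps is the neighborhood radius, min_samples corresponds to MinPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
# Points labeled -1 are treated as noise
print("Cluster labels found:", set(db.labels_))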
4. Model-Based Clustering
Model-based methods assume that the data is generated by a mixture of underlying probability distributions
and identify clusters based on these distributions.
Gaussian Mixture Models (GMM)
Description: Assumes data is generated from a mixture of Gaussian distributions with unknown
parameters.
Algorithm:
1. Initialize parameters using methods like K-Means.
2. Expectation step: Calculate the probability of each data point belonging to each Gaussian
component.
3. Maximization step: Update the parameters to maximize the likelihood of the data.
4. Repeat steps 2 and 3 until convergence.
Pros: Can model complex cluster shapes, provides probabilistic cluster memberships.
Cons: Sensitive to initialization, requires specifying the number of clusters.
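The sketch below fits a Gaussian Mixture Model with scikit-learn on synthetic data and shows the
probabilistic (soft) cluster memberships mentioned above; the number of components and the data are
chosen purely for illustration.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
# Synthetic data drawn from 3 blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
# Fit a mixture of 3 Gaussian components (EM algorithm runs internally)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
labels = gmm.predict(X)            # hard cluster assignments
probs = gmm.predict_proba(X[:5])   # soft memberships for the first 5 points
print("First 5 membership probabilities:\n", probs.round(3))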
5. Grid-Based Clustering
Grid-based methods quantize the data space into a finite number of cells and form clusters based on the
density of data points in these cells.
STING (Statistical Information Grid)
Description: Divides the data space into hierarchical grid structures and performs clustering at
different levels of resolution.
Algorithm:
1. Divide the data space into a hierarchical grid.
2. Calculate statistical information for each cell.
3. Merge cells based on the statistical information to form clusters.
4. Refine clusters by merging cells at higher levels of the hierarchy.
Pros: Efficient for large datasets, can handle high-dimensional data.
Cons: Sensitive to the choice of grid size; cluster boundaries follow the grid cells, which can reduce
precision.
6. Spectral Clustering
Spectral clustering methods use the eigenvalues (spectrum) of the similarity matrix of the data to perform
dimensionality reduction before clustering in fewer dimensions.
Description: Uses graph theory to partition data points based on their pairwise similarities.
Algorithm:
1. Construct a similarity matrix.
2. Compute the Laplacian matrix.
3. Compute the eigenvalues and eigenvectors of the Laplacian matrix.
4. Use the eigenvectors to cluster the data points.
Pros: Can handle complex cluster shapes, effective for non-convex clusters.
Cons: Computationally expensive, requires setting parameters for the similarity graph.
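As an illustration, the sketch below applies scikit-learn's SpectralClustering to two concentric
circles, a non-convex case where K-Means typically fails; the dataset and parameters are example
choices only.
from sklearn.datasets import make_circles
from sklearn.cluster import SpectralClustering
# Two concentric circles: non-convex clusters (synthetic data)
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)
# Build the similarity graph from nearest neighbors and cluster in the spectral embedding
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                        n_neighbors=10, random_state=0)
labels = sc.fit_predict(X)
print("Cluster labels found:", set(labels))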
Each clustering method has its strengths and weaknesses, making them suitable for different types of
datasets and clustering requirements. Choosing the right clustering algorithm depends on the specific
characteristics of the data and the desired outcomes.
Logistic Regression
Logistic regression is a popular supervised learning algorithm used for binary classification tasks in
machine learning. It predicts the probability of a binary outcome (two possible classes) based on one or
more input features. Despite its name, logistic regression is a classification algorithm rather than a
regression algorithm.
1. Sigmoid Function
The core of logistic regression is the sigmoid function, which maps any real-valued number into the [0, 1]
range. The sigmoid function is defined as:
σ(z) = 1 / (1 + e^(-z))
where z is the linear combination of input features. This function outputs the probability of the instance
belonging to the positive class.
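A minimal NumPy sketch of the sigmoid function, evaluated at a few example values:
import numpy as np
def sigmoid(z):
    # Map any real value into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))
# Example inputs: large negative, zero, large positive
print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # approx. [0.0067, 0.5, 0.9933]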
2. Model Representation
In logistic regression, the probability that a given input x belongs to the positive class (label 1) is
modeled as:
P(y = 1 | x) = σ(w^T x + b)
where w is the vector of model weights, b is the bias (intercept) term, x is the input feature vector,
and σ is the sigmoid function defined above.
3. Decision Boundary
The decision boundary is a threshold applied to the output probability to determine the class label.
Typically, a threshold of 0.5 is used: if P(y = 1 | x) ≥ 0.5 the instance is predicted as class 1,
otherwise it is predicted as class 0.
4. Loss Function
The loss function used in logistic regression is the binary cross-entropy (log loss). It measures the
difference between the predicted probabilities and the actual labels:
L = -(1/m) Σ [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]
where m is the number of training examples, y_i is the actual label, and ŷ_i is the predicted
probability for example i.
5. Optimization
The model parameters w are optimized by minimizing the loss function using techniques such as
gradient descent. The gradient descent algorithm iteratively updates the weights to minimize the loss
function:
w := w - α ∂L/∂w
where α is the learning rate.
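To make the update rule concrete, here is a minimal NumPy sketch of one logistic-regression training
loop using batch gradient descent on tiny made-up data; real projects would normally use a library
implementation such as the scikit-learn example that follows.
import numpy as np
# Tiny made-up dataset: 4 examples, 2 features, binary labels
X = np.array([[2.0, 1.0], [3.0, 5.0], [5.0, 8.0], [1.0, 4.0]])
y = np.array([0, 1, 1, 0])
w = np.zeros(X.shape[1])   # weights
b = 0.0                    # bias
alpha = 0.1                # learning rate
for _ in range(1000):
    z = X @ w + b
    y_hat = 1.0 / (1.0 + np.exp(-z))          # sigmoid
    grad_w = X.T @ (y_hat - y) / len(y)       # gradient of the log loss w.r.t. w
    grad_b = np.mean(y_hat - y)               # gradient w.r.t. b
    w -= alpha * grad_w                       # gradient descent update
    b -= alpha * grad_b
print("Learned weights:", w, "bias:", b)
Example of Logistic Regression in Python
The same task can be solved with scikit-learn's LogisticRegression, as shown below: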
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Sample data
data = {
'Feature1': [2, 3, 5, 7, 1, 6, 4, 8],
'Feature2': [1, 5, 8, 3, 4, 7, 2, 6],
'Label': [0, 1, 1, 0, 0, 1, 0, 1]
}
# Create DataFrame
df = pd.DataFrame(data)
# Feature and target selection
X = df[['Feature1', 'Feature2']]
y = df['Label']
# Split the data into training and test sets (75-25 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Create and train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Predictions
y_pred = model.predict(X_test)
# Evaluation
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)
In this example:
1. Data Preparation: A sample dataset is created with two features and a binary label.
2. Feature and Target Selection: The features (X) and target (y) are separated.
3. Data Splitting: The data is split into training and test sets using a 75-25 split.
4. Model Creation and Training: A LogisticRegression model is instantiated and trained using the training data.
5. Predictions: Predictions are made on the test set.
6. Model Evaluation: The accuracy, confusion matrix, and classification report are calculated to evaluate the model's performance.
Logistic regression is a fundamental and widely used algorithm in machine learning, known for its
simplicity, interpretability, and effectiveness in binary classification tasks.