Machine learning (ML) is powerful, but it comes with several challenges and
issues that researchers and practitioners need to address. Here are some key
ones:
1. Data-Related Issues
Quality of Data: ML models rely heavily on high-quality data. Incomplete,
noisy, or biased data can lead to poor performance.
Data Scarcity: Some domains lack sufficient labeled data for training robust
models.
2. Overfitting and Underfitting
Overfitting: The model performs well on training data but poorly on unseen
data because it has "memorized" rather than "learned" patterns.
Underfitting: The model is too simplistic to capture the underlying patterns in
the data, leading to poor performance on both training and test data.
3. Computational Challenges
Resource-Intensive Training: Training complex models like deep learning
requires significant computational power, time, and energy.
Scalability: Managing and processing large datasets efficiently can be
challenging.
4. Interpretability and Explainability
Complex ML models (like deep neural networks) are often considered "black
boxes," making it hard to understand how they make decisions. Lack of
interpretability can hinder trust and adoption in critical fields like healthcare or
finance.
5. Ethical and Privacy Concerns
Data Privacy: Collecting and using personal data for training models raises
privacy concerns.
Ethical Use: Misuse of ML, such as for surveillance or spreading
misinformation, poses ethical dilemmas.
6. Generalization
ML models may struggle to generalize to new, unseen data, especially if
the training data does not represent the real-world distribution.
7. Model Deployment and Maintenance
Adaptation to Change: Real-world data can change over time (data drift),
requiring frequent model updates.
Integration: Embedding ML models into existing systems and workflows can be
complex.
8. Reproducibility
Variations in implementation, data preprocessing, or random seeds can make
reproducing ML results difficult.
9. Cost
Developing and deploying ML systems can be expensive due to the need for
skilled personnel, computational resources, and data preparation.
10. Ethical AI Development
Ensuring that models are designed, trained, and used in a way that aligns with
societal values and fairness can be challenging but crucial.
UNIT-2
1. Regression
o Regression Trees
o Non-Linear Regression
o Polynomial Regression
2. Classification (e.g., Spam Filtering)
o Random Forest
o Decision Trees
o Logistic Regression
o With the help of supervised learning, the model can predict the
output on the basis of prior experiences.
o Accuracy
o Confusion Matrix
o Precision
o Recall
o F-Score
o AUC(Area Under the Curve)-ROC
Accuracy
o The accuracy metric is one of the simplest classification metrics to
implement. It is determined as the ratio of the number of correct
predictions to the total number of predictions.
o It can be formulated as:
Accuracy = Number of correct predictions / Total number of predictions
Confusion Matrix
A confusion matrix is a tabular representation of prediction
outcomes of any binary classifier, which is used to describe the
performance of the classification model on a set of test data
when true values are known.
Precision
It is the proportion of predicted positives that are actually positive. It can be
calculated as:
Precision = True Positives / (True Positives + False Positives)
Recall or Sensitivity
It is similar to the Precision metric; however, it aims to calculate the
proportion of actual positives that were identified correctly. It can be
calculated as the number of True Positives (predictions that are actually
positive) to the total number of actual positives, whether correctly predicted
as positive or incorrectly predicted as negative:
Recall = True Positives / (True Positives + False Negatives)
F-Scores
F-score or F1 Score is a metric to evaluate a binary classification model on the
basis of predictions that are made for the positive class. It is the harmonic
mean of Precision and Recall:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
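A minimal sketch of how these metrics can be computed from the four counts of a binary confusion matrix (the variable names tp, tn, fp, fn and the counts used below are illustrative, not taken from the notes):

```python
# Illustrative sketch: Accuracy, Precision, Recall and F1-Score
# computed from the cells of a binary confusion matrix.
tp, tn, fp, fn = 40, 45, 5, 10   # made-up true/false positive and negative counts

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # how many predicted positives are real
recall = tp / (tp + fn)                      # how many real positives were found
f1_score = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1-Score:  {f1_score:.2f}")
```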
Model Validation and Testing: Building a machine learning model is not enough
to get the right predictions; the model's accuracy also has to be checked, and
the model must be validated to ensure the results are precise. Validating the
model helps improve the performance of the ML model.
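As an illustration of model validation, a small sketch using k-fold cross-validation (scikit-learn is assumed to be available; the Iris dataset and the logistic-regression model are stand-ins chosen only for the example):

```python
# Sketch: validating a model with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each fold trains on 4/5 of the data and validates on the remaining 1/5.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean validation accuracy:", scores.mean())
```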
UNIT – 3
Explain Bayes theorem.
In ML, Bayes' theorem enhances classification and decision-making by
providing accurate predictions based on learned data. It helps ML systems
establish relationships between data and output, enabling revised predictions
that result in more accurate decisions and actions, even with uncertain or
incomplete data. Bayes' theorem can be derived using the product rule and the
conditional probability of event X with known event Y:
P(X|Y) = P(Y|X) · P(X) / P(Y)
where P(X|Y) is the posterior probability of X given Y, P(Y|X) is the likelihood,
P(X) is the prior probability of X, and P(Y) is the marginal probability of Y.
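A small worked example of the formula above; the diagnostic-test scenario and the probability values are purely illustrative:

```python
# Worked example of Bayes' theorem: P(X|Y) = P(Y|X) * P(X) / P(Y).
# X = "has the condition", Y = "test is positive"; numbers are made up.
p_x = 0.01                # prior: P(X)
p_y_given_x = 0.95        # likelihood: P(Y|X)
p_y_given_not_x = 0.05    # P(Y | not X)

# Evidence via the law of total probability: P(Y) = P(Y|X)P(X) + P(Y|~X)P(~X)
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)

# Posterior: probability of X given that Y was observed
p_x_given_y = p_y_given_x * p_x / p_y
print(f"P(X|Y) = {p_x_given_y:.3f}")   # about 0.161
```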
Problem Identification: The problem needs to be a well-formed problem, i.e. a
problem with well-defined goals and benefits, which has a long-term impact.
Identification of Required Data: The required data set that represents the
problem needs to be identified/evaluated. For example, if the problem is to
predict whether a tumor is malignant or benign, then the corresponding
patient data sets related to tumors are to be identified.
Data Pre-processing: The gathered data is in raw format and is not ready for
immediate analysis. All unnecessary/irrelevant data elements are removed.
This ensures the data is ready to be fed into the ML algorithm.
Definition of Training Data Set: The user should decide what kind of data set is
to be used as the training set. A set of data inputs (X) and corresponding
outputs (Y) is gathered either from human experts or from experiments. For
example, in signature analysis, the training data set might be a single
handwritten alphabet, a handwritten word or an entire line.
Algorithm Selection: Involves determining the structure of the learning
function and the corresponding learning algorithm. On the basis of various
parameters, the best algorithm for a given problem is chosen.
Training: The learning algorithm identified in the previous step is run on the
training set for further fine tuning.
Evaluation with the Test Data Set: The trained model is run on the test data
set, and its performance is measured here. If a suitable result is not obtained,
further training or parameter tuning may be required.
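A compact sketch of this process end to end (data identification, pre-processing, training-set definition, algorithm selection, training, and evaluation on a test set). The breast-cancer dataset and the k-NN classifier are illustrative stand-ins, assuming scikit-learn is available:

```python
# Sketch of the machine-learning process described above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Identification of required data (malignant vs. benign tumors)
X, y = load_breast_cancer(return_X_y=True)

# Definition of training and test data sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Data pre-processing: scale the features
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Algorithm selection and training
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Evaluation with the test data set
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```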
K Nearest Neighbor
The unknown and unlabelled data which comes for a prediction problem is
judged on the basis of the training data set elements which are similar to the
unknown element. So, the class label of the unknown element is assigned on
the basis of the class labels of the similar training data set elements (which can
be considered as neighbours of the unknown element).
Input: Training data set, test data set (or data points), value of ‘k’ (i.e. number
of nearest neighbours to be considered)
Steps:
Do for all test data points:
o Calculate the distance (usually Euclidean distance) of the test data point
from the different training data points.
o Find the closest 'k' training data points, i.e. the training data points whose
distances are least from the test data point.
o If k = 1: assign the class label of that training data point to the test data point.
o Else: whichever class label is predominantly present among the 'k' training
data points, assign that class label to the test data point.
In the kNN algorithm, the class label of a test data element is decided by the
class labels of the training data elements that neighbour it. The most common
approach adopted by kNN to measure similarity between two data elements is
the Euclidean distance.
In the k-NN algorithm, the value of ‘k’ indicates the number of neighbors that
need to be considered.
For example, if the value of k is 3, only three nearest neighbours or three
training data elements closest to the test data element are considered. Out of
the three data elements, the class which is under majority voting is considered
as the class label to be assigned to the test data. In case the value of k is 1,
only the closest training data element is considered. The class label of that data
element is directly assigned to the test data element.
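A minimal from-scratch sketch of the procedure above (Euclidean distance, pick the k closest training points, majority vote); the toy training data and k = 3 are illustrative:

```python
# Sketch of the k-NN procedure (k = 3, Euclidean distance).
import math
from collections import Counter

# Toy training data: (features, class label) -- illustrative values only.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((3.0, 3.2), "B"), ((3.1, 2.9), "B")]

def knn_predict(test_point, train, k=3):
    # Step 1: distance of the test point from every training point
    distances = [(math.dist(test_point, x), label) for x, label in train]
    # Step 2: the k training points with the smallest distances
    nearest = sorted(distances)[:k]
    # Step 3: majority vote among the neighbours' class labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((1.1, 0.9), train))   # expected "A"
```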
Strengths of the k-NN algorithm
o Extremely simple algorithm – easy to understand.
o Very effective in certain situations.
o Very fast, as almost no time is required for the training phase.
Weaknesses of the k-NN algorithm
o Does not learn anything in the real sense; classification is done completely
on the basis of the training data, so it has a heavy reliance on the training data.
o Because no model is trained in the real sense and the classification is done
directly from the training data, the classification process is very slow.
o A large amount of computational space is required to load the training data
for classification.
Decision Tree
Decision tree learning is one of the most widely adopted algorithms for
classification. As the name indicates, it builds a model in the form of a tree
structure. A decision tree is used for multi-dimensional analysis with multiple
classes. It is characterized by fast execution time and ease in the interpretation
of the rules. The goal of decision tree learning is to create a model (based on
the past data, called the past vector) that predicts the value of the output
variable based on the input variables in the feature vector. Each node (or
decision node) of a decision tree corresponds to one of the features in the
feature vector. From every node, there are edges to its children, with an edge
for each of the possible values (or range of values) of the feature associated
with the node. The tree terminates at different leaf nodes (or terminal nodes),
where each leaf node represents a possible value for the output variable. The
output variable is determined by following a path that starts at the root and is
guided by the values of the input variables.
A decision tree is usually represented in the format depicted in the figure: each
internal node (represented by boxes) tests an attribute (represented as 'A'/'B'
within the boxes), each branch corresponds to an attribute value (T/F in this
case), and each leaf node assigns a classification. The first node is called the
'Root' node, and the nodes branching from it are called 'Branch' nodes. Here 'A'
is the Root node (first node), 'B' is a Branch node, and 'T' and 'F' are Leaf nodes.
Thus, a decision tree consists of three types of nodes: Root Node, Branch Node
and Leaf Node.
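A brief sketch of decision tree learning in code (scikit-learn assumed; the Iris data, the depth limit and the text rendering are illustrative choices):

```python
# Sketch: training a decision tree classifier and printing its rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Each internal node tests one feature; each leaf node assigns a class label.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Text view of the learned tree: root node, branch nodes and leaf nodes.
print(export_text(tree, feature_names=list(iris.feature_names)))
```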
Support Vectors
SVM is a model which can do linear classification as well as regression.
Primarily, it is used for classification problems in machine learning. In SVM, a
model is built to discriminate the data instances belonging to different classes.
Let us assume, for the sake of simplicity, that the data instances are linearly
separable. In this case, when mapped in a two-dimensional space, the data
instances belonging to different classes fall on different sides of a straight line
drawn in the two-dimensional space. If the same concept is extended to a
multi-dimensional feature space, the straight line dividing data instances
belonging to different classes transforms into a hyperplane, as depicted in the
figure. The goal of the SVM algorithm is to create the best line or decision
boundary that can segregate n-dimensional space into classes so that we can
easily put a new data point in the correct category in the future. This best
decision boundary is called a hyperplane. SVM chooses the extreme
points/vectors that help in creating the hyperplane. These extreme cases are
called support vectors, and hence the algorithm is termed a Support Vector
Machine.
Hyperplane: There can be multiple lines/decision boundaries to segregate the
classes in n-dimensional space, but we need to find the best decision boundary
that helps to classify the data points. This best boundary is known as the
hyperplane of SVM. The dimensions of the hyperplane depend on the features
present in the dataset: if there are 2 features (as shown in the image), the
hyperplane will be a straight line, and if there are 3 features, the hyperplane
will be a 2-dimensional plane. We always create a hyperplane that has a
maximum margin, which means the maximum distance between the data
points.
Support Vectors: The data points or vectors that are closest to the hyperplane
and which affect the position of the hyperplane are termed Support Vectors.
Since these vectors support the hyperplane, they are called support vectors.
The working of the SVM algorithm can be understood by using an example.
Suppose we have a dataset that has two tags (green and blue), and the dataset
has two features, x1 and x2. We want a classifier that can classify the pair
(x1, x2) of coordinates as either green or blue. As it is a 2-D space, by just using
a straight line we can easily separate these two classes. But there can be
multiple lines that can separate these classes. Hence, the SVM algorithm helps
to find the best line or decision boundary; this best boundary or region is called
a hyperplane. The SVM algorithm finds the closest points of the lines from both
the classes. These points are called support vectors. The distance between the
vectors and the hyperplane is called the margin, and the goal of SVM is to
maximize this margin. The hyperplane with maximum margin is called the
optimal hyperplane.
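A short sketch of a linear SVM on two-dimensional, two-class data (scikit-learn assumed; the synthetic blobs stand in for the green/blue example and the parameter values are illustrative):

```python
# Sketch: fitting a linear SVM and inspecting its support vectors.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two clusters in a 2-D feature space (x1, x2).
X, y = make_blobs(n_samples=60, centers=2, random_state=7)

# A linear kernel learns a maximum-margin hyperplane (a straight line in 2-D).
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("Support vectors per class:", clf.n_support_)
print("Hyperplane coefficients (w):", clf.coef_, "intercept (b):", clf.intercept_)
```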
UNIT-4
Simple Linear Regression
Simple Linear Regression is a type of regression algorithm that models the
relationship between a dependent variable and a single independent variable.
The relationship shown by a Simple Linear Regression model is linear, or a
sloped straight line, hence it is called Simple Linear Regression.
y = a0 + a1x + ε
where:
a0 = the intercept of the regression line
a1 = the slope of the regression line, which tells whether the line is increasing
or decreasing
ε = the error term
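A small sketch that estimates a0 and a1 by an ordinary least-squares fit on toy data (the numbers are illustrative only):

```python
# Sketch: estimating the intercept a0 and slope a1 of a simple linear regression.
import numpy as np

# Illustrative data with a roughly linear trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Least-squares fit of a degree-1 polynomial: polyfit returns [a1, a0].
a1, a0 = np.polyfit(x, y, 1)
print(f"y = {a0:.2f} + {a1:.2f} * x")

# Prediction for a new input value
print("Prediction at x = 6:", a0 + a1 * 6)
```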
Logistic Regression
Logistic regression is another supervised learning algorithm which is used to
solve classification problems. In classification problems, we have a dependent
variable in a binary or discrete format, such as 0 or 1. The logistic regression
algorithm works with categorical variables such as 0 or 1, Yes or No, True or
False, Spam or Not Spam, etc. It is a predictive analysis algorithm which works
on the concept of probability. Logistic regression is a type of regression, but it
differs from the linear regression algorithm in how it is used. Logistic regression
uses the sigmoid function (logistic function) to model the data. The function
can be represented as:
f(x) = 1 / (1 + e^(-x))
o f(x) = output between 0 and 1
o x = input to the function
o e = base of the natural logarithm
When we provide the input values (data) to the function, it gives an S-curve.
Logistic regression uses the concept of threshold levels: values above the
threshold level are rounded up to 1, and values below the threshold level are
rounded down to 0.
There are three types of logistic regression:
o Binary (0/1, pass/fail)
o Multinomial (cats, dogs, lions)
o Ordinal (low, medium, high)
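A small sketch of the sigmoid function and the threshold rule described above (the 0.5 threshold and the sample inputs are illustrative):

```python
# Sketch: the sigmoid (logistic) function and thresholding its output at 0.5.
import math

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)), always between 0 and 1
    return 1.0 / (1.0 + math.exp(-x))

for x in [-4, -1, 0, 1, 4]:
    p = sigmoid(x)
    label = 1 if p >= 0.5 else 0   # values above the threshold map to class 1
    print(f"x = {x:+d}  f(x) = {p:.3f}  class = {label}")
```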
Assumptions in logistic regression
The following assumptions must hold when building a logistic regression
model:
o There exists a linear relationship between the logit function and the
independent variables.
o The dependent variable Y must be categorical (1/0) and take a binary value,
e.g. if pass then Y = 1; else Y = 0.
o The data meets the 'iid' criterion, i.e. the error terms, ε, are independent of
one another and identically distributed.
o The error term follows a binomial distribution [n, p], where
n = number of records in the data
p = probability of success (pass, responder)
Maximum Likelihood
Maximum Likelihood Estimation is a method of determining the parameters
(mean, standard deviation, etc.) of normally distributed random sample data,
or a method of finding the best-fitting probability density function over the
random sample data. The term maximum likelihood means that we are
maximizing the likelihood function, called the maximization of the likelihood
function. Maximum likelihood estimation is the basis of some machine learning
and deep learning approaches used for classification problems.
o Maximum likelihood is a function that describes the data points and their
likeliness to the model for best fitting.
o Maximum likelihood is different from probabilistic methods, where
probabilistic methods work on the principle of calculating probabilities. In
contrast, the likelihood method tries to maximize the likelihood of the data
observations according to the data distribution.
o Maximum likelihood is an approach used for solving problems like density
distribution and is a base for some algorithms like logistic regression.
o The approach is very similar to, and is predominantly known as, the
perceptron trick in terms of deep learning methods.
We calculate the likelihood based on conditional probabilities:
L = P([X1 = x1], [X2 = x2], …, [Xn = xn] | P) = Π (i = 1 to n) P^xi (1 − P)^(1 − xi)
where,
L -> Likelihood value
F -> Probability distribution function
P -> Probability
X1, X2, … Xn -> random sample of size n taken from the whole population
x1, x2, … xn -> values that these random samples (Xi) take when determining
the PDF
Π -> product from i = 1 to n
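A short sketch of maximum likelihood estimation for this Bernoulli case, evaluating L(P) = Π P^xi (1 − P)^(1 − xi) over a grid of candidate P values (the 0/1 sample below is illustrative):

```python
# Sketch: maximum likelihood estimation of the Bernoulli parameter P.
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # illustrative 0/1 sample

# Likelihood L(P) = prod( P^xi * (1-P)^(1-xi) ) evaluated on a grid of P values
grid = np.linspace(0.01, 0.99, 99)
likelihood = np.array([np.prod(p**x * (1 - p)**(1 - x)) for p in grid])

p_mle = grid[np.argmax(likelihood)]
print("MLE of P from the grid:", round(p_mle, 2))
print("Closed-form MLE (sample mean):", x.mean())   # 6/8 = 0.75
```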
Unit – 5
What is Clustering and different types of clustering?
The task of grouping data points based on their similarity with each other is
called Clustering or Cluster Analysis. Clustering aims at forming groups of
homogeneous data points from a heterogeneous dataset. It evaluates the
similarity based on a metric like Euclidean distance, cosine similarity,
Manhattan distance, etc., and then groups the points with the highest
similarity score together.
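A tiny sketch of the similarity/distance metrics mentioned above, computed for two illustrative points:

```python
# Sketch: Euclidean distance, Manhattan distance and cosine similarity.
import math

a = [1.0, 2.0, 3.0]
b = [2.0, 0.0, 4.0]

euclidean = math.dist(a, b)                                   # straight-line distance
manhattan = sum(abs(x - y) for x, y in zip(a, b))             # sum of coordinate differences
cosine = sum(x * y for x, y in zip(a, b)) / (
    math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
)                                                             # angle-based similarity

print(f"Euclidean: {euclidean:.3f}  Manhattan: {manhattan:.1f}  Cosine: {cosine:.3f}")
```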
1. K-Means Clustering
K-Means is one of the most popular clustering algorithms due to its simplicity
and efficiency. It works by partitioning data points into a predefined number of
clusters (denoted by K). The algorithm proceeds as follows:
o Choose K initial centroids (either randomly or by other methods like
K-Means++).
o Assign each data point to the nearest centroid.
o Recalculate the centroids based on the assigned data points.
o Repeat the process until convergence (i.e., the centroids no longer change
significantly).
K-Means is fast and works well when the clusters are spherical, but it may
struggle with clusters of different shapes or sizes.
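A brief sketch of the steps above using scikit-learn's KMeans (synthetic blob data and K = 3 are illustrative choices):

```python
# Sketch: K-Means with K = 3 on synthetic 2-D data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# k-means++ initialisation; assignment and centroid updates run until convergence.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)

print("Final centroids:\n", km.cluster_centers_)
print("Labels of the first 10 points:", km.labels_[:10])
```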
2. Hierarchical Clustering
Hierarchical clustering builds a tree-like structure called a dendrogram, which
shows how clusters are nested within each other. There are two main types:
o Agglomerative (bottom-up): Each data point starts as its own cluster, and
the closest clusters are merged iteratively until all points belong to a single
cluster.
o Divisive (top-down): All points are initially in one cluster, which is then
recursively split into smaller clusters.
Agglomerative hierarchical clustering is more commonly used, and the method
can be visualized by cutting the dendrogram at a certain level to define the
desired number of clusters. It does not require the number of clusters to be
predefined, but it can be computationally expensive for large datasets.
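A small sketch of agglomerative (bottom-up) clustering; cutting the hierarchy at three clusters and using Ward linkage are illustrative choices:

```python
# Sketch: agglomerative hierarchical clustering cut at 3 clusters.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=1)

# Each point starts as its own cluster; the closest clusters are merged iteratively.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)

print("Cluster labels of the first 10 points:", labels[:10])
```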
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is a density-based clustering algorithm that groups points based on
their density in the feature space. It works by identifying regions of high point
density, which are considered as clusters. Points in low-density areas are
labeled as noise or outliers. The main parameters in DBSCAN are:
o Epsilon (ε): The maximum distance between two points for them to be
considered neighbors.
o MinPts: The minimum number of points required to form a dense region
(a cluster).
DBSCAN is robust to outliers and can find clusters of arbitrary shape, making it
more flexible than K-Means. However, it is sensitive to the choice of
parameters.
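A short sketch of DBSCAN; the eps and min_samples values shown are illustrative and would normally need tuning for real data:

```python
# Sketch: DBSCAN with illustrative parameters (eps = 0.3, min_samples = 5).
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters that K-Means typically cannot separate.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

db = DBSCAN(eps=0.3, min_samples=5).fit(X)
labels = db.labels_                      # -1 marks points labelled as noise

print("Clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("Noise points:", list(labels).count(-1))
```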
4. Mean Shift Clustering
Mean Shift is a non-parametric clustering algorithm that works by shifting the
centroid of data points iteratively towards regions of higher density. It does not
require the number of clusters to be specified in advance. The algorithm works
by computing the mean of the points within a window (called the kernel), and
the centroid is moved to this mean. This process continues until the centroids
converge. It can detect clusters of arbitrary shapes and is less sensitive to
outliers than K-Means, but it can be computationally expensive.
5. Gaussian Mixture Models (GMM)
Gaussian Mixture Models are a probabilistic clustering technique based on the
assumption that the data is generated from a mixture of several Gaussian
distributions. The model estimates the parameters of the distributions (mean,
covariance, and weight) using the Expectation-Maximization (EM) algorithm.
GMM is more flexible than K-Means because it allows for clusters with different
shapes and densities. It is particularly useful when the data exhibits a
probabilistic distribution.
6. Affinity Propagation
Affinity Propagation is a clustering algorithm that does not require the number
of clusters to be specified. It works by exchanging messages between data
points to find "exemplars" (representative points) that best describe the
clusters. The algorithm uses similarity between points to define clusters, and
each point sends and receives messages iteratively. It tends to be more
computationally expensive than K-Means but can produce better results for
certain types of data.
Applications of Clustering
o In Identification of Cancer Cells
o In Search Engines
o Customer Segmentation
o In Biology
o In Land Use
Apriori Algorithm
To improve the efficiency of level-wise generation of frequent itemsets, an
important property called the Apriori property is used, which helps by reducing
the search space.
Apriori Property –
All non-empty subsets of a frequent itemset must be frequent. The key concept
of the Apriori algorithm is the anti-monotonicity of the support measure.
Apriori assumes that:
o All subsets of a frequent itemset must be frequent (Apriori property).
o If an itemset is infrequent, all its supersets will be infrequent.
Steps for Apriori Algorithm
Below are the steps for the Apriori algorithm:
Step-1: Determine the support of the itemsets in the transactional database,
and select the minimum support and confidence.
Step-2: Take all the itemsets in the transactions that have a support value
higher than the minimum (selected) support value.
Step-3: Find all the rules from these subsets that have a confidence value
higher than the threshold or minimum confidence.
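A compact pure-Python sketch of the level-wise idea: count supports, keep itemsets above a minimum support, and use the Apriori property so that supersets of infrequent itemsets are never explored. The toy transactions, the 0.5 threshold, and the simplified candidate-generation step are illustrative only:

```python
# Sketch: level-wise frequent-itemset mining using the Apriori property.
from itertools import combinations

transactions = [{"milk", "bread"}, {"milk", "diaper", "beer"},
                {"milk", "bread", "diaper"}, {"bread", "diaper"}]
min_support = 0.5                                   # illustrative threshold

def support(itemset):
    # fraction of transactions containing the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted({i for t in transactions for i in t})
frequent, k = {}, 1
candidates = [frozenset([i]) for i in items]
while candidates:
    # keep only candidates whose support meets the minimum support
    level = {c: support(c) for c in candidates if support(c) >= min_support}
    frequent.update(level)
    k += 1
    # simplified candidate generation: build k-itemsets only from items that
    # appear in frequent (k-1)-itemsets (Apriori property prunes the rest)
    candidates = [frozenset(c) for c in
                  combinations(sorted({i for f in level for i in f}), k)]

for itemset, s in frequent.items():
    print(set(itemset), round(s, 2))
```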