
TOD 212 - Digging Through Data - PPT - For Students - Monsoon 2023 (Autosaved)


Ahmedabad University

Monsoon 2023
TOD212 - Decision Sciences

Digging through Data


Digging through Data
Data Classification
Data classification is the process of separating and organizing data into
relevant groups (“classes”) based on their shared characteristics.

For example, classification models can be used to determine whether a customer is likely to purchase more items or not. If the classification model predicts a greater likelihood that they are about to make more purchases, then you might want to send them promotional offers and discounts accordingly.
Data Mining
Knowledge Discovery in Data (KDD)

Data mining, in general terms, means mining or digging deep into data that exists in different forms to find patterns, and to gain knowledge from those patterns.

In the process of data mining, large data sets are first sorted, then patterns are
identified and relationships are established to perform data analysis and solve
problems.
What is Data Mining?
• Data mining is an automatic or semi-automatic technical process that analyses large amounts of scattered information to make sense of it and turn it into knowledge.

• It looks for anomalies, patterns, or correlations among millions of records to predict results.

• The volume of information continues to grow. In this context, data mining is a strategic practice considered important by almost 80% of organizations.

• With the joint action of analytics and data mining, which combines statistics, artificial intelligence, and machine learning, companies can create models to discover connections between millions of records.
Some Common approaches in Data Mining
• Cluster Analysis

• Classification

• Association

• Cause-and-effect modeling
Some Common approaches in Data Mining
• Understanding – group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
• Summarization – reduce the size of a large data set
Classification:
• Classification methods seek to classify a categorical outcome
into two or more categories based on various data attributes.
• For each record in a database, we have a categorical
variable of interest and several additional predictor
variables.
• For a given set of predictor variables, we would like to assign
the best value of the categorical variable.
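As a rough illustration (field names borrowed from the credit example later in these slides, values made up), such a data set can be represented in Python as records of predictor variables plus one categorical variable of interest:

```python
# Hypothetical records: predictor variables plus a categorical variable of
# interest ("Decision") that a classification model would try to predict.
records = [
    {"Homeowner": 1, "Credit Score": 725, "Revolving Balance": 11500, "Decision": "Approve"},
    {"Homeowner": 0, "Credit Score": 573, "Revolving Balance": 4200,  "Decision": "Reject"},
    {"Homeowner": 1, "Credit Score": 677, "Revolving Balance": 9800,  "Decision": "Approve"},
]

# The classification task: given the predictors of a new record,
# assign the best value of the categorical variable.
new_record = {"Homeowner": 0, "Credit Score": 625, "Revolving Balance": 6100}
```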
Classification Techniques:
• Two different data mining approaches used for classification: k-nearest neighbors (k-NN) and discriminant analysis.

• k-NN: the algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories. It stores all the available data and classifies a new data point based on that similarity, so when new data appears it can be easily classified into a well-suited category.

• Discriminant analysis: uses predefined classes based on a set of linear discriminant functions of the predictor variables.
k-nearest neighbors
• k-NN algorithm: finds records in a database with similar numerical values of a set of predictor variables.

• The k-nearest neighbors (k-NN) algorithm is a classification scheme that attempts to find records in a database similar to the one we wish to classify.

• The similarity is based on the “closeness” of a record to the numerical predictors in the other records, using normalized Euclidean distances.

Suppose there are two categories, Category A and Category B, and we have a new data point x1; we want to determine in which of these categories the new point lies. To solve this type of problem, we need the k-NN algorithm. With the help of k-NN, we can quickly identify the category or class of a particular data point.
• Firstly, we will choose the number of neighbors; here we choose k = 5.

• Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2) it can be calculated as:

distance = √((x2 − x1)² + (y2 − y1)²)

By calculating the Euclidean distances we get the nearest neighbors: three nearest neighbors in Category A and two nearest neighbors in Category B.
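This calculation can be sketched in Python. The slide's actual plotted points are not reproduced here, so the coordinates below are made-up values chosen only to mirror the three-to-two outcome described above:

```python
import math

def euclidean_distance(p, q):
    """Distance between two points (x1, y1) and (x2, y2)."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

# Hypothetical labelled points standing in for the slide's figure.
points = [
    ((1.0, 2.0), "A"), ((2.0, 3.0), "A"), ((3.0, 3.5), "A"),
    ((6.0, 7.0), "B"), ((7.0, 8.0), "B"), ((8.0, 6.5), "B"),
]
x1 = (3.0, 4.0)  # new data point to classify

# Sort by distance to x1 and keep the k = 5 nearest neighbours.
k = 5
nearest = sorted(points, key=lambda pt: euclidean_distance(x1, pt[0]))[:k]
counts = {label: sum(1 for _, lab in nearest if lab == label) for label in ("A", "B")}
print(counts)  # {'A': 3, 'B': 2} -> x1 is assigned to Category A
```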
How does K-NN work?
How k-NN works can be explained with the following algorithm:
Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance from the new data point to each existing data point.
Step-3: Take the K nearest neighbors as per the calculated Euclidean distances.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
Step-6: Our model is ready.
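These steps can be collected into a small, self-contained Python sketch; the training points and the choice of k are placeholders, not data from the slides:

```python
import math
from collections import Counter

def knn_classify(new_point, data, k=5):
    """Classify new_point by majority vote among its k nearest neighbours.

    data is a list of (point, category) pairs, where each point is a tuple
    of numeric predictor values.
    """
    # Step 2: Euclidean distance from the new point to every existing point.
    def distance(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # Step 3: take the k nearest neighbours.
    neighbours = sorted(data, key=lambda item: distance(new_point, item[0]))[:k]

    # Steps 4-5: count the categories and assign the majority category.
    votes = Counter(category for _, category in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical usage (step 1: choose k).
training = [((1, 2), "A"), ((2, 3), "A"), ((3, 3.5), "A"),
            ((6, 7), "B"), ((7, 8), "B")]
print(knn_classify((3, 4), training, k=3))  # -> "A"
```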
k-Nearest Neighbor Rules
• The nearest neighbor to a record is the one that has the smallest distance from it.
– If 𝑘 = 1, then the 1-NN rule classifies a record in the same category as its
nearest neighbor.
– The k-NN rule finds the k nearest neighbors to each record we want to classify and
then assigns the classification of the majority of those k nearest neighbors.

• Typically, various values of k are used, and then results are inspected to determine
which is best.
• There is no particular way to determine the best value for "𝑘", so we need to try
some values to find the best out of them. The most preferred value for 𝑘 is 5.
• A very small value for k, such as k = 1 or k = 2, can be noisy and lead to the effects of outliers in the model.
• Larger values for k reduce the effect of noise, but if k is too large the neighborhood may pull in too many points from other categories.
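One common way to inspect several values of k, sketched here under the assumption that scikit-learn is available and using synthetic data rather than the course data, is to compare cross-validated accuracy:

```python
# Compare several k values by 5-fold cross-validated accuracy (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for a real predictor matrix.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=5)
    print(f"k={k}: mean accuracy = {scores.mean():.3f}")
```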
Using k-NN for Classifying Credit-Approval Decisions
• Credit Approval Decisions Classification Data
• Consider the first new record, 51. If k = 1, the record having the minimum
distance from record 51 is record 27. Since the credit decision was to
approve, we would classify record 51 as an approval.
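The credit-approval spreadsheet itself is not reproduced in these notes, so the following Python sketch uses made-up values for the predictors named later in the slides (Homeowner, Credit Score, Years of Credit History, Revolving Balance, Revolving Utilization). It normalizes each predictor and classifies the new record by its single nearest neighbor, i.e. k = 1:

```python
import math

# Hypothetical training records: record id -> (predictor values, decision).
# The real values from the slides are not reproduced here.
columns = ["Homeowner", "Credit Score", "Years of Credit History",
           "Revolving Balance", "Revolving Utilization"]
records = {
    27: ([1, 745, 18, 11500, 0.25], "Approve"),
    28: ([0, 565,  4,  9300, 0.72], "Reject"),
    29: ([1, 690, 10,  4100, 0.40], "Approve"),
}
new_record = [1, 730, 15, 12000, 0.30]   # stand-in for "record 51"

# Normalize each predictor to [0, 1] so no single scale dominates the distance.
all_rows = [v for v, _ in records.values()] + [new_record]
lo = [min(r[i] for r in all_rows) for i in range(len(columns))]
hi = [max(r[i] for r in all_rows) for i in range(len(columns))]

def normalize(r):
    return [(r[i] - lo[i]) / (hi[i] - lo[i]) if hi[i] > lo[i] else 0.0
            for i in range(len(columns))]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(normalize(a), normalize(b))))

# k = 1: the record with the minimum distance determines the classification.
nearest_id = min(records, key=lambda rid: distance(records[rid][0], new_record))
print(nearest_id, records[nearest_id][1])   # e.g. 27 Approve
```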
Discriminant Analysis
• Discriminant analysis is another classification method.
• It is a technique for classifying a set of observations into predefined
classes.
• The purpose is to determine the class of an observation based on a set
of predictor variables.

• e.g. classifying applications for loans, credit cards, and insurance into
low- and high-risk categories
• With only two classification groups, we can apply regression
analysis. Unfortunately, when there are more than two, linear
regression cannot be applied, and special software must be used.
Classifying Credit Decisions Using Discriminant Analysis
• For the credit-approval data, model the decision (approve or reject) as a function of
the other variables. Use the following regression model, where Y represents the
decision (0 or 1):

• 𝑌 = 𝑏0 + (𝑏1 × Homeowner) + (𝑏2 × Credit Score) + (𝑏3 × Years of Credit History) + (𝑏4 × Revolving Balance) + (𝑏5 × Revolving Utilization)
• The estimated value of the decision variable is called a discriminant score.
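A hedged sketch of this idea, using a made-up miniature data set in place of the slides' spreadsheet: fit a linear least-squares regression of the 0/1 decision on the predictors, and treat each fitted value as that record's discriminant score.

```python
import numpy as np

# Hypothetical predictor matrix (Homeowner, Credit Score, Years of Credit
# History, Revolving Balance, Revolving Utilization) and 0/1 decisions;
# the actual values from the slides are not reproduced here.
X = np.array([
    [1, 745, 18, 11500, 0.25],
    [1, 690, 10,  4100, 0.40],
    [1, 720, 12,  8000, 0.30],
    [0, 700,  9,  6000, 0.35],
    [0, 565,  4,  9300, 0.72],
    [0, 610,  6,  7200, 0.65],
    [1, 580,  3, 10400, 0.80],
    [0, 600,  5,  9900, 0.70],
], dtype=float)
y = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=float)   # 1 = approve, 0 = reject

# Least-squares fit of Y = b0 + b1*x1 + ... + b5*x5.
A = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

# The fitted value for each record is its discriminant score.
scores = A @ coeffs
print(np.round(scores, 4))
```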
Discriminant Analysis

• The estimated regression function is
• 𝑌 = 0.567 + (0.149 × Homeowner) + (0.000465 × Credit Score) + (0.00420 × …
Discriminant Analysis
Rule for classifying observations using the discriminant scores:
• Compute a cut-off value so that if a discriminant score is less than or
equal to it, the observation is assigned to one group; otherwise, it is
assigned to the other group.
• One simple way is to use the midpoint of the average discriminant
scores:
• Cut-Off Value = (0.9083 + 0.0781)/2 = 0.4932
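Continuing the sketch, and assuming the higher average discriminant score (0.9083) belongs to the approved group, i.e. approvals were coded as Y = 1, the rule can be applied as follows:

```python
# Midpoint cut-off between the two average discriminant scores from the slide.
cutoff = (0.9083 + 0.0781) / 2          # = 0.4932

def classify(score, cutoff=cutoff):
    # Assumes approvals were coded as Y = 1, so higher scores mean "approve".
    return "Approve" if score > cutoff else "Reject"

print(classify(0.73))   # Approve
print(classify(0.21))   # Reject
```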
