TOD 212 - Digging Through Data - PPT - For Students - Monsoon 2023 (Autosaved)
Monsoon 2023
TOD 212 - Decision Sciences
In general terms, data mining means digging deep into data held in different
forms to find patterns and to gain knowledge from those patterns.
In the data mining process, large data sets are first sorted, then patterns are
identified and relationships are established in order to analyse the data and solve
problems.
What is Data Mining?
• Data mining is an automatic or semi-automatic technical process
that analyses large amounts of scattered information to make sense of it
and turn it into knowledge.
• By combining analytics with data mining, which draws on statistics,
artificial intelligence, and machine learning, companies can build models
that discover connections among millions of records.
Some Common approaches in Data Mining
• Cluster Analysis
• Classification
• Association
• Cause-and-effect modeling
Some Common approaches in Data Mining
• Understanding – Group related documents for browsing, group genes and
proteins that have similar functionality, or group stocks with similar price
fluctuations
• Summarization – Reduce the size of large data sets
Classification:
• Classification methods seek to classify a categorical outcome
into two or more categories based on various data attributes.
• For each record in a database, we have a categorical
variable of interest and several additional predictor
variables.
• For a given set of predictor variables, we would like to assign
the best value of the categorical variable.
Classification Techniques:
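• Two data mining approaches are used for classification: k-nearest neighbors
(k-NN) and discriminant analysis.
• k-NN assigns a record to the class most common among its k nearest records.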
• Typically, several values of k are tried, and the results are inspected to determine
which works best.
• There is no formula for the best value of k, so we need to try several values and
compare them. A common starting value is k = 5.
• A very small value of k, such as k = 1 or k = 2, makes the classification noisy
and sensitive to outliers.
• Larger values of k smooth out noise, but they increase computation and can blur
the boundaries between classes.
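The trial-and-error choice of k can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy data, the function names `knn_predict` and `loo_accuracy`, the use of Euclidean distance, and the leave-one-out accuracy score are all choices made for this example, not part of the course material.

```python
import math
from collections import Counter

def knn_predict(train, labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Rank training indices by Euclidean distance to the query.
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def loo_accuracy(train, labels, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i in range(len(train)):
        rest = train[:i] + train[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        hits += knn_predict(rest, rest_labels, train[i], k) == labels[i]
    return hits / len(train)

# Hypothetical toy data: two predictors per record, label 1 = approve, 0 = reject.
X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.3), (1.1, 1.1), (3.0, 3.1),
     (3.2, 2.9), (2.8, 3.3), (3.1, 3.0), (1.0, 3.0), (3.0, 1.0)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]

# Try several odd values of k (odd values avoid tied votes with two classes).
scores = {k: loo_accuracy(X, y, k) for k in (1, 3, 5, 7)}
best_k = max(scores, key=scores.get)
for k, acc in scores.items():
    print(f"k={k}: leave-one-out accuracy = {acc:.2f}")
print("best k on this toy data:", best_k)
```

On real data, the predictors would normally be put on comparable scales first, since distances are otherwise dominated by the variable with the largest range.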
Using k-NN for Classifying Credit-Approval Decisions
• Credit Approval Decisions Classification Data
• Consider the first new record, 51. If k = 1, the record having the minimum
distance from record 51 is record 27. Since the credit decision for record 27
was to approve, we would classify record 51 as an approval.
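The k = 1 step, finding the single closest record and copying its decision, might look like the sketch below. The record ids, predictor values, and the helper `nearest_record` are hypothetical and are not taken from the credit-approval table; Euclidean distance on pre-scaled predictors is also an assumption.

```python
import math

# Hypothetical records (NOT the course data set): id -> (predictors, decision).
# Predictors are assumed to be already scaled to comparable ranges.
records = {
    "A": ((0.8, 0.2, 0.5), "approve"),
    "B": ((0.1, 0.9, 0.4), "reject"),
    "C": ((0.6, 0.4, 0.7), "approve"),
    "D": ((0.2, 0.8, 0.1), "reject"),
}
new_record = (0.75, 0.25, 0.55)

def nearest_record(records, query):
    """Return (id, distance) of the stored record closest to `query`."""
    return min(
        ((rid, math.dist(features, query)) for rid, (features, _) in records.items()),
        key=lambda pair: pair[1],
    )

rid, dist = nearest_record(records, new_record)
decision = records[rid][1]  # with k = 1, the neighbour's decision is the prediction
print(f"nearest record: {rid} (distance {dist:.3f}) -> classify as {decision}")
```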
Discriminant Analysis
• Discriminant analysis is another classification method.
• It is a technique for classifying a set of observations into predefined
classes.
• The purpose is to determine the class of an observation based on a set
of predictor variables.
• e.g. classifying applications for loans, credit cards, and insurance into
low- and high-risk categories
• With only two classification groups, we can apply regression
analysis. Unfortunately, when there are more than two, linear
regression cannot be applied, and special software must be used.
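For the two-group case, the regression idea can be sketched as follows: code the groups as 0 and 1, fit a least-squares line, and classify by comparing the fitted value with a cutoff. The single predictor, the toy data, the function names `fit_line` and `classify`, and the 0.5 cutoff are assumptions made for illustration; a real model would normally use several predictors.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x (one predictor)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1

def classify(x_new, b0, b1, cutoff=0.5):
    """Assign class 1 when the fitted value reaches the cutoff, else class 0."""
    return 1 if b0 + b1 * x_new >= cutoff else 0

# Hypothetical data: one predictor (say, a credit score) and a 0/1 decision.
score = [520, 560, 600, 640, 680, 720, 760, 800]
approved = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_line(score, approved)
print(f"fitted model: Y = {b0:.3f} + {b1:.5f} * score")
for s in (550, 700, 780):
    print(s, "->", classify(s, b0, b1))
```

The fitted value is used only as a score for separating the two groups. It is not a probability and can fall outside the range 0 to 1.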
Classifying Credit Decisions Using Discriminant Analysis
• For the credit-approval data, model the decision (approve or reject) as a function of
the other variables. Use the following regression model, where Y represents the
decision (0 or 1):