Module 1 Data Mining
Contents
IM 3
“Shine and Serve with Honor and Excellence.”
Romblon State University
San Fernando, Romblon
Technology Education Department
Module 1
Course: IM 3 Fundamentals of Data Warehousing and Data Mining
Unit No. 1
Topic: Introduction
Score:
Name:
Year & Section:
Date:
This chapter gives a brief introduction to data mining. The discussion covers the definitions of
data mining, the stages of the data mining process, and its models. It also briefly describes
data mining methods and some applications and examples of data mining.
Learning Objectives:
At the end of the lesson, you should be able to:
1. Explain the fundamental principles of data mining
2. Discuss the evolving role of data mining in several application areas and industries
3. Justify the potential use or application of data mining in unexplored areas
CONTENT
Data mining is defined as extracting information from huge sets of data. In other words, data
mining is the procedure of mining knowledge from data. The information or knowledge extracted
in this way can be used for any of the following applications:
Market Analysis
Fraud Detection
Customer Retention
Production Control
Scientific Exploration
1.1.2. Need for turning data into knowledge – Drowning in data, but starving for knowledge
Market Analysis
Fraud Detection
Customer Retention
Production Control
Scientific Exploration
“Gold mining from rock or sand” is the same as “knowledge mining from data”.
Other terms for Data Mining:
Knowledge Mining
Knowledge Extraction
Pattern Analysis
Data Archeology
Data Dredging
Several major data mining techniques have been developed and used in data mining projects in
recent years, including:
Association;
Classification;
Clustering;
Prediction;
Sequential patterns; and
Decision trees.
We will briefly examine those data mining techniques in the following sections.
Association
Association is one of the best-known data mining techniques. In association, a pattern is
discovered based on a relationship between items in the same transaction. That is why the
association technique is also known as the relation technique. It is used in market basket
analysis to identify sets of products that customers frequently purchase together.
Retailers use the association technique to study customers' buying habits. Based on historical
sales data, a retailer might find that customers usually buy crisps when they buy beer and can
therefore place beer and crisps next to each other to save customers time and increase sales.
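The beer and crisps example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transactions are invented, and the helper names `pair_support` and `confidence` are hypothetical. It counts how often two items appear in the same basket (support) and how often a basket with one item also contains the other (confidence).

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set is one customer's basket.
transactions = [
    {"beer", "crisps", "bread"},
    {"beer", "crisps"},
    {"bread", "milk"},
    {"beer", "crisps", "milk"},
    {"bread", "crisps"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair a single canonical order, e.g. ("beer", "crisps").
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Estimate how often a basket with `antecedent` also contains `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

support = pair_support(transactions)
beer_crisps_support = support[("beer", "crisps")]           # 3 of 5 baskets
beer_to_crisps = confidence(transactions, "beer", "crisps")  # every beer basket has crisps
crisps_to_beer = confidence(transactions, "crisps", "beer")  # 3 of 4 crisps baskets
```

In this made-up data, the rule "beer implies crisps" is stronger than the reverse rule, which is the kind of asymmetry a retailer would look at before rearranging shelves.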
Classification
Clustering
Clustering is a data mining technique that automatically groups objects with similar
characteristics into meaningful or useful clusters. The clustering technique defines the classes
and puts objects in each class, whereas in classification, objects are assigned to predefined
classes. To make the concept clearer, take book management in a library as an example. A library
holds a wide range of books on various topics. The challenge is how to arrange those books so
that readers can pick up several books on a particular topic without hassle. By using the
clustering technique, we can keep books that have some kind of similarity in one cluster, or on
one shelf, and label it with a meaningful name. Readers who want books on that topic only have to
go to that shelf instead of searching the entire library.
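One common way to form such clusters automatically is the k-means algorithm. The text above does not name a specific algorithm, so treat the following as one possible sketch rather than the method intended here. The points, the starting centroids, and the function name `k_means` are assumptions made for illustration; each point stands for an object described by two numeric features.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Assign each point to its nearest centroid, then move each centroid
    to the mean of its members; repeat until the centroids stop changing."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical objects with two numeric features each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
```

The first three points end up in one cluster and the last three in another. Unlike classification, no class labels were supplied in advance; the groups emerge from the distances between the points.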
Prediction
Prediction, as its name implies, is a data mining technique that discovers the relationship
between independent variables and the relationship between dependent and independent variables.
For instance, prediction analysis can be used in sales to predict future profit: if we treat
sales as the independent variable, profit is the dependent variable. Then, based on historical
sales and profit data, we can fit a regression curve that is used for profit prediction.
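As a minimal sketch of fitting such a curve, the following code fits a straight line by ordinary least squares. A straight line is the simplest possible regression curve and is an assumption here; the sales and profit figures are invented, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: sales is the independent variable, profit the dependent one.
sales = [100.0, 200.0, 300.0, 400.0]
profit = [15.0, 35.0, 55.0, 75.0]

slope, intercept = fit_line(sales, profit)
predicted_profit = slope * 500.0 + intercept  # forecast for a sales level not yet observed
```

Because the invented data lie exactly on a line, the fit recovers a slope of about 0.2 and an intercept of about -5, so the forecast for sales of 500 is about 95. Real data would scatter around the fitted line.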
Sequential Patterns
Sequential pattern analysis is a data mining technique that seeks to discover or identify
similar patterns, regular events, or trends in transaction data over a business period.
In sales, businesses can use historical transaction data to identify sets of items that customers
buy together at different times of the year. They can then use this information to offer
customers better deals on those items based on their past purchasing frequency.
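A very small version of this idea is to count, across customers, how often one item is bought before another. The sketch below makes several assumptions: the purchase histories are invented, each list is already in time order, and `sequential_pair_counts` is a hypothetical helper, not a standard algorithm name.

```python
from collections import Counter

def sequential_pair_counts(sequences):
    """Count how many customers bought `first` at some point before `later`."""
    counts = Counter()
    for seq in sequences:
        seen_pairs = set()  # count each ordered pair at most once per customer
        for i, first in enumerate(seq):
            for later in seq[i + 1:]:
                if first != later:
                    seen_pairs.add((first, later))
        counts.update(seen_pairs)
    return counts

# Hypothetical purchase histories, one list per customer, in time order.
histories = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone"],
    ["phone", "case"],
]

counts = sequential_pair_counts(histories)
```

Here "phone then case" occurs for two customers while "case then phone" occurs for one, which suggests offering a case to customers who have just bought a phone rather than the other way around.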
Decision trees
A decision tree is one of the most commonly used data mining techniques because its model is easy
for users to understand. In the decision tree technique, the root of the tree is a simple question
or condition that has multiple answers. Each answer then leads to a further set of questions or
conditions that help us narrow down the data so that we can make the final decision based on it.
For example, we use the following decision tree to determine whether or not to play tennis:
Starting at the root node, if the outlook is overcast, then we should definitely play tennis. If it
is rainy, we should play tennis only if the wind is weak. And if it is sunny, then we should play
tennis only if the humidity is normal.
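The tennis rules just described translate directly into nested conditions. The sketch below writes them as a Python function; the function name `play_tennis` and the lowercase string values are assumptions chosen for illustration.

```python
def play_tennis(outlook, humidity, wind):
    """Walk the tree: outlook is tested at the root, then wind or humidity."""
    if outlook == "overcast":
        return True                    # overcast: always play
    if outlook == "rainy":
        return wind == "weak"          # rainy: play only in weak wind
    if outlook == "sunny":
        return humidity == "normal"    # sunny: play only in normal humidity
    raise ValueError(f"unknown outlook: {outlook!r}")
```

For example, `play_tennis("rainy", "high", "weak")` returns `True`, while `play_tennis("sunny", "high", "weak")` returns `False`. In practice, a data mining tool learns such a tree from historical records instead of having it written by hand.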
We often combine two or more of these data mining techniques to form an appropriate process that
meets the business needs.
1. Classification analysis
This analysis is used to retrieve important and relevant information about data and metadata. It
is used to classify different data into different classes. Classification is similar to clustering
in that it also segments data records into different segments called classes. But unlike
clustering, here the data analysts know the different classes in advance. So, in classification
analysis you apply algorithms to decide how new data should be classified. A classic example of
classification analysis is Outlook email, which uses certain algorithms to characterize an email
as legitimate or spam.
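To illustrate classifying new data into predefined classes, here is a tiny naive Bayes text classifier. This is not a description of how Outlook actually filters mail; it is one well-known classification algorithm applied to invented training messages, and the function names `train` and `classify` are hypothetical.

```python
import math
from collections import Counter

def train(labelled_messages):
    """Count words per class and messages per class."""
    word_counts = {}
    class_counts = Counter()
    for text, label in labelled_messages:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Return the class with the highest log-probability under naive Bayes."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_messages = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(class_counts[label] / total_messages)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data with the two predefined classes.
training = [
    ("win a free prize now", "spam"),
    ("free money claim now", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("project report attached", "legitimate"),
]
word_counts, class_counts = train(training)

first = classify("claim your free prize", word_counts, class_counts)
second = classify("monday project meeting", word_counts, class_counts)
```

With this training set, the first message is labelled "spam" and the second "legitimate". The classes were fixed before any new message arrived, which is the key difference from clustering.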
2. Association rule learning
It refers to the method that can help you identify interesting relations (dependency modeling)
between different variables in large databases. This technique can help you uncover hidden
patterns in the data that can be used to identify variables within the data and the co-occurrence
of different variables that appear very frequently in the dataset. Association rules are useful
for examining and forecasting customer behavior and are highly recommended for retail industry
analysis. This technique is used in shopping basket data analysis, product clustering, catalog
design, and store layout. In IT, programmers use association rules to build programs capable of
machine learning.
3. Anomaly or outlier detection
This refers to the observation of data items in a dataset that do not match an expected pattern
or an expected behavior. Anomalies are also known as outliers, novelties, noise, deviations, and
exceptions. Often they provide critical and actionable information. An anomaly is an item that
deviates considerably from the common average within a dataset or a combination of data.
Such items are statistically distant from the rest of the data, which indicates that something
out of the ordinary has happened and requires additional attention. This technique can be used in
a variety of domains, such as intrusion detection, system health monitoring, fraud detection,
fault detection, event detection in sensor networks, and detecting ecosystem disturbances.
Analysts often remove the anomalous data from the dataset to discover results with increased
accuracy.
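One simple way to flag items that deviate considerably from the average is a z-score test: measure how many standard deviations each value lies from the mean. The sketch below assumes this particular test, an arbitrary threshold of 2.0, invented login counts, and a hypothetical helper name `find_outliers`; other detection methods exist.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values are identical, so nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical daily login counts with one unusual day.
daily_logins = [52, 48, 50, 51, 49, 47, 53, 50, 250, 50]
outliers = find_outliers(daily_logins)
```

Only the value 250 is flagged in this data. Note that the outlier itself inflates the mean and the standard deviation, so on small datasets a robust measure such as the median may be a better baseline.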
4. Clustering analysis
A cluster is a collection of data objects that are similar to one another. Objects within the
same cluster are similar to each other and dissimilar or unrelated to the objects in other
clusters. Clustering analysis is the process of discovering groups and clusters in the data in
such a way that the degree of association between two objects is highest if they belong to the
same group and lowest otherwise. A result of this analysis can be used to create customer
profiles.
5. Regression analysis
In statistical terms, regression analysis is the process of identifying and analyzing the
relationship among variables. It can help you understand how the characteristic value of the
dependent variable changes when any one of the independent variables is varied. This means one
variable depends on another, but not vice versa. It is generally used for prediction and
forecasting.
All of these techniques can help analyze different data from different perspectives. Now you have
the knowledge to decide the best technique to summarize data into useful information –
information that can be used to solve a variety of business problems to increase revenue,
customer satisfaction, or decrease unwanted cost.
1. Classification: This analysis is used to retrieve important and relevant information about
data, and metadata. This data mining method helps to classify data in different classes.
2. Clustering: Clustering analysis is a data mining technique to identify data that are like each
other. This process helps to understand the differences and similarities between the data.
3. Regression: Regression analysis is the data mining method of identifying and analyzing the
relationship between variables. It is used to identify the likelihood of a specific variable,
given the presence of other variables.
4. Association Rules: This data mining technique helps to find the association between two
or more Items. It discovers a hidden pattern in the data set.
5. Outlier detection: This type of data mining technique refers to the observation of data items
in the dataset that do not match an expected pattern or expected behavior. This
technique can be used in a variety of domains, such as intrusion detection and fraud or
fault detection. Outlier detection is also called outlier analysis or outlier mining.
6. Sequential Patterns: This data mining technique helps to discover or identify similar patterns
or trends in transaction data for a certain period.
7. Prediction: Prediction uses a combination of the other data mining techniques, such as
trends, sequential patterns, clustering, and classification. It analyzes past events or
instances in the right sequence to predict a future event.
1. Tracking patterns. One of the most basic techniques in data mining is learning to recognize
patterns in your data sets. This is usually a recognition of some aberration in your data
happening at regular intervals, or an ebb and flow of a certain variable over time. For
example, you might see that your sales of a certain product seem to spike just before the
holidays, or notice that warmer weather drives more people to your website.
2. Classification. Classification is a more complex data mining technique that forces you to
collect various attributes together into discernable categories, which you can then use to
draw further conclusions, or serve some function. For example, if you’re evaluating data
on individual customers’ financial backgrounds and purchase histories, you might be able
to classify them as “low,” “medium,” or “high” credit risks. You could then use these
classifications to learn even more about those customers.
3. Association. Association is related to tracking patterns, but is more specific to dependently
linked variables. In this case, you’ll look for specific events or attributes that are highly
correlated with another event or attribute; for example, you might notice that when your
customers buy a specific item, they also often buy a second, related item. This is usually
what’s used to populate “people also bought” sections of online stores.
4. Outlier detection. In many cases, simply recognizing the overarching pattern can’t give
you a clear understanding of your data set. You also need to be able to identify anomalies,
or outliers in your data. For example, if your purchasers are almost exclusively male, but
during one strange week in July, there’s a huge spike in female purchasers, you’ll want to
investigate the spike and see what drove it, so you can either replicate it or better
understand your audience in the process.
5. Clustering. Clustering is very similar to classification, but involves grouping chunks of data
together based on their similarities. For example, you might choose to cluster different
demographics of your audience into different packets based on how much disposable
income they have, or how often they tend to shop at your store.
6. Regression. Regression, used primarily as a form of planning and modeling, is used to
identify the likelihood of a certain variable, given the presence of other variables. For
example, you could use it to project a certain price, based on other factors like availability,
consumer demand, and competition. More specifically, regression’s main focus is to help
you uncover the exact relationship between two (or more) variables in a given data set.
7. Prediction. Prediction is one of the most valuable data mining techniques, since it’s used
to project the types of data you’ll see in the future. In many cases, just recognizing and
understanding historical trends is enough to chart a somewhat accurate prediction of
what will happen in the future. For example, you might review consumers’ credit histories
and past purchases to predict whether they’ll be a credit risk in the future.
If the data set is not diverse, data mining results may not be accurate.
Integrating the information needed from heterogeneous databases and global information
systems can be complex.
Performance Issues:
Efficiency and Scalability of Data Mining Algorithms.
Parallel, distributed and incremental mining algorithms.
Supplemental Material:
https://youtube.be/grRwJ5jZBog
ASSESSMENT TASK
A. Create a Timeline
1. Read the article on "Weather Prediction Problem".
2. Create a timeline showing the evolution of data mining methods used in the weather
prediction problem.
3. Use Google Scholar to search for articles on weather prediction data mining
techniques.
4. Review the developments in weather prediction.
5. Present the historical developments in class.
B. Essay
1. Explain the potential application of data mining in unexplored areas such as
Natural resource mining
Cultural or historical artifacts
Marine biodiversity
2. If these areas can be explored using data mining techniques, what are the
potential outputs and outcomes from using Data Mining Techniques? Justify
your answer.
General Instructions:
1. Accomplish the quiz individually.
2. Submit your answers by taking clear pictures of them and sending them to your
teacher through Facebook Messenger and Gmail.
3. Submit your answers on or before February 12, 2021 (11:55 pm, Philippine Standard Time).
Proofread by:
Validated by:
Approved by: