Er. Nawaraj Bhandari
Data Warehouse/Data Mining
Chapter 1:
Introduction to Data Mining and Data
Warehousing
Course Title: Data Warehousing and Data Mining
(BSC-CSIT, TU)
Course no: CSC-459
Credit hours: 3
Nature of course: Theory (3 Hrs.) + Lab (3 Hrs.)
Full Marks: 60+20+20
Pass Marks: 24+8+8
Prerequisite: C, Data Structure, Database
Course Overview
Types of database processing
• OLTP - On-line transaction processing
- It is a class of programs that facilitate and manage
transaction-oriented applications.
- It is used to support daily business operations.
• OLAP - On-line analytical processing
- It is a way of viewing data in a multidimensional
format.
- It is used to support decision making.
Transform “Data” into “Information”
• A data warehouse provides a multidimensional view of an organization’s
operational (OLTP) data to help users make faster, better-informed
decisions.
OLTP vs. Data Warehouse (OLAP)

Property                 OLTP                  Data Warehouse
Nature of the database   3NF                   Multidimensional
Indexes                  Few                   Many
Joins                    Many                  Some
Duplicate data           Normalized            Denormalized
Aggregate data           Rare                  Common
Nature of queries        Mostly simple         Mostly complex
Updates                  All the time          Not allowed, only refreshed
Historical data          Often not available   Essential
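The difference in query shape can be illustrated with a small sketch using Python's built-in sqlite3 module. The table name, columns and figures are hypothetical, and both queries run against the same in-memory table purely to contrast them; a real warehouse would be a separate, read-mostly store.

```python
import sqlite3

# Hypothetical sales table, held in memory for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, year, amount) VALUES (?, ?, ?)",
    [
        ("East", 2022, 100.0),
        ("East", 2023, 150.0),
        ("West", 2022, 80.0),
        ("West", 2023, 120.0),
    ],
)

# OLTP-style work: a simple statement that touches one row by its key.
conn.execute("UPDATE sales SET amount = 110.0 WHERE id = 1")
one_row = conn.execute("SELECT amount FROM sales WHERE id = 1").fetchone()

# OLAP-style work: scan many rows and aggregate them.
by_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```

Here `one_row` holds a single updated value, while `by_region` holds one summarized total per region.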
Supermarket Systems
[Figure: supermarket systems, showing a customer with a loyalty card at the point of sale, an OLTP system for point of sale, a stock taking and reordering database, a customer records database, and a web server and database for on-line shopping, connected by a LAN and by the Internet and a VPN or WAN.]
Activity – Identify the Types of Data Being
Collected and Used Here
And… What Benefits Come from Bringing this
Data Together?
• Sales trends
• Customer buying habits
• Regional variations
• Variations by time
• Goods generating profit
What is a Data Warehouse?
A data warehouse is:
• Subject-oriented
• Integrated
• Time-variant
• Non-volatile
Subject Orientation
A data warehouse can be used to analyse a particular subject area.
[Figure: a subject-oriented data warehouse organized around subjects such as supplier, customer, product and buying.]
Integration
OLTP system                      Data warehouse
App1 - m, f
App2 - 1, 0                      m, f
App3 - male, female

App1 - date (yymmdd)
App2 - date (mmddyy)             Date (ddmmyy)
App3 - date (ddmmyy)

A data warehouse integrates data from multiple data sources. For example, data source A and
data source B may have different ways of identifying a product, but in the data warehouse there will be
only a single way of identifying a product.
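The gender and date examples on this slide can be sketched in Python as follows. The function names and the source labels "app1" to "app3" are hypothetical, and the slide does not say which of 1 and 0 stands for male in App2, so that mapping is an assumption.

```python
from datetime import datetime

# Per-source gender codes mapped to the warehouse convention (m, f).
# Assumption: in App2, 1 means male and 0 means female.
GENDER_CODES = {
    "app1": {"m": "m", "f": "f"},
    "app2": {"1": "m", "0": "f"},
    "app3": {"male": "m", "female": "f"},
}

# Per-source date layouts, as listed on the slide.
DATE_FORMATS = {
    "app1": "%y%m%d",
    "app2": "%m%d%y",
    "app3": "%d%m%y",
}

# The warehouse stores every date as ddmmyy.
WAREHOUSE_DATE_FORMAT = "%d%m%y"


def integrate_gender(source, value):
    """Translate a source-specific gender code into the warehouse code."""
    return GENDER_CODES[source][str(value).lower()]


def integrate_date(source, value):
    """Re-encode a source-specific date string as ddmmyy."""
    parsed = datetime.strptime(value, DATE_FORMATS[source])
    return parsed.strftime(WAREHOUSE_DATE_FORMAT)
```

For instance, `integrate_date("app1", "240315")` and `integrate_date("app2", "031524")` both describe 15 March 2024 and both come out as the same warehouse string.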
Time Variant
OLTP system:
• time horizon – 60-90 days,
depending on business
• key will not usually have an
element of time
• data can be changed
Data warehouse:
• time horizon – long term,
5-10 years
• key will contain an
element of time
• data cannot be changed
All data in the data warehouse is identified with a
particular time period. For example, a transaction system may hold only the most recent address
of a customer, whereas a data warehouse can hold all addresses associated with a customer.
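The address example can be sketched in Python to show what "the key contains an element of time" means in practice. The customer number, addresses, dates and function names are all hypothetical.

```python
from datetime import date

# OLTP-style table: keyed by customer only, so an update overwrites the old value.
oltp_addresses = {}

# Warehouse-style table: the key includes a time element, so history accumulates.
warehouse_addresses = {}


def record_address(customer_id, address, effective_from):
    """Store a new address in both tables."""
    oltp_addresses[customer_id] = address
    warehouse_addresses[(customer_id, effective_from)] = address


def address_as_of(customer_id, when):
    """Return the address valid on `when`, or None if none is recorded yet."""
    candidates = [
        (start, addr)
        for (cid, start), addr in warehouse_addresses.items()
        if cid == customer_id and start <= when
    ]
    return max(candidates)[1] if candidates else None


record_address(42, "12 Lake Road", date(2019, 1, 10))
record_address(42, "7 Hill Street", date(2022, 6, 1))
```

After the two calls, the OLTP-style table only knows the latest address, while `address_as_of(42, date(2020, 1, 1))` can still recover the earlier one from the warehouse-style table.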
Non-Volatile
Operational system: create, update, retrieve, delete
Data warehouse: load, access
Data is stable in a data warehouse. More data is added,
but data is never removed. This enables management to
gain a consistent picture of the business.
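One way to picture the load/access restriction is an append-only store. The sketch below is illustrative only: the class name `WarehouseTable` and the sample rows are hypothetical, and real warehouses enforce this through their loading processes rather than a Python class.

```python
class WarehouseTable:
    """Append-only store: rows can be loaded and read, never updated or deleted."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        # Loading only ever adds rows; copying keeps later caller changes out.
        self._rows.extend(dict(row) for row in rows)

    def access(self, predicate=lambda row: True):
        # Return copies so that callers cannot modify the stored rows.
        return [dict(row) for row in self._rows if predicate(row)]


sales = WarehouseTable()
sales.load([{"day": "2024-01-01", "amount": 120}])
sales.load([{"day": "2024-01-02", "amount": 80}])
```

Because `access` hands back copies, changing a returned row leaves the stored data untouched, which is the stability the slide describes.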
Functionalities of Data Warehouse
A data warehouse is characterized by a relatively low volume of
transactions. Its queries are often very complex and involve
aggregations.
The basic operations in OLAP are:
1. Roll-Up(Consolidation)
2. Drill-down
3. Slicing
4. Dicing
5. Pivot
Roll-Up (Consolidation)
Roll-up performs aggregation on a data cube, so that data is
summarized with increased generalization. It is performed in either of the following ways:
• By climbing up a concept hierarchy for a dimension.
• By dimension reduction.
[Figure: roll-up operation]
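Both forms of roll-up can be sketched in plain Python. The fact rows, the city-to-country hierarchy and the function names are hypothetical and stand in for a real cube.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, country, quarter, units_sold).
sales = [
    ("Kathmandu", "Nepal", "Q1", 100),
    ("Pokhara", "Nepal", "Q1", 60),
    ("Delhi", "India", "Q1", 200),
    ("Kathmandu", "Nepal", "Q2", 90),
    ("Delhi", "India", "Q2", 150),
]


def roll_up_city_to_country(rows):
    """Climb the location hierarchy one level: city -> country."""
    totals = defaultdict(int)
    for _city, country, quarter, units in rows:
        totals[(country, quarter)] += units
    return dict(totals)


def roll_up_drop_time(rows):
    """Dimension reduction: remove the time dimension entirely."""
    totals = defaultdict(int)
    for city, _country, _quarter, units in rows:
        totals[city] += units
    return dict(totals)
```

In the first function the two Nepalese cities merge into one country-level cell per quarter; in the second, each city keeps a single total across all quarters.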
Drill-Down
Drill-down is the reverse of roll-up. It is performed in either of the following ways:
• By stepping down a concept hierarchy for a dimension.
• By introducing a new dimension.
[Figure: drill-down operation]
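A minimal Python sketch of drilling down a time hierarchy from quarter to month. The rows and the helper `total_by` are hypothetical; the point is that the detailed view is computed from the same facts as the summary.

```python
from collections import defaultdict

# Hypothetical fact rows: (quarter, month, units_sold).
sales = [
    ("Q1", "Jan", 40),
    ("Q1", "Feb", 35),
    ("Q1", "Mar", 25),
    ("Q2", "Apr", 50),
    ("Q2", "May", 30),
]


def total_by(rows, level):
    """Aggregate units at the given level: 0 = quarter, 1 = month."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[level]] += row[2]
    return dict(totals)


by_quarter = total_by(sales, 0)  # summarized view
by_month = total_by(sales, 1)    # drilled-down view of the same data
```

The monthly totals add up to the quarterly totals, since drill-down only exposes more detail and does not change the underlying measures.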
Slice
The slice operation performs a selection on one particular dimension of a given
cube and provides a new sub-cube.
[Figure: slice operation]
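As a rough Python sketch, a cube can be held as a dictionary keyed by dimension values, and a slice fixes one of those dimensions. The cube contents and the function `slice_cube` are hypothetical.

```python
# Hypothetical cube cells keyed by (location, quarter, item) -> units sold.
cube = {
    ("Kathmandu", "Q1", "Mobile"): 100,
    ("Kathmandu", "Q2", "Mobile"): 90,
    ("Delhi", "Q1", "Mobile"): 200,
    ("Delhi", "Q1", "Modem"): 40,
    ("Delhi", "Q2", "Modem"): 55,
}


def slice_cube(cells, axis, value):
    """Fix one dimension to a single value and drop it from the key."""
    return {
        key[:axis] + key[axis + 1:]: units
        for key, units in cells.items()
        if key[axis] == value
    }


# Slice on the time dimension (axis 1) for quarter Q1.
q1 = slice_cube(cube, 1, "Q1")
```

The result has one dimension fewer than the original cube: its keys are (location, item) pairs for Q1 only.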
Dice
The dice operation performs a selection on two or more dimensions of a given cube and
provides a new sub-cube.
[Figure: dice operation]
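A dice can be sketched in Python as a filter on several dimensions at once. The cube contents and the function `dice_cube` are hypothetical; unlike a slice, every dimension is kept in the result.

```python
# Hypothetical cube cells keyed by (location, quarter, item) -> units sold.
cube = {
    ("Kathmandu", "Q1", "Mobile"): 100,
    ("Kathmandu", "Q2", "Mobile"): 90,
    ("Delhi", "Q1", "Mobile"): 200,
    ("Delhi", "Q1", "Modem"): 40,
    ("Delhi", "Q2", "Modem"): 55,
}


def dice_cube(cells, allowed):
    """Keep cells whose value on every listed axis is in the allowed set."""
    return {
        key: units
        for key, units in cells.items()
        if all(key[axis] in values for axis, values in allowed.items())
    }


# Select on three dimensions: location, quarter and item.
sub_cube = dice_cube(cube, {0: {"Delhi"}, 1: {"Q1", "Q2"}, 2: {"Modem"}})
```

The sub-cube keeps the full three-part keys but contains only the cells that satisfy all of the conditions.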
Pivot
The pivot operation is also known as rotation. It rotates the data
axes in view in order to provide an alternative presentation of
the data.
[Figure: pivot operation]
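For a two-dimensional view, pivoting amounts to swapping rows and columns. The sketch below uses a hypothetical location-by-quarter view held as nested dictionaries.

```python
# Hypothetical 2-D view: rows are locations, columns are quarters.
view = {
    "Kathmandu": {"Q1": 100, "Q2": 90},
    "Delhi": {"Q1": 200, "Q2": 150},
}


def pivot(table):
    """Swap the row and column axes; the cell values are unchanged."""
    rotated = {}
    for row_label, columns in table.items():
        for col_label, value in columns.items():
            rotated.setdefault(col_label, {})[row_label] = value
    return rotated


rotated = pivot(view)
```

Pivoting twice returns the original view, which reflects that the operation changes only the presentation and not the data.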
Overview of the KDD Process
• The term Knowledge Discovery in Databases, or KDD for
short, refers to the broad process of finding knowledge in
data, and emphasizes the "high-level" application of
particular data mining methods.
• It is of interest to researchers in machine learning, pattern
recognition, databases, statistics, artificial intelligence,
knowledge acquisition for expert systems, and data
visualization.
• The unifying goal of the KDD process is to extract
knowledge from data in the context of large databases.
Developing an understanding of:
• the application domain
• the relevant prior knowledge
• the goals of the end-user
Creating a target data set:
• Selecting a data set, or
• Focusing on a subset of variables or data samples on which
discovery is to be performed.
1. Data cleaning
• Removal of noise or outliers.
• Cleaning also detects syntax errors.
• A parser decides whether a given string of data is
acceptable within the data specification.
2. Data Integration
Where multiple data sources are combined.
3. Data Selection
Where data relevant to the analysis task are
retrieved from the database.
4. Transformation
Where data are transformed or consolidated into
forms appropriate for mining by performing
summary or aggregation operations, for instance.
5. Data Mining:
An essential process where intelligent methods are
applied to extract data patterns.
6. Pattern Evaluation:
To identify the truly interesting patterns representing
knowledge, based on some measures.
7. Knowledge Representation:
Where visualization and knowledge representation
techniques are used to present the mined knowledge
to the users.
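The seven steps above can be sketched end to end as a toy Python pipeline. Everything here is an assumption made for illustration: the record layout, the two store sources, the choice of item-pair counting as the mining method, and support (the fraction of baskets containing a pair) as the evaluation measure.

```python
import re
from collections import Counter

# Assumed data specification: "<customer id>,<item>,<quantity>".
RECORD = re.compile(r"^(\d+),([a-z]+),(\d+)$")

# Hypothetical raw data from two sources; one line has a syntax error.
store_a = ["1,bread,2", "1,milk,1", "2,bread,1", "2,milk;1"]
store_b = ["3,bread,1", "3,milk,2", "3,eggs,12", "4,eggs,6"]


def clean(lines):
    """Step 1: keep only lines that parse under the specification."""
    rows = []
    for line in lines:
        match = RECORD.match(line.strip())
        if match:
            rows.append((int(match.group(1)), match.group(2), int(match.group(3))))
    return rows


def integrate(*sources):
    """Step 2: combine rows from several sources."""
    return [row for source in sources for row in source]


def select(rows, items):
    """Step 3: keep only rows relevant to the analysis."""
    return [row for row in rows if row[1] in items]


def transform(rows):
    """Step 4: consolidate rows into one basket (set of items) per customer."""
    baskets = {}
    for customer, item, _qty in rows:
        baskets.setdefault(customer, set()).add(item)
    return baskets


def mine(baskets):
    """Step 5: count how many baskets contain each pair of items."""
    pairs = Counter()
    for items in baskets.values():
        ordered = sorted(items)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs[(first, second)] += 1
    return pairs


def evaluate(pairs, basket_count, min_support):
    """Step 6: keep pairs whose support reaches the threshold."""
    return {
        pair: count / basket_count
        for pair, count in pairs.items()
        if count / basket_count >= min_support
    }


rows = integrate(clean(store_a), clean(store_b))
baskets = transform(select(rows, {"bread", "milk", "eggs"}))
patterns = evaluate(mine(baskets), len(baskets), min_support=0.5)

# Step 7: present the mined knowledge to the user.
for (first, second), support in sorted(patterns.items()):
    print(f"{first} + {second}: support {support:.2f}")
```

With this sample data, the malformed line is dropped during cleaning and only the bread and milk pair reaches the 0.5 support threshold.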
Major Issues in Data Warehousing
Building a data warehouse is difficult and often painful. It is
challenging, but it is a fabulous project to be involved in, because
when data warehouses work properly, they are magnificently useful,
huge fun and unbelievably rewarding.
Some of the major issues involved in building a data warehouse are
discussed below:
• General Issues
• Technical Issues
• Cultural Issues
General Issues
These include, but are not limited to, the following:
• What kind of analysis do the business users want to perform?
• Do you currently collect the data required to support that analysis?
• How clean is the data?
• Are there multiple sources for similar data?
• What structure is best for the core data warehouse (i.e., dimensional
or relational)?
Technical Issues
These include, but are not limited to, the following:
• How much data are you going to ship around your network, and will
it be able to cope?
• How much disk space will be needed?
• How fast does the disk storage need to be?
• Are you going to use SSDs to store “hot” data (i.e., frequently
accessed information)?
• What database and data management technology expertise already
exists within the company?
Cultural Issues
These include, but are not limited to, the following:
• How do data definitions differ between your operational systems?
Different departments and business units often use their own
definitions of terms like “customer,” “sale” and “order” within
their systems, so you will need to standardize the definitions and add
prefixes such as “all sales,” “recent sales,” “commercial sales” and
so on.
• What is the process for gathering business requirements? Some
people will not want to spend time with you. Instead, they will expect
you to use your telepathic powers to divine their warehousing and
data analysis needs.
Applications of Data Warehousing
Information processing, analytical processing, and data mining are the
three types of data warehouse applications. They are discussed below:
Information Processing - A data warehouse allows the data stored in it
to be processed by means of querying, basic
statistical analysis, and reporting using crosstabs, tables, charts, or
graphs.
Analytical Processing - A data warehouse supports analytical
processing of the information stored in it. The data can be analyzed
by means of basic OLAP operations, including slice-and-dice, drill
down, drill up, and pivoting.
Data Mining - Data mining supports knowledge discovery by finding
hidden patterns and associations, constructing analytical models,
and performing classification and prediction. The mining results can be
presented using visualization tools.
Application of Data Mining
Market Analysis and Management: Target marketing, customer relationship
management, market basket analysis, cross selling, and market
segmentation. Find clusters of customers who share the same
characteristics, such as interests, income level and spending habits. Determine
customer purchasing patterns over time.
Risk Analysis and Management: Forecasting, customer retention,
improved underwriting, quality control, competitive analysis, credit
scoring.
Fraud Detection and Management: Use historical data to build models
of fraudulent behavior and use data mining to help identify similar
instances. For example, detect suspicious money transactions.
Sports: Data mining can be used to analyze the shots and fouls of different
athletes and their weaknesses, and so help athletes improve their
games.
Space Science: Data mining can be used to automate the analysis of
image data collected from sky surveys with better accuracy.
Internet Web Surf-Aid: Surf-Aid applies data mining algorithms to Web
access logs for market-related pages to discover customer preferences
and behavior, analyze the effectiveness of Web marketing,
improve Web site organization, etc.
Social Web and Networks: There is a growing number of highly popular
user-centric applications such as blogs, wikis and Web communities that
generate a lot of structured and semi-structured information. In these
applications, data mining can be used to explain and predict the
evolution of social networks, to personalize search for social interaction,
and to predict user behavior.