Lecture 1 Data Mining
CITS3401
CITS5504
Wei Liu
School of Computer Science and Software Engineering
Faculty of Engineering, Computing and Mathematics
Acknowledgement: The lecture slides are adapted from the original slides from Han's textbook.
Administrative
Different websites
http://teaching.csse.uwa.edu.au/units/CITS3401
http://teaching.csse.uwa.edu.au/units/CITS5504
References:
Data Mining: Methods and Techniques, by A. Shawkat Ali and Saleh Wasimi, Thomson, 2007
Data Mining: The Textbook, by Charu C. Aggarwal, Springer, May 2015
Potential Applications
Other Applications
Text mining (news group, email, documents) and Web mining
Stream data mining
Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus
(public) lifestyle studies
Target marketing
Find clusters of model customers who share the same characteristics:
interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
Cross-market analysis: Find associations/correlations between product
sales, and predict based on such associations
Customer profiling: What types of customers buy what products
(clustering or classification)
Customer requirement analysis
Identify the best products for different groups of customers
Predict what factors will attract new customers
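The cross-market analysis item above (finding associations between product sales) can be illustrated with a minimal sketch. The transactions are made up for illustration, and `support` and `confidence` are hypothetical helper names, not functions from any particular library.

```python
from itertools import combinations
from collections import Counter

# Hypothetical market-basket data: each set is one customer transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) over the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Count how often each pair of products is bought together.
pair_counts = Counter(
    pair
    for t in transactions
    for pair in combinations(sorted(t), 2)
)

print(pair_counts.most_common(2))
print(support({"bread", "milk"}, transactions))
print(confidence({"butter"}, {"milk"}, transactions))
```

Pairs that are bought together often, and rules with high confidence, are candidates for the kind of prediction the slide mentions; real data sets would need a minimum-support threshold to keep the number of candidate pairs manageable.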
Provision of summary information:
Multidimensional summary reports
Statistical summary information (data central tendency and variation)
Resource planning
Summarize and compare the resources and spending
Competition
Monitor competitors and market directions
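As a small illustration of the statistical summary information mentioned above (central tendency and variation), the following sketch uses Python's standard `statistics` module. The spending figures are invented for the example.

```python
import statistics

# Hypothetical monthly spending figures for one customer segment.
spending = [120.0, 135.5, 98.0, 150.0, 110.5, 135.5]

summary = {
    # Central tendency
    "mean": statistics.mean(spending),
    "median": statistics.median(spending),
    "mode": statistics.mode(spending),
    # Variation
    "stdev": statistics.stdev(spending),  # sample standard deviation
    "range": max(spending) - min(spending),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```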
Anti-terrorism:
Evolution of Sciences
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
Summary:
Data is abundant, but data archives are seldom visited.
The volume of data far exceeds human ability for comprehension.
Intuitive decisions are prone to biases and errors, and are
extremely time-consuming and costly.
Data mining tools perform data analysis and uncover important
data patterns, contributing greatly to business strategies,
knowledge bases, and scientific and medical research.
Figure: from data tombs to nuggets of knowledge
Scalable
Data mining involves integration of multiple disciplines:
Machine learning
Pattern recognition
Statistics
Databases
Business Intelligence
Big data
Efficient: Derived knowledge is new, interesting, informative and
can be used for sophisticated application (decision making,
process control, information management, etc.)
Figure: Data Mining at the centre of Machine Learning, Pattern Recognition, Statistics, Algorithm, Visualization, and Other Disciplines
Figure: knowledge discovery process, from bottom to top: Databases, Data Cleaning and Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation
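The stages named above (cleaning, integration, selection, mining, pattern evaluation) can be read as a pipeline. The sketch below is only one way to illustrate that flow: the records, the function names, and the choice of a simple frequency count as the "mining" step are all assumptions made for the example.

```python
from collections import Counter

# Two hypothetical source "databases" with inconsistent, partly dirty records.
store_a = [{"customer": "ann", "item": "TV"}, {"customer": "bob", "item": None}]
store_b = [{"customer": "Ann", "item": "tv"}, {"customer": "cy", "item": "Radio"}]

def clean(records):
    """Data cleaning: drop records with a missing item, normalise case."""
    return [
        {"customer": r["customer"].lower(), "item": r["item"].lower()}
        for r in records
        if r["item"] is not None
    ]

def integrate(*sources):
    """Data integration: combine several cleaned sources into one store."""
    return [r for source in sources for r in source]

def select(records, field):
    """Selection: keep only the task-relevant attribute."""
    return [r[field] for r in records]

def mine(values):
    """Data mining: here simply a frequency count per value."""
    return Counter(values)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet a minimum count."""
    return {p: c for p, c in patterns.items() if c >= min_count}

warehouse = integrate(clean(store_a), clean(store_b))
patterns = evaluate(mine(select(warehouse, "item")), min_count=2)
print(patterns)
```

Each stage takes the previous stage's output as input, which mirrors the bottom-to-top flow of the figure.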
Data to be mined
Database data (extended-relational, object-oriented,
heterogeneous, legacy), data warehouse, transactional data,
stream, spatiotemporal, time-series, sequence, text and web, multimedia, graphs & social and information networks
Knowledge to be mined (or: Data mining functions)
Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
Descriptive vs. predictive data mining
Multiple/integrated functions and mining at multiple levels
Techniques utilized (methodologies)
Data-intensive, data warehouse (OLAP), machine learning,
statistics, pattern recognition, visualization, high-performance, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining,
stock market analysis, text mining, Web mining, etc.
Unstructured data
Data streams and sensor data
Text data and web data
Time-series data, temporal data, sequence data (incl. biosequences)
Graphs, social networks and information networks
Spatial data, spatiotemporal data and multimedia data
Relational Database
An Example - AllElectronics
Example of Queries
Data Warehouse
Transactional Database
Typical methods
Decision trees, naive Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-based
classification, logistic regression, etc.
Typical applications:
Credit card fraud detection, direct marketing, classifying stars,
diseases, web pages, etc.
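To make one of the listed methods concrete, here is a minimal sketch of naive Bayesian classification for categorical attributes, applied to a made-up direct-marketing data set. The attribute names, labels, and the `train` and `predict` helpers are hypothetical, and add-one (Laplace) smoothing is an assumption of this sketch rather than something the slides specify.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (features, label) pairs for direct marketing.
training = [
    ({"age": "young", "income": "high"}, "buys"),
    ({"age": "young", "income": "low"}, "no"),
    ({"age": "senior", "income": "high"}, "buys"),
    ({"age": "senior", "income": "low"}, "no"),
    ({"age": "young", "income": "high"}, "buys"),
]

def train(examples):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)  # (label, attribute) -> value counts
    domains = defaultdict(set)           # attribute -> distinct values seen
    for features, label in examples:
        for attr, value in features.items():
            value_counts[(label, attr)][value] += 1
            domains[attr].add(value)
    return class_counts, value_counts, domains

def predict(model, features):
    """Return the label maximising P(label) * product of P(value | label)."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total
        for attr, value in features.items():
            # Add-one smoothing avoids zero probabilities for unseen values.
            numerator = value_counts[(label, attr)][value] + 1
            score *= numerator / (count + len(domains[attr]))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(training)
print(predict(model, {"age": "senior", "income": "high"}))
print(predict(model, {"age": "young", "income": "low"}))
```

The "naive" part is the assumption that attributes are independent given the class, which is why the per-attribute probabilities are simply multiplied.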
Graph mining
Finding frequent subgraphs (e.g., chemical compounds), trees
(XML), substructures (web fragments)
Information network analysis
Social networks: actors (objects, nodes) and relationships (edges)
e.g., author networks in CS, terrorist networks
Multiple heterogeneous networks
A person could be in multiple information networks: friends, family,
classmates, etc.
Links carry a lot of semantic information: Link mining
Web mining
Web is a big information network: from PageRank to Google
Analysis of Web information networks
Web community discovery, opinion mining, usage mining, etc.
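Since PageRank is mentioned above as an example of analysing the Web as an information network, the following is a minimal power-iteration sketch of the idea. The four-page graph is invented, the damping factor 0.85 and the iteration count are conventional but assumed values, and the function expects every linked page to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Every link target must also be a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web graph.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

Pages that receive links from many (or from highly ranked) pages end up with a higher score, which is the sense in which links carry semantic information.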
Figure: Data Mining drawing on Applications, Algorithm, Pattern Recognition, Database Technology, Statistics, Visualization, and Distributed / cloud computing
Evaluation of Knowledge
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid on new or
test data with some degree of certainty, potentially useful, novel, or validates
some hypothesis that a user seeks to confirm
Loose coupling: Data is fetched from the DB/DW, but mining does not make use of the
data structures and optimization methods provided by the DB and DW. High scalability
is therefore difficult to achieve.
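A rough sketch of what loose coupling looks like in practice, using Python's standard `sqlite3` module as a stand-in for a DB or data warehouse. The `sales` table and its rows are hypothetical. The second query is included only for contrast: it pushes the same computation into the database engine instead of the application.

```python
import sqlite3
from collections import Counter

# Hypothetical in-memory sales table standing in for a DB or data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("ann", "tv"), ("bob", "tv"), ("ann", "radio"), ("cy", "tv")],
)

# Loose coupling: fetch every row, then do all counting in application memory.
rows = conn.execute("SELECT customer, item FROM sales").fetchall()
loose_counts = Counter(item for _, item in rows)

# For contrast: the same count computed inside the database engine, which can
# apply its own indexes and query optimisation.
pushed_counts = dict(
    conn.execute("SELECT item, COUNT(*) FROM sales GROUP BY item").fetchall()
)
conn.close()

print(loose_counts)
print(pushed_counts)
```

With loose coupling every row has to be transferred to and held by the mining application, which is why it scales poorly to large data sets.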
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space at multiple levels of
abstraction
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
User Interaction
Interactive mining
Background knowledge (integrity constraints & deduction rules)
Presentation and visualization of data mining results