
Data-Mining FINAL


Lecture 1 & 2

What is data mining?


After years of data mining research there is still no single, agreed-upon answer to this question.
A tentative definition:
Data mining is the use of efficient techniques for the analysis of very large collections of data and the
extraction of useful and possibly unexpected patterns in data.
Data Mining:
Data explosion problem:
Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
We are drowning in data, but starving for knowledge!
What is data mining?
Data mining is also called knowledge discovery and data mining (KDD)
Data mining is the extraction of useful patterns from data sources, e.g., databases, texts, the web, and images.
Patterns must be:
valid, novel, potentially useful, understandable
Knowledge Discovery:

Example of discovered patterns:


Association rules:
80% of customers who buy cheese and milk also buy bread, and 5% of customers buy all of them together
Cheese, Milk → Bread [support = 5%, confidence = 80%]
Origins of Data Mining:
Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
Traditional Techniques may be unsuitable due to
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data

Why do we need data mining?


Really, really huge amounts of raw data!!
In the digital age, terabytes of data are generated every second
Mobile devices, digital photographs, web documents.
Facebook updates, Tweets, Blogs, User-generated content
Transactions, sensor data, surveillance data
Queries, clicks, browsing
Cheap storage has made it possible to maintain this data
Need to analyze the raw data to extract knowledge
Why do we need data mining?
Data is power!
Today, the collected data is one of the biggest assets of an online company
Query logs of Google
The friendship and updates of Facebook
Tweets and follows of Twitter
Amazon transactions
We need a way to harness the collective intelligence
The data is also very complex:
Multiple types of data: tables, images, graphs, etc
Interconnected data of different types:
From a mobile phone we can collect the user's location, friendship information, check-ins to venues, opinions through Twitter, images through cameras, and queries to search engines
Example: transaction data:
Billions of real-life customers:
Credit card companies: billions of transactions per day.
Loyalty (point) cards allow companies to collect information about specific users
Example: document data:
Web as a document repository: an estimated 50 billion web pages
Wikipedia: 4 million articles (and counting)
Online news portals: steady stream of 100s of new articles every day
Example: network data:

Web: 50 billion pages linked via hyperlinks


Facebook: 500 million users
Twitter: 300 million users
Instant messenger: ~1billion users
Blogs: 250 million blogs worldwide, presidential candidates run blogs
Example: genomic sequences:
http://www.1000genomes.org/page.php
Full sequence of 1000 individuals
3×10^9 nucleotides per person → 3×10^12 nucleotides in total
Lots more data in fact: medical history of the persons, gene expression data
Example: environmental data:
Climate data (just an example)
http://www.ncdc.gov/oa/climate/ghcn-monthly/index.php
a database of temperature, precipitation and pressure records managed by the National Climatic Data
Center, Arizona State University and the Carbon Dioxide Information Analysis Center
6000 temperature stations, 7500 precipitation stations, 2000 pressure stations
Spatiotemporal data
So, what is Data?
Collection of data objects and their attributes
An attribute is a property or characteristic of an object
Examples: eye color of a person, temperature, etc.
Attribute is also known as variable, field, characteristic, or feature
A collection of attributes describe an object
Object is also known as record, point, case, sample, entity, or instance
Types of Attributes:
There are different types of attributes
Categorical
Examples: eye color, zip codes, words, rankings (e.g., good, fair, bad), height in {tall, medium, short}
Numeric
Examples: dates, temperature, time, length, value, count.
Discrete (counts) vs Continuous (temperature)
Special case: Binary attributes (yes/no, exists/not exists)
Numeric Record Data:
If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each dimension represents a distinct attribute
Such data set can be represented by an n-by-d data matrix, where there are
n rows, one for each object, and d columns, one for each attribute
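As a concrete illustration (not from the lecture), such an n-by-d data matrix can be held directly in a NumPy array, with one row per object and one column per attribute; the attribute names here are hypothetical:

    import numpy as np

    # Hypothetical data matrix: n = 4 objects (rows), d = 3 numeric attributes (columns),
    # e.g. age, height (cm) and weight (kg) of four people.
    X = np.array([
        [23, 170.0, 65.2],
        [31, 182.5, 80.1],
        [45, 160.3, 58.7],
        [29, 175.0, 72.4],
    ])

    n, d = X.shape      # n objects, d attributes
    print(n, d)         # 4 3
    print(X[0])         # the first object, a point in 3-dimensional space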

Categorical Data:
Data that consists of a collection of records, each of which consists of a fixed set of categorical attributes

Document Data:
Each document becomes a `term' vector,
each term is a component (attribute) of the vector,
the value of each component is the number of times the corresponding term occurs in the
document.
Bag-of-words representation: no ordering of the terms is preserved
             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0
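A minimal sketch (not part of the lecture, with made-up documents) of building such term vectors from raw text with a bag-of-words representation, ignoring word order:

    from collections import Counter

    documents = [
        "team game score game team lost",
        "coach season coach lost",
    ]

    # Vocabulary = all distinct terms; each document becomes a vector of term counts.
    vocabulary = sorted({term for doc in documents for term in doc.split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.split())
        vectors.append([counts[term] for term in vocabulary])

    print(vocabulary)
    for vec in vectors:
        print(vec)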
Transaction Data:
Each record (transaction) is a set of items.

A set of items can also be represented as a binary vector, where each attribute is an item.
A document can also be represented as a set of words (no counts)
Ordered Data:
Genomic sequence data

Data is a long ordered string


Graph Data:
Examples: Web graph and HTML Links

Types of data:
Numeric data: Each object is a point in a multidimensional space
Categorical data: Each object is a vector of categorical values
Set data: Each object is a set of values (with or without counts)
Sets can also be represented as binary vectors, or vectors of counts
Ordered sequences: Each object is an ordered sequence of values.
Graph data: a graph is an abstract data type that is meant to implement the undirected graph and directed
graph concepts from mathematics.
What can you do with the data?
Suppose that you are the owner of a supermarket and you have collected billions of market basket data.
What information would you extract from it and how would you use it?
What if this was an online store?

Suppose you are a biologist who has microarray expression data: thousands of genes, and their expression
values over thousands of different settings (e.g. tissues). What information would you like to get out of your
data?
Why Mine Data? Commercial Viewpoint:
Lots of data is being collected and warehoused
Web data, e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Computers have become cheaper and more powerful
Competitive Pressure is Strong
Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Mine Data? Scientific Viewpoint:
Data collected and stored at enormous speeds (GB/hour)
remote sensors on a satellite
telescopes scanning the skies
microarrays generating gene expression data
scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
in classifying and segmenting data
in Hypothesis Formation
What is Data Mining again?
Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to
summarize the data in novel ways that are both understandable and useful to the data analyst (Hand,
Mannila, Smyth)
Data mining is the discovery of models for data (Rajaraman, Ullman)
We can have the following types of models
Models that explain the data (e.g., a single function)
Models that predict the future data instances.
Models that summarize the data
Models that extract the most prominent features of the data.
Clustering Definition:
Given a set of data points, each having a set of attributes, and a similarity measure among them, find
clusters such that
Data points in one cluster are more similar to one another.
Data points in separate clusters are less similar to one another.
Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures.
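The slides do not spell the formula out here; for continuous attributes, the standard Euclidean distance between two d-dimensional objects x = (x1, ..., xd) and y = (y1, ..., yd) is
d(x, y) = sqrt( (x1 - y1)² + (x2 - y2)² + ... + (xd - yd)² )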

Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be
selected as a market target to be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on their geographical and lifestyle related
information.
Find clusters of similar customers.
Measure the clustering quality by observing buying patterns of customers in same cluster vs.
those from different clusters.
Clustering: Application 2
Bioinformatics applications:
Goal: Group genes and tissues together such that genes are co-expressed on the same tissues
Document Clustering:
Goal: To find groups of documents that are similar to each other based on the important terms
appearing in them.
Approach:
To identify frequently occurring terms in each document.
Form a similarity measure based on the frequencies of different terms.
Use it to cluster.
Gain:
Information Retrieval can utilize the clusters to relate a new document or search term to
clustered documents.
Illustrating Document Clustering:
Clustering Points: 3204 Articles of Los Angeles Times.
Similarity Measure: How many words are common in these documents (after some word filtering).

Frequent Itemsets and Association Rules:


Given a set of records each of which contain some number of items from a given collection;
Identify sets of items (itemsets) occurring frequently together
Produce dependency rules which will predict occurrence of an item based on occurrences of other
items.

Frequent Itemsets: Applications


Text mining: finding associated phrases in text
There are lots of documents that contain the phrases "association rules", "data mining" and "efficient algorithm"
Recommendations:
Users who buy this item often buy this item as well
Users who watched James Bond movies, also watched Jason Bourne movies.
Recommendations make use of item and user similarity
Association Rule Discovery: Application
Supermarket shelf management.
Goal: To identify items that are bought together by sufficiently many customers.
Approach: Process the point-of-sale data collected with barcode scanners to find dependencies
among items.
A classic rule:
If a customer buys diapers and milk, then he is very likely to buy beer.
So, don't be surprised if you find six-packs stacked next to diapers!
Sequential pattern mining:
Sequential pattern mining:
A sequential rule A → B says that event A will be immediately followed by event B with a certain confidence.
Regression:
Predict a value of a given continuous valued variable based on the values of other variables, assuming a
linear or nonlinear model of dependency.
Greatly studied in statistics, neural network fields.
Examples:
Predicting sales amounts of new product based on advertising expenditure.
Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
Time series prediction of stock market indices.
Deviation/Anomaly Detection:
Detect significant deviations from normal behavior
Discovering the most significant changes in data
Applications:
Credit Card Fraud Detection
Network Intrusion Detection
Challenges of Data Mining:
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data
Lecture 3 & 4
What is Data Mining?
Data mining is the use of efficient techniques for the analysis of very large collections of data and the
extraction of useful and possibly unexpected patterns in data.
Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to
summarize the data in novel ways that are both understandable and useful to the data analyst (Hand,
Mannila, Smyth)
Data mining is the discovery of models for data (Rajaraman, Ullman)
We can have the following types of models
Models that explain the data (e.g., a single function)
Models that predict the future data instances.
Models that summarize the data
Models that extract the most prominent features of the data.
Why do we need data mining?
Really huge amounts of complex data generated from multiple sources and interconnected in different ways
Scientific data from different disciplines
Huge text collections
Transaction data
Behavioral data
Networked data
All these types of data can be combined in many ways
We need to analyze this data to extract knowledge
Knowledge can be used for commercial or scientific purposes.
Our solutions should scale to the size of the data
#The data analysis pipeline:
Mining is not the only step in the analysis process

Preprocessing: real data is noisy, incomplete and inconsistent. Data cleaning is required to make sense of the
data
Techniques: Sampling, Dimensionality Reduction, Feature selection.
Dirty work, but often the most important step of the analysis.
Post-Processing: Make the data actionable and useful to the user
Statistical analysis of importance
Visualization.
Pre- and Post-processing are often data mining tasks as well
#Data Quality
Examples of data quality problems:
Noise and outliers
Missing values
Duplicate data

Why Data Preprocessing?


Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data
e.g., occupation = "" (left blank)
noisy: containing errors or outliers
e.g., Salary=-10
inconsistent: containing discrepancies in codes or names
e.g., Age = 42, Birthday = 03/07/1997
e.g., was rating "1, 2, 3", now rating "A, B, C"
e.g., discrepancy between duplicate records
Why Is Data Dirty?
Incomplete data may come from
Not applicable data value when collected
Different considerations between the time when the data was collected and when it is analyzed.
Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even misleading statistics.
Data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation comprises the majority of the work of building a data
warehouse
#Multi-Dimensional Measure of Data Quality:
A well-accepted multidimensional view:
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories:
Intrinsic, contextual, representational, and accessibility
#Major Tasks in Data Preprocessing:
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the same or similar analytical results
Data discretization
Part of data reduction but with particular importance, especially for numerical data
#Forms of Data Preprocessing:

Chapter 2: Data Preprocessing


Why preprocess the data?
Descriptive data summarization
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
#Mining Data Descriptive Characteristics
Motivation
To better understand the data: central tendency, variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
#Central Tendency:
A measure of central tendency is a value at the center or middle of a data set.
Mean, median, mode
#Terminology:
Population
A collection of items of interest in research
A complete set of things
A group that you wish to generalize your research to
An example: all the trees in Battle Park
Sample
A subset of a population
Its size is smaller than the size of the population
An example: 100 trees randomly selected from Battle Park
#Sample vs. Population

#Measures of Central Tendency Mean:


Mean: the most commonly used measure of central tendency
Average of all observations
The sum of all the scores divided by the number of scores
Note: Assuming that each observation is equally significant
#Measures of Central Tendency Mean:

Example I
- Data: 8, 4, 2, 6, 10
Mean = (8 + 4 + 2 + 6 + 10) / 5 = 30 / 5 = 6

Example II
Sample: 10 trees randomly selected from Battle Park
Diameter (inches):
9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5
Mean = 143.8 / 10 = 14.38 inches

Weighted Mean:
We can also calculate a weighted mean using some weighting factor:
e.g. What is the average income of all
people in cities A, B, and C:
City Avg. Income Population
A $23,000 100,000
B $20,000 50,000
C $25,000 150,000
Here, population is the weighting factor and the average income is the variable of interest
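The slide's calculation is not shown in these notes; working it through with the table above:
weighted mean = (23,000 × 100,000 + 20,000 × 50,000 + 25,000 × 150,000) / (100,000 + 50,000 + 150,000)
= 7,050,000,000 / 300,000 = $23,500
Note that the unweighted mean of the three city averages would be about $22,667, which ignores the very different city sizes.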

#Measures of Central Tendency Median:


Median: this is the value of a variable such that half of the observations are above and half are below
this value, i.e., it divides the distribution into two groups of equal size
When the number of observations is odd, the median is simply equal to the middle value
When the number of observations is even, we take the median to be the average of the two values in the
middle of the distribution

Example I
Data: 8, 4, 2, 6, 10 (mean: 6); sorted: 2, 4, 6, 8, 10, so the median is 6

Example II
Sample: 10 trees randomly selected from Battle Park
Diameter (inches):
9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5
(mean: 14.38); sorted, the two middle values are 13.9 and 14.5, so the median is (13.9 + 14.5) / 2 = 14.2

For calculation of the median in a continuous (grouped) frequency distribution the following formula is
employed. Algebraically:

Find the value of N/2 and find the cumulative frequency (cf) for the median class:
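The formula itself is not reproduced in these notes; the standard grouped-data median formula is:
Median = L + ((N/2 - cf) / f) × h
where L is the lower boundary of the median class, N the total frequency, cf the cumulative frequency of the class preceding the median class, f the frequency of the median class, and h the class width.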

Measures of Central Tendency Mode:


Mode: the most frequent value or score in the distribution.
It is defined as the value of the item in a series that occurs most often.
Example I

The exact value of mode can be obtained by the following formula.


Value that occurs most frequently in the data
Empirical formula:
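The formulas are not captured in these notes; the standard ones are:
Grouped-data mode: Mode = L + ((f1 - f0) / (2f1 - f0 - f2)) × h
where L is the lower boundary of the modal class, f1 its frequency, f0 and f2 the frequencies of the preceding and following classes, and h the class width.
Empirical relationship (for moderately skewed distributions): Mode ≈ 3 × Median - 2 × Mean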

Symmetric vs. Skewed Data:


Median, mean and mode of symmetric, positively and negatively skewed data

Data Skewed Right:

Here we see that the data is skewed to the right and the position of the Mean is to the right of the Median.
One may surmise that there are values spreading the data out at the high end, thereby pulling the mean upward.
Data Skewed left:

Here we see that the data is skewed to the left and the position of the Mean is to the left of the Median.
One may surmise that there are values spreading the data out at the low end, thereby pulling the mean downward.
Measuring the Dispersion of Data:
Quartiles, outliers and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 - Q1
Five number summary: min, Q1, M, Q3, max
Boxplot: ends of the box are the quartiles, median is marked, whiskers, and plot outliers individually
Outlier: usually, a value higher/lower than 1.5 × IQR
Variance and standard deviation (sample: s, population: σ)
Variance: (algebraic, scalable computation)

Standard deviation s (or σ) is the square root of variance s² (or σ²)


Summary Measures

Quartiles:
Quartiles split the ranked data into 4 segments with an equal number of values per segment

The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
Q2 is the same as the median (50% are smaller, 50% are larger)
Only 25% of the observations are greater than the third quartile
Find a quartile by determining the value in the appropriate position in the ranked data, where
First quartile position : Q1 at (n+1)/4
Second quartile position : Q2 at (n+1)/2 (median)
Third quartile position : Q3 at 3(n+1)/4
where n is the number of observed values
Interquartile Range:
Can eliminate some outlier problems by using the interquartile range
Eliminate some high- and low-valued observations and calculate the range from the remaining values
Interquartile range = 3rd quartile - 1st quartile
= Q3 - Q1
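A small worked example (not from the slides), using the data 8, 4, 2, 6, 10 seen earlier: sorted it is 2, 4, 6, 8, 10 (n = 5), so Q1 is at position (5+1)/4 = 1.5, i.e. Q1 = (2+4)/2 = 3; Q2 is at position 3, i.e. the median 6; Q3 is at position 4.5, i.e. Q3 = (8+10)/2 = 9; and IQR = 9 - 3 = 6.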
Quartiles:
Example 1: Find the median and quartiles for the data below.

Example 2: Find the median and quartiles for the data below.

Range:
Simplest measure of variation
Difference between the largest and the smallest observations:
Disadvantages: ignores the distribution of the data and is sensitive to outliers

Boxplot Analysis:
Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
Boxplot
Data is represented with a box
The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
The median is marked by a line within the box
Whiskers: two lines outside the box extend to Minimum and Maximum
#Drawing a Box Plot :
Example 1: Draw a Box plot for the data below

Example 2: Draw a Box plot for the data below

Question: Stuart recorded the heights in cm of boys in his class as shown below. Draw a box plot for this data.

Question: Gemma recorded the heights in cm of girls in the same class and constructed a box plot from the
data. The box plots for both boys and girls are shown below. Use the box plots to choose some correct
statements comparing heights of boys and girls in the class. Justify your answers.

Outliers:
Sometimes there are extreme values that are separated from the rest of the data. These extreme
values are called outliers. Outliers affect the mean.
The 1.5 × IQR Rule for Outliers
Call an observation an outlier if it falls more than 1.5 × IQR above the third quartile or below the first
quartile.
X < Q1 - 1.5 × IQR
X > Q3 + 1.5 × IQR
In the New York travel time data, we found Q1 = 15 minutes, Q3 = 42.5 minutes, and IQR = 27.5 minutes.
For these data, 1.5 × IQR = 1.5 × 27.5 = 41.25
Q1 - 1.5 × IQR = 15 - 41.25 = -26.25 (below zero, so no low outliers are possible)
Q3 + 1.5 × IQR = 42.5 + 41.25 = 83.75 (~80)
Any travel time longer than about 80 minutes (83.75) is considered an outlier.
Boxplots and outliers:

AFTER MID-TERM:
Lecture_05preprocessing
Measuring the Dispersion of Data:
Quartiles, outliers and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 - Q1
Five number summary: min, Q1, M, Q3, max
Boxplot: ends of the box are the quartiles, median is marked, whiskers, and plot outliers individually
Outlier: usually, a value higher/lower than 1.5 × IQR
Variance and standard deviation (sample: s, population: σ)
Variance: (algebraic, scalable computation)

Standard deviation s (or σ) is the square root of variance s² (or σ²)


Variance and Standard deviation:
While the central value of a sample is important, the variations around that value are often equally
important.
Recall that variability exists when some values are different from (above or below) the mean.
what is a typical deviation from the mean?
small values of this typical deviation indicate small variability in the data
large values of this typical deviation indicate large variability in the data
Each data value has an associated deviation from the mean, x - x̄.

The mean deviation is a measure of dispersion which calculates the distance between each data point and the mean and finds the average of these distances.

Variance is the average squared deviation from the mean of a set of data.
It is used to find the standard deviation.
Population variance
Sample variance
Calculating Variance:
1. Find the mean of the data.
Hint: the mean is the average, so add up the values and divide by the number of items.
2. Subtract the mean from each value; the result is called the deviation from the mean.
3. Square each deviation from the mean.
4. Find the sum of the squares.
5. Divide the total by the number of items.
The sample variance is defined as follows:
The short-cut sample variance is defined as follows:

The population variance is defined as follows:

where σ² is called the variance of the population
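The formulas themselves are not captured in these notes; the standard definitions are:
Sample variance: s² = Σ(x - x̄)² / (n - 1)
Short-cut (computational) form: s² = (Σx² - (Σx)²/n) / (n - 1)
Population variance: σ² = Σ(x - μ)² / N
(Note that the worked test-score example later in these notes divides by the number of items n, i.e. it uses the population form.)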


The mean is represented by x̄ for a sample and by μ for a population
Metabolic rates of 7 men (cal./24hr.) :
1792 1666 1362 1614 1460 1867 1439
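The slide's worked calculation is not captured in these notes; carrying it out: the mean is (1792 + 1666 + 1362 + 1614 + 1460 + 1867 + 1439) / 7 = 11200 / 7 = 1600 cal./24hr. The deviations from the mean are 192, 66, -238, 14, -140, 267 and -161; their squares sum to 214,870, so the sample variance is 214,870 / 6 ≈ 35,811.67 and the sample standard deviation is about 189.24 cal./24hr.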

The standard deviation is the positive square root of the variance:


Population standard deviation
Sample standard deviation
If the data is close together, the standard deviation will be small.
If the data is spread out, the standard deviation will be large.
Standard deviation of the population:

Standard deviation of the sample:

Standard deviation:
Find the mean of the data.
Subtract the mean from each value.
Square each deviation from the mean.
Find the sum of the squares.
Divide the total by the number of items.
Take the square root of the variance.
Question:The math test scores of five students are: 92,88,80,68 and 52.
1) Find the mean: (92+88+80+68+52)/5 = 76.
2) Find the deviation from the mean:
92-76=16
88-76=12
80-76=4
68-76= -8
52-76= -24
3) Square each deviation from the mean:
16² = 256, 12² = 144, 4² = 16, (-8)² = 64, (-24)² = 576
4) Find the sum of the squares of the deviation from the mean:
256+144+16+64+576= 1056
5) Divide by the number of data items to find the variance:
1056/5 = 211.2
6) Find the square root of the variance: √211.2 ≈ 14.53

Thus the standard deviation of the test scores is 14.53.
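A short Python sketch (not part of the lecture) reproducing this calculation; note that, like the worked example, it divides by the number of items n (the population form) rather than n - 1:

    import math

    scores = [92, 88, 80, 68, 52]

    mean = sum(scores) / len(scores)                           # 76.0
    deviations = [x - mean for x in scores]                    # 16, 12, 4, -8, -24
    variance = sum(d ** 2 for d in deviations) / len(scores)   # 1056 / 5 = 211.2
    std_dev = math.sqrt(variance)                              # ~14.53

    print(mean, variance, round(std_dev, 2))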


Empirical Rule
If the data has a bell-shaped (normal) distribution then Chebyshev's theorem can be improved.
The proportion (or fraction) of any data set lying
Within 1 standard deviation of the mean is about 68%.
Within 2 standard deviations of the mean is about 95%.
Within 3 standard deviations of the mean is about 99.7%.

Properties of Normal Distribution Curve:


The normal (distribution) curve
From μ - σ to μ + σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
From μ - 2σ to μ + 2σ: contains about 95% of them
From μ - 3σ to μ + 3σ: contains about 99.7% of them

Summary Measures:
LECTURE06: PREPROCESSING:
Graphic Displays of Basic Statistical Descriptions:
Histogram: (shown before)
Boxplot: (covered before)
Quantile plot: each value xi is paired with fi, indicating that approximately 100·fi % of the data are ≤ xi
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding
quantiles of another
Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
Loess (local regression) curve: add a smooth curve to a scatter plot to provide better perception of the
pattern of dependence
Boxplot Analysis:

Five-number summary of a distribution:


Minimum, Q1, M, Q3, Maximum
Boxplot
Data is represented with a box
The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
The median is marked by a line within the box
Whiskers: two lines outside the box extend to Minimum and Maximum

Positively and Negatively Correlated Data:


Positive correlation: the correlation is said to be positive when the values of the two variables change
in the same direction. Example: height & weight

Negative correlation: the correlation is said to be negative when the values of the two variables change
in opposite directions.
Linear Correlation:

Correlation Coefficient r:
A measure of the strength and direction of a linear relationship between two variables

The range of r is from -1 to 1.
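The slide's formula is not reproduced here; the standard (Pearson) definition for n paired observations (xi, yi) is:
r = Σ(xi - x̄)(yi - ȳ) / sqrt( Σ(xi - x̄)² × Σ(yi - ȳ)² )
r > 0 indicates a positive linear relationship, r < 0 a negative one, and values near 0 little or no linear relationship.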

Application:
Loess Curve:
Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
Loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials
that are fitted by the regression

LECTURE07: PREPROCESSING
Data Cleaning:
Importance
"Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing" (DCI survey)
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Missing Data:
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales
data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of entry
history or changes of the data were not registered
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective
when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Fill in it automatically with
a global constant: e.g., "unknown" (which in effect becomes a new class?!)
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based such as Bayesian formula or decision tree
Noisy Data:
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with possible outliers)
Data Transformation:
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones
Data Transformation: Normalization:
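The formulas are not captured in these notes; the standard forms are:
Min-max normalization to a new range [new_min, new_max]: v' = (v - min_A) / (max_A - min_A) × (new_max - new_min) + new_min
Z-score normalization: v' = (v - mean_A) / std_A
Decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1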

Data Reduction Strategies


Why data reduction?
A database/data warehouse may store terabytes of data
Complex data analysis/mining may take a very long time to run on the complete data set
Data reduction
Obtain a reduced representation of the data set that is much smaller in volume but yet produce the
same (or almost the same) analytical results
Data reduction strategies
Data cube aggregation:
Dimensionality reduction e.g., remove unimportant attributes
Data Compression
Numerosity reduction e.g., fit data into models
Discretization and concept hierarchy generation
Lecture 8 &_9_Apriory_Rule_mining
Apriori:
What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a
data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association
rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream)
analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
Discloses an intrinsic and important property of data sets
Forms the foundation for many essential data mining tasks
Association, correlation, and causality analysis
Sequential, structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
Classification: associative classification
Cluster analysis: frequent pattern-based clustering
Data warehousing: iceberg cube and cube-gradient
Semantic data compression: fascicles
Broad applications
Frequent Itemsets :

Given a set of transactions, find combinations of items (itemsets) that occur frequently

Applications (1):

Items = products; baskets = sets of products someone bought in one trip to the store.

Example application: given that many people buy beer and diapers together:

Run a sale on diapers; raise price of beer.

Only useful if many buy diapers & beer.

Applications (2):

Baskets = Web pages; items = words.

Example application: Unusual words appearing together in a large number of documents, e.g., Brad and
Angelina, may indicate an interesting relationship.

Applications (3):

Baskets = sentences; items = documents containing those sentences.

Example application: Items that appear together too often could represent plagiarism.

Notice items do not have to be in baskets.

Definition: Frequent Itemset


Association Rule Mining:

Definition: Association Rule

Rule Measures: Support and Confidence
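The definitions on this slide are not captured; the standard ones (consistent with the notation used later for lift) are:
support(A ⇒ B) = P(A ∪ B) = fraction of transactions that contain both A and B
confidence(A ⇒ B) = P(B | A) = support(A ∪ B) / support(A) = fraction of transactions containing A that also contain B
A rule is reported only if its support and confidence meet the minimum thresholds described below.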


Basic Concepts: Frequent Patterns and Association Rules

Example:

Association Rule Mining Task:

Given a set of transactions T, the goal of association rule mining is to find all rules having

support ≥ minsup threshold

confidence ≥ minconf threshold

Brute-force approach:

List all possible association rules

Compute the support and confidence for each rule

Prune rules that fail the minsup and minconf thresholds

Computationally prohibitive!

Mining Association Rules:

Two-step approach:

Frequent Itemset Generation

Generate all itemsets whose support ≥ minsup


Rule Generation

Generate high confidence rules from each frequent itemset, where each rule is a binary
partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive

Scalable Methods for Mining Frequent Patterns:

The Apriori Principle:

Illustration of the Apriori principle:

Illustration of the Apriori principle:

Apriori: A Candidate Generation-and-Test Approach


The Apriori AlgorithmAn Example:

Generating Candidates Ck+1 in SQL:


where p.item1 = q.item1 and p.item2 = q.item2 and p.item3 < q.item3
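A small Python sketch of the same join-and-prune idea (an illustration, not the lecture's own code) for generating Ck+1 from Lk; the itemsets and function name are hypothetical:

    from itertools import combinations

    def generate_candidates(frequent_k, k):
        """Join step: combine pairs of frequent k-itemsets (sorted tuples) that share
        their first k-1 items; prune step: drop candidates with an infrequent k-subset."""
        frequent_set = set(frequent_k)
        candidates = []
        for p, q in combinations(sorted(frequent_k), 2):
            if p[:k - 1] == q[:k - 1]:                 # p and q agree on the first k-1 items
                candidate = p + (q[k - 1],)            # p comes first, so p[k-1] < q[k-1]
                # Apriori pruning: every k-subset of the candidate must itself be frequent.
                if all(sub in frequent_set for sub in combinations(candidate, k)):
                    candidates.append(candidate)
        return candidates

    # Example: L3 = frequent 3-itemsets, generate C4 candidates.
    L3 = [("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"), ("B", "C", "D")]
    print(generate_candidates(L3, 3))   # [('A', 'B', 'C', 'D')]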

Example II:
Lecture_10_Apriory_Rule_mining(MIDTERM)

Step 1: scan all of the transactions in order to count the number of occurrences of each item, giving C1

Step 2: Determine the set of frequent 1-itemsets L1

Step 3: Generate C2- candidate 2-itemsets


Step 4: Generate C2- candidate 2-itemsets and scan the database

Step 4: Determine the set of frequent 2-itemsets L2

Step 5: Generate C3- candidate 3-itemsets (pruning)

Step 5: Generate C3- candidate 3-itemsets (pruning)

Step 5: Generate C3- candidate 3-itemsets


Step 6: Determine the set of frequent 3-itemsets L3

Step 7: Determine the set of frequent 3-itemsets L3

Generate Association Rule:


Once the frequent itemsets from database have been found, it is straightforward to generate strong
association rules from them
strong association rules are association rules that satisfy both minimum support and minimum confidence

Suppose the data contain the frequent itemset l={I1,I2,I5}


The nonempty subsets of l are:
{I1, I2},
{I1, I5},
{I2, I5},
{I1},
{I2},
{I5}
Suppose the data contain the frequent itemset l={I1,I2,I5}
The resulting association rules are as shown below, each listed with its confidence:
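The resulting rules, one per nonempty proper subset of l used as antecedent, together with the confidence expression for each (the actual percentages depend on transaction counts that are not reproduced in these notes), are:
I1 ∧ I2 ⇒ I5, confidence = sup{I1,I2,I5} / sup{I1,I2}
I1 ∧ I5 ⇒ I2, confidence = sup{I1,I2,I5} / sup{I1,I5}
I2 ∧ I5 ⇒ I1, confidence = sup{I1,I2,I5} / sup{I2,I5}
I1 ⇒ I2 ∧ I5, confidence = sup{I1,I2,I5} / sup{I1}
I2 ⇒ I1 ∧ I5, confidence = sup{I1,I2,I5} / sup{I2}
I5 ⇒ I1 ∧ I2, confidence = sup{I1,I2,I5} / sup{I5}
Only the rules whose confidence meets the minimum confidence threshold are reported as strong.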

Lecture_11_Apriory_TID
Discovering Large Itemsets:
Multiple passes over the data
First pass count the support of individual items.
Subsequent pass
Generate candidates using the previous pass's large itemsets.
Go over the data and check the actual support of the candidates.
Stop when no new large itemsets are found.
Algorithm Apriori:
Problem?

Apriori Problem:

Database of transactions is massive


Millions of transactions can be added every hour
Passing through the database is expensive
In later passes many transactions don't contain any candidate large itemsets
We don't need to check those transactions
Advantage:
C^k could be smaller than the database.
If a transaction does not contain any k-itemset candidates, then it will be excluded from C^k.
For large k, each entry may be smaller than the transaction
The transaction might contain only few candidates.
Disadvantage:
Memory Overhead
For small k, each entry may be larger than the corresponding transaction.
An entry includes all k-itemsets contained in the transaction.
AprioriTid Example:

FP Mining with Vertical Data Format:


Both Apriori and FP-growth use horizontal data format

Alternatively data can also be represented in vertical format


Transform the horizontally formatted data to the vertical format by scanning the database once
The support count of an itemset is simply the length of the TID_set of the itemset

Determine support of any k-itemset by intersecting tid-lists of two of its (k-1) subsets

The frequent k-itemsets can be used to construct the candidate (k+1)-itemsets based on the Apriori property
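A minimal Python sketch (not from the lecture, with hypothetical TID sets) of the vertical representation and of computing a 2-itemset's support by intersecting the tid-lists of its items:

    # Vertical format: item -> set of transaction IDs (TID_set) containing the item.
    vertical = {
        "I1": {"T100", "T400", "T500", "T700", "T800", "T900"},
        "I2": {"T100", "T200", "T300", "T400", "T600", "T800", "T900"},
        "I5": {"T100", "T800"},
    }

    # Support count of {I1, I2} = size of the intersection of the two tid-lists.
    tid_i1_i2 = vertical["I1"] & vertical["I2"]
    print(sorted(tid_i1_i2), len(tid_i1_i2))   # ['T100', 'T400', 'T800', 'T900'] 4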


Mining Various Kinds of Association Rules:


Mining multilevel association
Miming multidimensional association
Mining quantitative association
Mining interesting correlation patterns
Multilevel association rules involve concepts at different levels of abstraction.
Multidimensional association rules involve more than one dimension or predicate
e.g., rules relating what a customer buys as well as the customer's age.
Quantitative association rules involve numeric attributes that have an implicit ordering among values
(e.g., age).
Mining Multiple-Level Association Rules:
It is difficult to find interesting purchase patterns at the level of individual items
Consider an AllElectronics store, showing the items purchased for each transaction.
IBM-ThinkPad-R40/P4M or Symantec-Norton-Antivirus-2003 occur in only a very small fraction of the
transactions
making it difficult to find strong associations

Data can be generalized by replacing low-level concepts within the data by their higher-level concepts.

strong associations between generalized abstractions of the items


IBM laptop computer and antivirus software.

Items often form hierarchies
Flexible support settings
Items at the lower level are expected to have lower support
Exploration of shared multi-level mining
Multi-Dimensional Association: Concepts

Strong Rules Are Not Necessarily Interesting:

Misleading strong association rule:

Association Analysis to Correlation Analysis:


Computing Interestingness Measure:
Given a rule X → Y, information needed to compute rule interestingness can be obtained from a contingency
table

Example: Lift/Interest
Lift is a simple correlation measure
Occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A) × P(B)
Lift of two itemsets A and B:
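The formula is not reproduced in these notes; the standard definition is:
lift(A, B) = P(A ∪ B) / (P(A) × P(B)) = confidence(A ⇒ B) / support(B)
A lift of 1 means A and B are independent, lift > 1 indicates positive correlation, and lift < 1 indicates negative correlation. For the swim/bike examples below this gives 0.42/0.42 = 1, 0.5/0.42 ≈ 1.19 and 0.3/0.42 ≈ 0.71 respectively.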

Statistical Independence:
Population of 1000 students
600 students know how to swim (S)
700 students know how to bike (B)
420 students know how to swim and bike (S,B)
P(S, B) = 420/1000 = 0.42
P(S) × P(B) = 0.6 × 0.7 = 0.42
P(S, B) = P(S) × P(B) => Statistical independence
Population of 1000 students
600 students know how to swim (S)
700 students know how to bike (B)
500 students know how to swim and bike (S,B)
P(S, B) = 500/1000 = 0.5
P(S) × P(B) = 0.6 × 0.7 = 0.42
P(S, B) > P(S) × P(B) => Positively correlated
Population of 1000 students
600 students know how to swim (S)
700 students know how to bike (B)
300 students know how to swim and bike (S,B)
P(S, B) = 300/1000 = 0.3
P(S) × P(B) = 0.6 × 0.7 = 0.42
P(S, B) < P(S) × P(B) => Negatively correlated
Example: Lift/Interest
Example: Lift/Interest

Example: Lift/Interest

Lecture_17_18_ClusteringI
What is Cluster Analysis?
Finding groups of objects such that the objects in a group will be similar (or related) to one another and
different from (or unrelated to) the objects in other groups

Cluster Analysis:
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to the characteristics found in the data and grouping
similar data objects into clusters
Unsupervised learning: no predefined classes
What Is Good Clustering?
A good clustering method will produce high quality clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the similarity measure used by the method and its
implementation.
The quality of a clustering method is also measured by its ability to discover some or all of the hidden
patterns.
Application of Clustering:
Applications of clustering algorithm includes
Pattern Recognition
Spatial Data Analysis
Image Processing
Economic Science (especially market research)
Web analysis and classification of documents
Classification of astronomical data and classification of objects found in an archaeological study
Medical science
Requirements of Clustering in Data Mining:
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
Outliers:
Outliers are objects that do not belong to any cluster or form clusters of very small cardinality
In some applications we are interested in discovering outliers, not clusters (outlier analysis)
Distance:

Major Clustering Approach:


Partitioning approach
Construct various partitions and then evaluate them by some criterion
Typical methods:
k-means,
k-medoids,
Squared Error Clustering Algorithm
Nearest neighbor algorithm
Hierarchical approach
Hierarchical methods obtain a nested partition of the objects resulting in a tree of clusters.
Typical methods:
BIRCH(Balanced Iterative Reducing and Clustering Using Hierarchies),
ROCK(A Hierarchical Clustering Algorithm for Categorical Attributes).
Chameleon(A Hierarchical Clustering Algorithm Using Dynamic Modeling).
Density-based approach
Based on connectivity and density functions
Typical methods:
Density based methods include DBSCAN(A Density-Based Clustering Method on Connected
Regions with Sufficiently High Density),
OPTICS( Ordering Points to Identify the Clustering Structure), DENCLUE(Clustering Based on
Density Distribution Functions)
Grid-based approach
Based on a multiple-level granularity structure
Typical methods:
STING(Statistical Information Grid),
WaveCluster (Clustering Using Wavelet Transformation)
STING, WaveCluster and CLIQUE are some examples of grid-based methods.

Clustering Algorithm(K-means):
K-means Algorithm: The K-means algorithm may be described as follows
1. Select the number of clusters. Let this number be K
2. Pick K seeds as centroids of the k clusters. The seeds may be picked randomly unless the user has
some insight into the data.
3. Compute the Euclidean distance of each object in the dataset from each of the centroids.
4. Allocate each object to the cluster it is nearest to, based on the distances computed in the previous
step.
5. Compute the centroids of the clusters by computing the means of the attribute values of the objects
in each cluster.
6. Check whether the stopping criterion has been met (e.g., the cluster membership is unchanged); if yes, go to
step 7. If not, go to step 3.
7. [optional] One may decide to stop at this stage or to split a cluster or combine two clusters
heuristically until a stopping criterion is met.
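A compact Python sketch of these steps (an illustration, not the lecture's own code), using Euclidean distance and randomly picked initial seeds:

    import math
    import random

    def kmeans(points, k, max_iter=100, seed=0):
        """Basic K-means: points is a list of equal-length numeric tuples."""
        rng = random.Random(seed)
        centroids = rng.sample(points, k)                     # step 2: pick k seeds
        for _ in range(max_iter):
            # steps 3-4: assign each point to the nearest centroid (Euclidean distance)
            clusters = [[] for _ in range(k)]
            for p in points:
                distances = [math.dist(p, c) for c in centroids]
                clusters[distances.index(min(distances))].append(p)
            # step 5: recompute each centroid as the mean of its cluster
            new_centroids = [
                tuple(sum(coords) / len(cluster) for coords in zip(*cluster)) if cluster else centroids[i]
                for i, cluster in enumerate(clusters)
            ]
            if new_centroids == centroids:                    # step 6: stop when centroids no longer change
                break
            centroids = new_centroids
        return centroids, clusters

    data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.2, 0.8)]
    centroids, clusters = kmeans(data, k=2)
    print(centroids)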
K-means Example:
Consider the data about students. The only attributes are the age and the three marks

Steps 1 and 2: Let the three seeds be the first three students.


Now compute the distances
Based on these distances, each student is allocated to the nearest cluster.
Use the new cluster means to recompute the distance of each object to each of the means, again allocating
each object to the nearest cluster.
No changes in membership occur.
We are done.

Example of K-Mean:

K-Means:
Strengths
Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t
<< n.
Often terminates at a local optimum.
Weaknesses
Applicable only when mean is defined (what about categorical data?)
Need to specify k, the number of clusters, in advance
Trouble with noisy data and outliers
Not suitable to discover clusters with non-convex shapes
K-means Example:
The results of the k-means method depend strongly on the initial guesses of the seeds.
The k-means method can be sensitive to outliers. If an outlier is picked as a starting seed, it may end
up in a cluster of its own. Also if an outlier moves from one cluster to another during iterations, it
can have a major impact on the clusters because the means of the two clusters are likely to change
significantly.
Although some local optimum solutions discovered by the K-means method are satisfactory, often
the local optimum is not as good as the global optimum.
The K-means method does not consider the size of the clusters. Some clusters may be large and
some very small.
The K-means does not deal with overlapping clusters.
Nearest Neighbor Algorithm:
An algorithm similar to the single link technique is called the nearest neighbor algorithm.
With this serial algorithm, items are iteratively merged into the existing clusters that are closest.
In this algorithm a threshold, t, is used to determine whether items will be added to existing clusters or whether a new
cluster is created.
Nearest Neighbor Algorithm Example:

A is placed in a cluster by itself: K1={A}

Consider B, should it be added to K1 or form a new cluster?


Dist(A, B) = 1, which is less than the threshold value 2
So K1={A, B}

Nearest Neighbor Algorithm Example:


For C we calculate distance from both A and B.
Dist(AB, C)= min{dist(A, C), Dist(B, C)}
Dist(AB, C)=2
So K1={A, B, C}

Dist(ABC, D)= min{Dist(A, D), Dist(B, D),Dist(C, D)}


=min{2,4,1} =1
So K1={A, B, C, D}

Dist(ABCD, E) = min{Dist(A, E), Dist(B, E), Dist(C, E), Dist(D, E)}

= min{3, 3, 5, 3} = 3, which is greater than the threshold value.
So K1={A, B, C, D}
And K2={E}
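A small Python sketch of this threshold-based nearest neighbor procedure (an illustration, not the lecture's own code); the 2-D points are hypothetical and only roughly match the A-E example above:

    import math

    def nearest_neighbor_clustering(points, threshold):
        """Assign each point to the cluster containing its nearest already-clustered
        point if that distance is <= threshold; otherwise start a new cluster."""
        clusters = []
        for p in points:
            best_cluster, best_dist = None, float("inf")
            for cluster in clusters:
                d = min(math.dist(p, q) for q in cluster)   # single-link style distance
                if d < best_dist:
                    best_cluster, best_dist = cluster, d
            if best_cluster is not None and best_dist <= threshold:
                best_cluster.append(p)
            else:
                clusters.append([p])
        return clusters

    points = [(1.0, 1.0), (1.0, 2.0), (3.0, 1.0), (3.0, 2.0), (4.0, 4.0)]
    print(nearest_neighbor_clustering(points, threshold=2))   # first four points cluster together, last is alone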
