Data-Mining FINAL
Categorical Data:
Data that consists of a collection of records, each of which consists of a fixed set of categorical attributes
Document Data:
Each document becomes a 'term' vector,
each term is a component (attribute) of the vector,
the value of each component is the number of times the corresponding term occurs in the
document.
Bag-of-words representation: no ordering of terms is preserved.
The same idea as a document-term table (term counts per document):

Term:        team  coach  play  ball  score  game  win  lost  timeout  season
Document 1:   3     0      5     0     2      6     0    2     0        2
Document 2:   0     7      0     2     1      0     0    3     0        0
Document 3:   0     1      0     0     1      2     2    0     3        0
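A minimal sketch (not from the lecture) of how such term vectors can be built; the toy documents and whitespace tokenization are assumptions for illustration:

```python
from collections import Counter

# Toy corpus; any real tokenizer could be substituted here.
docs = [
    "team play play score game game lost season",
    "coach ball score lost lost",
]

# Fixed vocabulary = the attributes (components) of each term vector.
vocab = sorted({term for d in docs for term in d.split()})

# Each document becomes a vector of term counts; word order is discarded.
vectors = [[Counter(d.split())[term] for term in vocab] for d in docs]

for v in vectors:
    print(v)
```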
Transaction Data:
Each record (transaction) is a set of items.
A set of items can also be represented as a binary vector, where each attribute is an item.
A document can also be represented as a set of words (no counts)
Ordered Data:
Genomic sequence data
Types of data:
Numeric data: Each object is a point in a multidimensional space
Categorical data: Each object is a vector of categorical values
Set data: Each object is a set of values (with or without counts)
Sets can also be represented as binary vectors, or vectors of counts
Ordered sequences: Each object is an ordered sequence of values.
Graph data: a graph is an abstract data type that is meant to implement the undirected graph and directed
graph concepts from mathematics.
What can you do with the data?
Suppose that you are the owner of a supermarket and you have collected billions of market-basket records.
What information would you extract from it and how would you use it?
What if this was an online store?
Suppose you are a biologist who has microarray expression data: thousands of genes, and their expression
values over thousands of different settings (e.g. tissues). What information would you like to get out of your
data?
Why Mine Data? Commercial Viewpoint:
Lots of data is being collected
and warehoused
Web data, e-commerce
purchases at department/
grocery stores
Bank/Credit Card
transactions
Computers have become cheaper and more powerful
Competitive Pressure is Strong
Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Mine Data? Scientific Viewpoint:
Data collected and stored at
enormous speeds (GB/hour)
remote sensors on a satellite
telescopes scanning the skies
microarrays generating gene
expression data
scientific simulations
generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
in classifying and segmenting data
in Hypothesis Formation
What is Data Mining again?
Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to
summarize the data in novel ways that are both understandable and useful to the data analyst (Hand,
Mannila, Smyth)
Data mining is the discovery of models for data (Rajaraman, Ullman)
We can have the following types of models
Models that explain the data (e.g., a single function)
Models that predict the future data instances.
Models that summarize the data
Models that extract the most prominent features of the data.
Clustering Definition:
Given a set of data points, each having a set of attributes, and a similarity measure among them, find
clusters such that
Data points in one cluster are more similar to one another.
Data points in separate clusters are less similar to one another.
Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures.
Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be
selected as a market target to be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on their geographical and lifestyle related
information.
Find clusters of similar customers.
Measure the clustering quality by observing buying patterns of customers in same cluster vs.
those from different clusters.
Clustering: Application 2
Bioinformatics applications:
Goal: Group genes and tissues together such that genes are co-expressed on the same tissues
Document Clustering:
Goal: To find groups of documents that are similar to each other based on the important terms
appearing in them.
Approach:
To identify frequently occurring terms in each document.
Form a similarity measure based on the frequencies of different terms.
Use it to cluster.
Gain:
Information Retrieval can utilize the clusters to relate a new document or search term to
clustered documents.
Illustrating Document Clustering:
Clustering Points: 3204 Articles of Los Angeles Times.
Similarity Measure: How many words are common in these documents (after some word filtering).
Preprocessing: real data is noisy, incomplete, and inconsistent. Data cleaning is required to make sense of the data.
Techniques: Sampling, Dimensionality Reduction, Feature selection.
Dirty work, but it is often the most important step of the analysis.
Post-Processing: Make the data actionable and useful to the user
Statistical analysis of importance
Visualization.
Pre- and Post-processing are often data mining tasks as well
#Data Quality
Examples of data quality problems:
Noise and outliers
Missing values
Duplicate data
Weighted Mean:
We can also calculate a weighted mean using some weighting factor:
e.g. What is the average income of all
people in cities A, B, and C:
City Avg. Income Population
A $23,000 100,000
B $20,000 50,000
C $25,000 150,000
Here, population is the weighting factor and the average income is the variable of interest
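To make this concrete, here is a short sketch checking the example; the figures come directly from the table above:

```python
incomes = [23_000, 20_000, 25_000]        # average income per city (A, B, C)
weights = [100_000, 50_000, 150_000]      # population = weighting factor

# Weighted mean = sum(w_i * x_i) / sum(w_i)
weighted_mean = sum(w * x for w, x in zip(weights, incomes)) / sum(weights)
print(weighted_mean)  # 23500.0
```

Note that the unweighted mean of the three city averages would be about $22,667, so weighting by population changes the answer.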
Example I
Data: 8, 4, 2, 6, 10 (mean: 6)
Example II
Sample: 10 trees randomly selected from Battle Park
Diameter (inches):
9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5
(mean: 14.38)
For calculation of the median in a continuous frequency distribution, the following formula is employed. Algebraically,
Median = L + ((n/2 − cf) / f) × h
where L is the lower boundary of the median class, cf is the cumulative frequency of the class preceding the median class, f is the frequency of the median class, and h is the class width.
Data Skewed right:
Here we see that the data is skewed to the right and the position of the Mean is to the right of the Median.
One may surmise that there is data that is tending to spread the data out at the high end, thereby
affecting the value of the mean.
Data Skewed left:
Here we see that the data is skewed to the left and the position of the Mean is to the left of the Median.
One may surmise that there is data that is tending to spread the data out at the low end, thereby
affecting the value of the mean.
Measuring the Dispersion of Data:
Quartiles, outliers and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 − Q1
Five number summary: min, Q1, M, Q3, max
Boxplot: ends of the box are the quartiles, median is marked, whiskers, and plot outlier individually
Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
Variance and standard deviation (sample: s, population: σ)
Variance: (algebraic, scalable computation)
Quartiles:
Quartiles split the ranked data into 4 segments with an equal number of values per segment
The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger
Q2 is the same as the median (50% are smaller, 50% are larger)
Only 25% of the observations are greater than the third quartile
Find a quartile by determining the value in the appropriate position in the ranked data, where
First quartile position : Q1 at (n+1)/4
Second quartile position : Q2 at (n+1)/2 (median)
Third quartile position : Q3 at 3(n+1)/4
where n is the number of observed values
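A sketch of these position rules in Python, using the tree-diameter sample from above; interpolation conventions for fractional positions differ across textbooks and software, so other tools may give slightly different quartiles:

```python
def quartile(sorted_data, position):
    """Value at a (possibly fractional) 1-based rank, with linear interpolation."""
    lo = int(position) - 1                  # index of the lower neighbour
    frac = position - int(position)
    if frac == 0:
        return sorted_data[lo]
    return sorted_data[lo] + frac * (sorted_data[lo + 1] - sorted_data[lo])

data = sorted([9.8, 10.2, 10.1, 14.5, 17.5, 13.9, 20.0, 15.5, 7.8, 24.5])
n = len(data)
q1 = quartile(data, (n + 1) / 4)       # position 2.75 -> 10.025
q2 = quartile(data, (n + 1) / 2)       # position 5.5  -> 14.2 (median)
q3 = quartile(data, 3 * (n + 1) / 4)   # position 8.25 -> 18.125
print(q1, q2, q3)
```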
Interquartile Range:
Can eliminate some outlier problems by using the interquartile range
Eliminate some high- and low-valued observations and calculate the range from the remaining values
Interquartile range = 3rd quartile − 1st quartile = Q3 − Q1
Quartiles:
Example 1: Find the median and quartiles for the data below.
Example 2: Find the median and quartiles for the data below.
Range:
Simplest measure of variation
Difference between the largest and the smallest observations:
Disadvantages = ignores distribution of data and sensitive to outliers
Boxplot Analysis:
Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
Boxplot
Data is represented with a box
The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
The median is marked by a line within the box
Whiskers: two lines outside the box extend to Minimum and Maximum
#Drawing a Box Plot:
Example 1: Draw a Box plot for the data below
Question: Stuart recorded the heights in cm of boys in his class as shown below. Draw a box plot for this data.
Question: Gemma recorded the heights in cm of girls in the same class and constructed a box plot from the
data. The box plots for both boys and girls are shown below. Use the box plots to choose some correct
statements comparing heights of boys and girls in the class. Justify your answers.
Outliers:
Sometimes there are extreme values that are separated from the rest of the data. These extreme values are called outliers. Outliers affect the mean.
The 1.5 × IQR Rule for Outliers
Call an observation an outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile:
X < Q1 − 1.5 × IQR
X > Q3 + 1.5 × IQR
In the New York travel time data, we found Q1 = 15 minutes, Q3 = 42.5 minutes, and IQR = 27.5 minutes.
For these data, 1.5 × IQR = 1.5(27.5) = 41.25
Q1 − 1.5 × IQR = 15 − 41.25 = −26.25
Q3 + 1.5 × IQR = 42.5 + 41.25 = 83.75 (≈ 84)
Any travel time longer than about 84 minutes is considered an outlier; since travel times cannot be negative, there are no low outliers.
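A small sketch of the rule in code; Q1 and Q3 are taken from the example, while the travel-time values below are hypothetical:

```python
def iqr_outliers(values, q1, q3):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

# New York travel times from the example: Q1 = 15, Q3 = 42.5, IQR = 27.5.
times = [5, 10, 15, 30, 40, 60, 85]          # hypothetical observations
print(iqr_outliers(times, q1=15, q3=42.5))   # [85] -> beyond 83.75
```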
Boxplots and outliers:
AFTER MID-TERM:
Lecture_05preprocessing
Variance and standard deviation (sample: s, population: σ)
Variance: (algebraic, scalable computation)
Variance is the average squared deviation from the mean of a set of data.
It is used to find the standard deviation.
Population variance: σ² = Σ(x − μ)² / N
Sample variance: s² = Σ(x − x̄)² / (n − 1)
Calculating Variance:
1. Find the mean of the data.
Hint mean is the average so add up the values and divide by the number of items.
2. Subtract the mean from each value; the result is called the deviation from the mean.
3. Square each deviation from the mean.
4. Find the sum of the squares.
5. Divide the total by the number of items.
The sample variance is defined as follows: s² = Σ(x − x̄)² / (n − 1)
The short-cut sample variance is defined as follows: s² = (Σx² − (Σx)²/n) / (n − 1)
Standard deviation:
Find the mean of the data.
Subtract the mean from each value.
Square each deviation from the mean.
Find the sum of the squares.
Divide the total by the number of items.
Take the square root of the variance.
Question:The math test scores of five students are: 92,88,80,68 and 52.
1) Find the mean: (92+88+80+68+52)/5 = 76.
2) Find the deviation from the mean:
92-76=16
88-76=12
80-76=4
68-76= -8
52-76= -24
3) Square the deviations from the mean: 16² = 256, 12² = 144, 4² = 16, (−8)² = 64, (−24)² = 576
4) Find the sum of the squares of the deviation from the mean:
256+144+16+64+576= 1056
5) Divide by the number of data items to find the variance:
1056/5 = 211.2
6) Find the square root of the variance: √211.2 ≈ 14.53, so the standard deviation of the test scores is about 14.53.
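The whole calculation as a Python sketch; it reproduces the numbers above. This example divides by n (the population form); dividing by n − 1 instead would give the sample variance, 1056/4 = 264:

```python
import math

scores = [92, 88, 80, 68, 52]
n = len(scores)

mean = sum(scores) / n                         # 76.0
deviations = [x - mean for x in scores]        # 16, 12, 4, -8, -24
sum_sq = sum(d ** 2 for d in deviations)       # 1056
variance = sum_sq / n                          # 211.2 (population form)
std_dev = math.sqrt(variance)                  # ~14.53

print(mean, variance, round(std_dev, 2))
```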
Summary Measures:
LECTURE06: PREPROCESSING:
Graphic Displays of Basic Statistical Descriptions:
Histogram: (shown before)
Boxplot: (covered before)
Quantile plot: each value x_i is paired with a value f_i indicating that approximately 100·f_i % of the data are ≤ x_i
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding
quantiles of another
Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
Loess (local regression) curve: add a smooth curve to a scatter plot to provide better perception of the
pattern of dependence
Boxplot Analysis:
Negative Correlation: The correlation is said to be negative correlation when the values of variables change
with opposite direction.
Linear Correlation:
Correlation Coefficient r:
A measure of the strength and direction of a linear relationship between two variables; r = Σ(x − x̄)(y − ȳ) / sqrt(Σ(x − x̄)² · Σ(y − ȳ)²), with −1 ≤ r ≤ +1
Application:
Loess Curve:
Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
Loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials
that are fitted by the regression
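As a sketch, assuming the statsmodels library is available; its lowess exposes the smoothing parameter (frac) but, unlike full loess, not a configurable polynomial degree:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)   # noisy scatter data

# frac is the smoothing parameter: the fraction of points used in each
# local fit; smaller values follow the data more closely.
smoothed = lowess(y, x, frac=0.3)   # returns (x, fitted y) pairs, sorted by x
print(smoothed[:5])
```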
LECTURE07: PREPROCESSING
Data Cleaning:
Importance
"Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing" (DCI survey)
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Missing Data:
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales
data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of entry
not register history or changes of the data
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
a global constant: e.g., "unknown" (which may effectively form a new class!)
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based such as Bayesian formula or decision tree
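A sketch of the two mean-based strategies with pandas; the column names and values are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cls":    ["a", "a", "b", "b", "b"],
    "income": [50.0, np.nan, 30.0, 34.0, np.nan],
})

# Fill with the overall attribute mean.
overall = df["income"].fillna(df["income"].mean())

# Smarter: fill with the mean of samples belonging to the same class.
by_class = df["income"].fillna(df.groupby("cls")["income"].transform("mean"))

print(overall.tolist())   # NaNs -> 38.0 (mean of 50, 30, 34)
print(by_class.tolist())  # class a NaN -> 50.0, class b NaN -> 32.0
```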
Noisy Data:
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems that require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
first sort the data and partition it into (equal-frequency) bins
then one can smooth by bin means, by bin medians, or by bin boundaries, etc. (see the sketch after this list)
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with possible outliers)
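The binning sketch referenced above, on a small hypothetical data set; it partitions into equal-frequency bins, then smooths by bin means and by bin boundaries:

```python
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3
size = len(data) // n_bins            # equal-frequency (equal-depth) bins

bins = [data[i * size:(i + 1) * size] for i in range(n_bins)]

# Smooth by bin means: every value in a bin is replaced by the bin mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smooth by bin boundaries: each value snaps to the nearer bin edge.
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(bins)       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)   # [[9.0]*4, [22.75]*4, [29.25]*4]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```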
Data Transformation:
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones
Data Transformation: Normalization:
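A sketch of the three normalization schemes listed above (pure Python; the data values are hypothetical, and the z-score here uses the population standard deviation):

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max: linearly map values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Z-score: center by the mean, scale by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Divide by 10^j for the smallest j that maps all values into (-1, 1)."""
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

data = [200, 300, 400, 600, 1000]
print(min_max(data))          # [0.0, 0.125, 0.25, 0.5, 1.0]
print(z_score(data))
print(decimal_scaling(data))  # [0.02, 0.03, 0.04, 0.06, 0.1]
```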
Frequent Itemsets:
Given a set of transactions, find combinations of items (itemsets) that occur frequently
Applications (1):
Items = products; baskets = sets of products someone bought in one trip to the store.
Example application: given that many people buy beer and diapers together:
Applications (2):
Example application: Unusual words appearing together in a large number of documents, e.g., Brad and
Angelina, may indicate an interesting relationship.
Applications (3):
Example application: Items that appear together too often could represent plagiarism.
Example:
Given a set of transactions T, the goal of association rule mining is to find all rules having support ≥ minsup and confidence ≥ minconf
Brute-force approach:
Computationally prohibitive!
Two-step approach:
1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
2. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary
partitioning of a frequent itemset
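As a sketch, support and confidence of a single rule can be computed directly from a toy transaction database (the transactions below are hypothetical):

```python
# Toy transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule {diapers} -> {beer}: confidence = support(X u Y) / support(X)
s = support({"diapers", "beer"})
c = s / support({"diapers"})
print(s, c)   # 0.6, 0.75
```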
Example II:
Lecture_10_Apriory_Rule_mining(MIDTERM)
Step 1: scan all of the transactions in order to count the number of occurrences of each item, producing the candidate set C1
Lecture_11_Apriory_TID
Discovering Large Itemsets:
Multiple passes over the data
First pass count the support of individual items.
Subsequent passes:
Generate candidates using the previous pass's large itemsets.
Go over the data and check the actual support of the candidates.
Stop when no new large itemsets are found.
Algorithm Apriori:
Apriori Problem:
Determine support of any k-itemset by intersecting tid-lists of two of its (k-1) subsets
The frequent k-itemsets can be used to construct the candidate (k+1)-itemsets based on the Apriori property
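A compact Python sketch of the pass structure described above (not the lecture's pseudocode): candidates are generated from the previous pass's large itemsets, pruned via the Apriori property, then counted in a pass over the data. It returns support counts rather than rules:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all itemsets whose support count is >= minsup (a sketch)."""
    # First pass: count individual items (C1 -> L1).
    items = {i for t in transactions for i in t}
    counts = {frozenset([i]): sum(i in t for t in transactions) for i in items}
    large = {s for s, c in counts.items() if c >= minsup}
    result, k = {}, 1
    while large:
        result.update({s: counts[s] for s in large})
        k += 1
        # Candidate generation: join L_{k-1} with itself; keep size-k sets
        # whose (k-1)-subsets are all large (the Apriori property).
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in large for s in combinations(c, k - 1))}
        # Next pass over the data: count actual support of the candidates.
        counts = {c: sum(c <= set(t) for t in transactions) for c in candidates}
        large = {c for c, n in counts.items() if n >= minsup}
    return result

tx = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(apriori(tx, minsup=3))
```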
Data can be generalized by replacing low-level concepts within the data by their higher-level concepts.
Items often form hierarchies
Flexible support settings
Items at the lower level are expected to have lower support
Exploration of shared multi-level mining
Multi-Dimensional Association: Concepts
Example: Lift/Interest
Lift is a simple correlation measure.
Occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)·P(B)
Lift of two itemsets A and B: lift(A, B) = P(A ∪ B) / (P(A)·P(B)); lift = 1 indicates independence, lift > 1 positive correlation, lift < 1 negative correlation
Statistical Independence:
Population of 1000 students
600 students know how to swim (S)
700 students know how to bike (B)
420 students know how to swim and bike (S,B)
P(S,B) = 420/1000 = 0.42
P(S) × P(B) = 0.6 × 0.7 = 0.42
P(S,B) = P(S) × P(B) => Statistical independence (lift = 0.42/0.42 = 1)
Population of 1000 students
600 students know how to swim (S)
700 students know how to bike (B)
500 students know how to swim and bike (S,B)
P(S,B) = 500/1000 = 0.5
P(S) × P(B) = 0.6 × 0.7 = 0.42
P(S,B) > P(S) × P(B) => Positively correlated (lift = 0.5/0.42 ≈ 1.19)
Population of 1000 students
600 students know how to swim (S)
700 students know how to bike (B)
300 students know how to swim and bike (S,B)
P(S,B) = 300/1000 = 0.3
P(S) × P(B) = 0.6 × 0.7 = 0.42
P(S,B) < P(S) × P(B) => Negatively correlated (lift = 0.3/0.42 ≈ 0.71)
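The three cases above reduce to one line of arithmetic; a sketch:

```python
def lift(p_ab, p_a, p_b):
    """Lift of itemsets A and B: P(A,B) / (P(A) * P(B))."""
    return p_ab / (p_a * p_b)

print(lift(0.42, 0.6, 0.7))  # 1.0   -> independent
print(lift(0.50, 0.6, 0.7))  # ~1.19 -> positively correlated
print(lift(0.30, 0.6, 0.7))  # ~0.71 -> negatively correlated
```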
Lecture_17_18_ClusteringI
What is Cluster Analysis?
Finding groups of objects such that the objects in a group will be similar (or related) to one another and
different from (or unrelated to) the objects in other groups
Cluster Analysis:
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to the characteristics found in the data and grouping
similar data objects into clusters
Unsupervised learning: no predefined classes
What Is Good Clustering?
A good clustering method will produce high quality clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the similarity measure used by the method and its
implementation.
The quality of a clustering method is also measured by its ability to discover some or all of the hidden
patterns.
Application of Clustering:
Applications of clustering algorithm includes
Pattern Recognition
Spatial Data Analysis
Image Processing
Economic Science (especially market research)
Web analysis and classification of documents
Classification of astronomical data and classification of objects found in an archaeological study
Medical science
Requirements of Clustering in Data Mining:
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Able to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
Outliers:
Outliers are objects that do not belong to any cluster or form clusters of very small cardinality
In some applications we are interested in discovering outliers, not clusters (outlier analysis)
CLIQUE is one example of a grid-based clustering method.
Clustering Algorithm(K-means):
K-means Algorithm: The K-means algorithm may be described as follows
1. Select the number of clusters. Let this number be K
2. Pick K seeds as centroids of the k clusters. The seeds may be picked randomly unless the user has
some insight into the data.
3. Compute the Euclidean distance of each object in the dataset from each of the centroids.
4. Allocate each object to the cluster it is nearest to, based on the distances computed in the previous
step.
5. Compute the centroids of the clusters by computing the means of the attribute values of the objects
in each cluster.
6. Check if the stopping criterion has been met (e.g., the cluster membership is unchanged). If yes, go to
step 7; if not, go to step 3.
7. [optional] One may decide to stop at this stage or to split a cluster or combine two clusters
heuristically until a stopping criterion is met.
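A minimal Python implementation of steps 1-6 (random seeding and Euclidean distance, as described above; the sample points are hypothetical):

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Plain K-means following steps 1-6 above (a sketch)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # step 2: pick K seeds
    assignment = None
    for _ in range(max_iters):
        # Steps 3-4: assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:           # step 6: membership unchanged
            break
        assignment = new_assignment
        # Step 5: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, assignment

pts = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5)]
print(kmeans(pts, k=2))
```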
K-means Example:
Consider the data about students. The only attributes are the age and the three marks
Example of K-Means:
K-Means:
Strengths
Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t
<< n.
Often terminates at a local optimum.
Weaknesses
Applicable only when mean is defined (what about categorical data?)
Need to specify k, the number of clusters, in advance
Trouble with noisy data and outliers
Not suitable to discover clusters with non-convex shapes
K-means Example:
The results of the k-means method depend strongly on the initial guesses of the seeds.
The k-means method can be sensitive to outliers. If an outlier is picked as a starting seed, it may end
up in a cluster of its own. Also if an outlier moves from one cluster to another during iterations, it
can have a major impact on the clusters because the means of the two clusters are likely to change
significantly.
Although some local optimum solutions discovered by the K-means method are satisfactory, often
the local optimum is not as good as the global optimum.
The K-means method does not consider the size of the clusters. Some clusters may be large and
some very small.
The K-means does not deal with overlapping clusters.
Nearest Neighbor Algorithm:
An algorithm similar to the single link technique is called the nearest neighbor algorithm.
With this serial algorithm, items are iteratively merged into the existing cluster that is closest.
In this algorithm a threshold, t, is used to determine if items will be added to an existing cluster or if a new
cluster is created.
Nearest Neighbor Algorithm Example:
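A minimal sketch of the threshold-based procedure just described (the points and threshold t are hypothetical):

```python
import math

def nearest_neighbor_clustering(points, t):
    """Serial clustering: add each item to the cluster holding its nearest
    already-clustered neighbour if that distance is <= t, else start a
    new cluster (a sketch of the algorithm described above)."""
    clusters = [[points[0]]]
    for p in points[1:]:
        # Find the nearest already-clustered item and its cluster.
        dist, cluster = min(
            ((math.dist(p, q), c) for c in clusters for q in c),
            key=lambda pair: pair[0],
        )
        if dist <= t:
            cluster.append(p)
        else:
            clusters.append([p])
    return clusters

pts = [(0.0, 0.0), (0.5, 0.5), (5.0, 5.0), (5.2, 5.1), (0.2, 0.1)]
print(nearest_neighbor_clustering(pts, t=1.5))
# [[(0.0, 0.0), (0.5, 0.5), (0.2, 0.1)], [(5.0, 5.0), (5.2, 5.1)]]
```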