VTU 7th Sem CSE/ISE Data Warehousing & Data Mining Notes 10CS755/10IS74
I.A. Marks : 25
Exam Hours: 03
Exam Marks: 100
PART A
UNIT 1
6 Hours
Data Warehousing: Introduction, Operational Data Stores (ODS), Extraction Transformation Loading (ETL), Data Warehouses.
Design Issues, Guidelines for Data Warehouse Implementation, Data Warehouse Metadata
UNIT 2
6 Hours
Online Analytical Processing (OLAP): Introduction, Characteristics of OLAP systems, Multidimensional view and Data cube,
Data Cube Implementations, Data Cube Operations, Implementation of OLAP and an overview of OLAP software.
UNIT 3
6 Hours
Data Mining: Introduction, Challenges, Data Mining Tasks, Types of Data, Data Preprocessing, Measures of Similarity and
Dissimilarity, Data Mining Applications
UNIT 4
8 Hours
Association Analysis: Basic Concepts and Algorithms: Frequent Itemset Generation, Rule Generation, Compact Representation
of Frequent Itemsets, Alternative methods for generating Frequent Itemsets, FP Growth Algorithm, Evaluation of Association
Patterns
PART B
UNIT 5
6 Hours
Classification -1 : Basics, General approach to solve classification problem, Decision Trees, Rule Based Classifiers, Nearest
Neighbor Classifiers.
UNIT 6
6 Hours
Classification - 2 : Bayesian Classifiers, Estimating Predictive accuracy of classification methods, Improving accuracy of
classification methods, Evaluation criteria for classification methods, Multiclass Problem.
UNIT 7
8 Hours
Clustering Techniques: Overview, Features of cluster analysis, Types of Data and Computing Distance, Types of Cluster
Analysis Methods, Partitional Methods, Hierarchical Methods, Density Based Methods, Quality and Validity of Cluster Analysis
UNIT 8
6 Hours
Web Mining: Introduction, Web content mining, Text Mining, Unstructured Text, Text clustering, Mining Spatial and Temporal
Databases.
Text Books:
1. Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Data Mining, Addison-Wesley, 2005.
2. G. K. Gupta: Introduction to Data Mining with Case Studies, 3rd Edition, PHI, New Delhi, 2009.
We illustrate the multidimensional view of data using an example of a simple OLTP database that consists of three tables:
student(Student_id, Student_name, Country, DOB, Address)
enrolment(Student_id, Degree_id, SSemester)
degree(Degree_id, Degree_name, Degree_length, Fee, Department)
It is clear that the information given in Tables 8.3, 8.4 and 8.5, although suitable for a student
enrolment OLTP system, is not suitable for efficient management decision making.
The managers do not need information about individual students, the degree each is enrolled in, or
the semester each joined the university.
What the managers need are the trends in student numbers across different degree programs and
different countries.
We first consider only two dimensions. Let us say we are primarily interested in finding out how
many students from each country came to do a particular degree (Table 8.6). Therefore we may
visualize the data as two-dimensional, i.e.,
Country x Degree
Using this two-dimensional view we are able to find the number of students joining any degree from
any country (only for semester 2000-01). Other queries that we are quickly able to answer are:
How many students started BIT in 2000-01?
How many students joined from Singapore in 2000-01?
Tables 8.6, 8.7 and 8.8 together now form a three-dimensional cube. Table 8.7 provides totals
for the two semesters, and we are able to drill down to find the numbers for individual semesters.
Putting the three tables together gives a cube of 8 x 6 x 3 (= 144) cells including the totals along
every dimension.
A cube could be represented by:
Country x Degree x Semester
In the three-dimensional cube, the following eight types of aggregations or queries are
possible:
1. null (e.g. how many students are there? Only 1 possible query)
2. degrees (e.g. how many students are doing BSc? 5 possible queries if we assume 5 different
degrees)
3. semester (e.g. how many students entered in semester 2000-01? 2 possible queries if we
only have data about 2 semesters)
4. country (e.g. how many students are from the USA? 7 possible queries if there are 7
countries)
5. degrees, semester (e.g. how many students entered in 2000-01 to enroll in BCom? With 5
degrees and 2 different semesters 10 queries)
6. semester, country (e.g. how many students from the UK entered in 2000-01? With 7
countries and 2 different semesters 14 queries)
7. degrees, country (e.g. how many students from Singapore are enrolled in BCom? 7*5=35
queries)
8. all (e.g. how many students from Malaysia entered in 2000-01 to enroll in BCom? 7*5*2=70
queries)
Each cell in the cube represents a measure or aggregation.
In general, 2^n types of aggregations are possible for n dimensions.
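To see where the count 2^n comes from, note that each possible aggregation corresponds to choosing a subset of the dimensions to group by. A small Python sketch (illustrative only, using the three dimensions of the example) enumerates them:

```python
# Enumerate every GROUP BY combination of the example's three dimensions.
# Each subset of {Country, Degree, Semester} is one aggregation type, so
# there are 2**3 = 8 of them, matching the list above.
from itertools import combinations

dimensions = ["Country", "Degree", "Semester"]

groupings = [subset
             for k in range(len(dimensions) + 1)
             for subset in combinations(dimensions, k)]

for g in groupings:
    print(g if g else "()  <- the 'null' / all-students aggregation")

print("Total aggregation types:", len(groupings))  # prints 8
```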
DRILL DOWN
This is like zooming in on the data and is therefore the reverse of roll-up.
This is an appropriate operation when the user needs further details or when the user wants to
partition more finely or wants to focus on some particular values of certain dimensions.
This adds more details to the data.
Initially, the concept hierarchy for time was "day < month < quarter < year."
On drill-down, the time dimension is descended from the level of quarter to the level of month.
When a drill-down operation is performed, one or more dimensions may also be added to the data cube.
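As a rough illustration of roll-up versus drill-down (not taken from the notes; the tiny table below is made up), a pandas sketch:

```python
# Hypothetical enrolment counts; the roll-up aggregates away the Semester level,
# and the drill-down descends the time hierarchy to show per-semester detail again.
import pandas as pd

enrolment = pd.DataFrame({
    "Degree":   ["BIT", "BIT", "BSc", "BCom"],
    "Semester": ["2000-01", "2000-02", "2000-01", "2000-01"],
    "Count":    [15, 12, 20, 8],
})

rolled_up = enrolment.groupby("Degree")["Count"].sum()                   # coarser view
drilled_down = enrolment.groupby(["Degree", "Semester"])["Count"].sum()  # finer view

print(rolled_up)
print(drilled_down)
```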
PIVOT OR ROTATE
This is used when the user wishes to re-orient the view of the data cube.
This may involve
swapping the rows and columns, or
moving one of the row dimensions into the column dimension
In the example, the item and location axes of the 2-D slice are rotated.
DICE
The dice operation is similar to slice, but dicing does not involve reducing the number of dimensions.
A dice is obtained by performing a selection on two or more dimensions.
For example, a dice operation on a cube can be based on the following selection criteria involving three dimensions:
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item = "Mobile" or "Modem")
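If the cube is held as a simple fact table, the dice above amounts to restricting the values along each of the three dimensions. A hedged pandas sketch (the data itself is invented):

```python
# Dice: keep all three dimensions but restrict each to the selected values.
import pandas as pd

cube = pd.DataFrame({
    "location": ["Toronto", "Vancouver", "New York", "Toronto"],
    "time":     ["Q1", "Q2", "Q1", "Q3"],
    "item":     ["Mobile", "Modem", "Mobile", "Modem"],
    "sales":    [400, 150, 320, 210],
})

diced = cube[
    cube["location"].isin(["Toronto", "Vancouver"])
    & cube["time"].isin(["Q1", "Q2"])
    & cube["item"].isin(["Mobile", "Modem"])
]
print(diced)   # the resulting sub-cube still has location, time and item
```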
The input-data is stored in various formats such as flat files, spreadsheets or relational tables.
Purpose of preprocessing: to transform the raw input-data into an appropriate format for
subsequent analysis.
The steps involved in data-preprocessing include
combine data from multiple sources
clean data to remove noise & duplicate observations, and
select records & features that are relevant to the DM task at hand
Data-preprocessing is perhaps the most time-consuming step in the overall knowledge discovery
process.
"Closing the loop" refers to the process of integrating DM results into decision support systems.
Such integration requires a postprocessing step. This step ensures that only valid and useful
results are incorporated into the decision support system.
An example of postprocessing is visualization.
Visualization can be used to explore data and DM results from a variety of viewpoints.
Statistical measures can also be applied during postprocessing to eliminate bogus DM results.
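A minimal preprocessing sketch in Python, assuming pandas and made-up column names, just to make the three steps above concrete:

```python
# 1. combine data from multiple sources, 2. clean out duplicates and missing
# values, 3. select the features relevant to the mining task.
import pandas as pd

source_a = pd.DataFrame({"age": [25, 31, 25], "income": [40, 55, 40], "city": ["X", "Y", "X"]})
source_b = pd.DataFrame({"age": [44, None], "income": [70, 62], "city": ["Z", "Y"]})

data = pd.concat([source_a, source_b], ignore_index=True)  # combine sources
data = data.drop_duplicates()                              # remove duplicate observations
data = data.dropna()                                       # drop records with missing values
features = data[["age", "income"]]                         # keep task-relevant features
print(features)
```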
The Minkowski distance between two objects x and y with n attributes is defined as
d(x, y) = ( sum over k of |x_k - y_k|^r )^(1/r)
where r is a parameter.
The following are the three most common examples of Minkowski distance:
r = 1. City block (Manhattan, L1 norm) distance.
A common example is the Hamming distance, which is the number of bits that differ
between two objects that have only binary attributes, i.e. between two binary vectors.
r = 2. Euclidean distance (L2 norm).
r = infinity. Supremum (L_infinity or L_max norm) distance. This is the maximum difference between any
attribute of the objects, defined by
d(x, y) = max over k of |x_k - y_k|
If d(x,y) is the distance between two points, x and y, then the following properties hold
1) Positivity
d(x,y) >= 0 for all x and y
d(x,y)=0 only if x=y
2) Symmetry
d(x,y)=d(y,x) for all x and y.
3) Triangle inequality
d(x,z)<=d(x,y)+d(y,z) for all points x,y and z.
Measures that satisfy all three properties are known as metrics.
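The three special cases of the Minkowski distance can be computed with a few lines of Python (a sketch, not tied to any particular dataset):

```python
# Minkowski distance for r = 1 (city block), r = 2 (Euclidean) and
# r = infinity (supremum / L_max).
def minkowski(x, y, r):
    if r == float("inf"):                            # supremum distance
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

x, y = [1, 2, 3], [4, 0, 3]
print(minkowski(x, y, 1))              # city block: 3 + 2 + 0 = 5
print(minkowski(x, y, 2))              # Euclidean: sqrt(9 + 4 + 0), about 3.606
print(minkowski(x, y, float("inf")))   # supremum: 3
```

Each of these distances satisfies positivity, symmetry and the triangle inequality, so all three are metrics.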
As indicated by Figure 2.16, cosine similarity is really a measure of the angle between x and y.
Thus, if the cosine similarity is 1, the angle between x and y is 0 degrees, and x and y are the same except for
magnitude (length).
If the cosine similarity is 0, then the angle between x and y is 90 degrees and they do not share any terms.
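A short sketch of the cosine similarity calculation for two document vectors (the term-frequency vectors are made up):

```python
# cos(x, y) = (x . y) / (||x|| * ||y||)
import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

d1 = [3, 2, 0, 5]
print(cosine_similarity(d1, d1))            # 1.0: angle of 0 degrees
print(cosine_similarity(d1, [0, 0, 4, 0]))  # 0.0: no shared terms, 90 degrees
print(cosine_similarity(d1, [1, 0, 0, 0]))  # about 0.49: partial overlap
```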
Cluster Analysis
One does not know what classes or clusters exist (Table 4.2).
The problem to be solved is to group the given data into meaningful clusters.
The aim of cluster analysis is to find meaningful groups with
small within-group variations &
large between-group variation
Most of the algorithms developed are based on some concept of similarity or distance.
Drawbacks:
This process may be prohibitively expensive for large datasets.
The cost of computing distances between groups of objects grows as the number of attributes grows.
Computing distances between categorical attributes is more difficult (compared to
computing distances between objects with numeric attributes).
Partitional Method
The objects are divided into non-overlapping clusters (or partitions)
such that each object is in exactly one cluster (Figure 4.1a).
The method obtains a single-level partition of objects.
The analyst has
to specify the number of clusters (k) in advance, and
to specify the starting seeds of the clusters.
The analyst may have to use iterative approach in which he has to run the method many times
specifying different numbers of clusters & different starting seeds
then selecting the best solution
The method converges to a local minimum rather than the global minimum.
Figure 4.1a
Figure 4.1b
Hierarchical Methods
A set of nested clusters is organized as a hierarchical tree (Figure 4.1b).
The method either
starts with one cluster & then splits into smaller clusters (called divisive or top down) or
starts with each object in an individual cluster & then tries to merge similar clusters into
larger clusters(called agglomerative or bottom up)
Tentative clusters may be merged or split based on some criteria.
Density based Methods
A cluster is a dense region of points, which is separated by low-density regions, from other regions of
high density.
Typically, for each data-point in a cluster, at least a minimum number of points must exist within a
given radius.
The method can deal with arbitrary shape clusters (especially when noise & outliers are present).
Grid-based methods
The object-space rather than the data is divided into a grid.
This is based on characteristics of the data.
The method can deal with non-numeric data more easily.
The method is not affected by data-ordering.
Model-based Methods
A model is assumed, perhaps based on a probability distribution.
Essentially, the algorithm tries to build clusters with
a high level of similarity within them
a low level of similarity between them.
Similarity measurement is based on the mean values and the algorithm tries to minimize the squared
error function.
Figure 4.1c
Step 1 and 2:
Let the three seeds be the first three students as shown in Table 4.4.
Step 3 and 4:
Now compute the distances using the 4 attributes and using the sum of absolute differences for
simplicity. The distance values for all the objects are given in Table 4.5.
Step 5:
Table 4.6 compares the cluster means of clusters found in Table 4.5 with the original seeds.
Number of students in cluster 1 is again 2 and the other two clusters still have 4 students each.
A more careful look shows that the clusters have not changed at all. Therefore, the method has
converged rather quickly for this very simple dataset.
The cluster membership is as follows
Cluster C1= {S1, S9}
Cluster C2= {S2, S5, S6, S10}
Cluster C3= {S3, S4, S7, S8}
SCALING AND WEIGHTING
For clustering to be effective, all attributes should be converted to a similar scale.
There are a number of ways to transform the attributes.
One possibility is to transform the attributes
to a normalized score or
to a range(0,1)
Such transformations are called scaling.
Some other approaches are:
1) Divide each attribute by the mean value of that attribute.
This approach reduces the mean of each attribute to 1.
This does not control the variation; some values may still be large, others small.
2) Divide each attribute by the difference between largest-value and smallest-value.
This approach
decreases the mean of attributes that have a large range of values &
increases the mean of attributes that have small range of values.
3) Convert the attribute values to "standardized scores" by subtracting the mean of the
attribute from each attribute value and dividing it by the standard deviation.
Now, the mean & standard-deviation of each attribute will be 0 and 1 respectively.
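The three approaches can be compared on a single attribute with a few lines of Python (the attribute values are invented for illustration):

```python
# Scaling one numeric attribute three ways: divide by the mean, divide by the
# range, and convert to standardized (z) scores.
import statistics

values = [18, 21, 25, 30, 46]
mean  = statistics.mean(values)        # 28
stdev = statistics.pstdev(values)      # population standard deviation
rng   = max(values) - min(values)      # 46 - 18 = 28

by_mean  = [v / mean for v in values]            # mean of the result becomes 1
by_range = [v / rng for v in values]             # values compressed by the range
z_scores = [(v - mean) / stdev for v in values]  # mean 0, standard deviation 1

print(by_mean)
print(by_range)
print(z_scores)
```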
SUMMARY OF THE K MEANS METHOD
K means is an iterative improvement greedy method.
A number of iterations are normally needed for convergence and
therefore the dataset is processed a number of times.
The following are a number of issues related to the method (disadvantages):
1) The results of the method depend strongly on the initial guesses of the seeds. Need to
specify k, the number of clusters, in advance.
2) The method can be sensitive to outliers. If an outlier is picked as a starting seed, it may end up in a cluster of its own.
3) The method does not consider the size of the clusters.
4) The method does not deal with overlapping clusters.
5) Often, the local optimum is not as good as the global optimum.
6) The method implicitly assumes spherical probability distribution.
7) The method needs to compute Euclidean distances and means of the attribute values of
objects within a cluster. (i.e. Cannot be used with categorical data).
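To make the seed/assign/recompute loop concrete, here is a compact k-means sketch in Python. It works on one-dimensional data purely for illustration; the marks and seeds are made up and this is not the exact algorithm listing from the notes:

```python
# Minimal k-means: assign each point to the nearest centroid, then recompute
# each centroid as the mean of its cluster, and repeat.
def kmeans(points, seeds, iterations=10):
    centroids = list(seeds)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # an empty cluster keeps its old centroid
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

marks = [55, 60, 62, 70, 72, 75, 88, 90]
centroids, clusters = kmeans(marks, seeds=[55, 70, 90])
print(centroids)   # final cluster means
print(clusters)    # final cluster membership
```

Note how the result depends entirely on the seeds passed in, which is exactly disadvantage 1 above.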
Figure 4.1d
Figure 4.1e
2. Divisive approach: All the objects are put in a single cluster to start. The method then
repeatedly splits clusters, resulting in smaller and smaller clusters, until a stopping criterion is reached or each cluster has
only one object in it (Figure 4.1f).
Figure 4.1f
COMPLETE-LINK
The distance between two clusters is defined as the maximum of the distances between all pairs of
points (xi, yj), where xi belongs to one cluster and yj to the other, i.e. D(X, Y) = max d(xi, yj).
This is strongly biased towards compact clusters (Figure 4.3).
Disadvantages:
Each cluster may have an outlier and the 2 outliers may be very far away and so the
distance between the 2 clusters would be computed to be large.
If a cluster naturally has an elongated shape, say like a banana, then perhaps this method is not
appropriate.
CENTROID
The distance between two clusters is defined as the distance between the centroids of the clusters.
Usually, the squared Euclidean distance between the centroids is used, i.e. D(X, Y) = D(Ci, Cj).
Advantages:
This method is easy and generally works well.
This method is more tolerant of somewhat longer clusters than complete link algorithm.
Steps 1 and 2:
Allocate each point to a cluster and compute the distance-matrix using the centroid method.
The distance-matrix is symmetric, so we only show half of it in Table 4.11.
Table 4.11 gives the distance of each object with every other object.
Steps 3 and 4:
The smallest distance is 8, between objects S4 and S8. They are combined and removed and we put
the combined cluster (C1) where the object S4 was.
Table 4.12 is now the new distance-matrix. All distances except those with cluster C1 remain
unchanged.
Steps 5 and 6:
The smallest distance now is 15, between objects S5 and S6. They are combined into cluster C2, and S5
and S6 are removed.
Steps 3, 4, 5 and 6: Table 4.13 is the updated distance-matrix.
The result of using the agglomerative method could be something like that shown in Figure 4.6.
The largest distance is 115, between objects S8 and S9. They become the seeds of 2 new clusters.
K means is used to split the group into 2 clusters.
The distance-matrix of C1 is given in Table 4.17. The largest distance is 98, between S8 and S10. C1
can therefore be split with S8 and S10 as seeds.
The method continues like this until the stopping criterion is met.
SUMMARY OF HIERARCHICAL METHODS
Advantages
1) This method is conceptually simpler and can be implemented easily.
2) In some applications, only proximity-data is available and then this method may be better.
3) This method can provide clusters at different levels of granularity.
4) This method can provide more insight into data by showing a hierarchy of clusters (than a
flat cluster structure created by a partitioning method like the K-means method).
5) Do not have to assume any particular number of clusters.
Disadvantage
1) Do not scale well: time complexity of at least O(n^2), where n is the total number of objects.
2) The distance-matrix requires O(n^2) space and becomes very large for a large number of
objects.
3) Different distance metrics and scaling of data can significantly change the results.
4) Once a decision is made to combine two clusters, it cannot be undone.
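A hedged sketch of the agglomerative (bottom-up) approach using the complete-link distance described earlier; the one-dimensional data and the choice of stopping at three clusters are illustrative only:

```python
# Agglomerative clustering: start with singleton clusters and repeatedly merge
# the closest pair until the desired number of clusters remains.
def complete_link(c1, c2, dist):
    return max(dist(a, b) for a in c1 for b in c2)

def agglomerative(points, target_clusters, dist):
    clusters = [[p] for p in points]              # one object per cluster to start
    while len(clusters) > target_clusters:
        best = None                               # (distance, i, j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = complete_link(clusters[i], clusters[j], dist)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return clusters

marks = [55, 60, 62, 70, 72, 75, 88, 90]
print(agglomerative(marks, 3, dist=lambda a, b: abs(a - b)))
```

The doubly nested search over cluster pairs is what produces the O(n^2) (or worse) behaviour mentioned in the disadvantages above.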
EXERCISES
1) What is cluster analysis? What are its applications? (2)
2) Compare classification vs. cluster analysis. (6)
3) List out and explain desired features of cluster analysis method. (6)
4) List out and explain different types of data. (4)
5) List out and explain different distance measures. (4)
6) List out and explain different types of cluster analysis methods. (6)
7) Write algorithm for k-means method. (6)
8) Apply k-means method for clustering the data given in Table 4.3. (6)
9) List out disadvantages of k-means method. (6)
10) Explain scaling and weighting. (4)
11) Explain expectation maximization method. (4)
12) Compare agglomerative approach vs. divisive approach. (4)
13) Explain different methods used for computing distances between clusters. (6)
14) Write algorithm for agglomerative approach. (6)
15) Apply agglomerative technique for clustering data given in Table 4.10. (6)
16) Write algorithm for divisive approach. (6)
17) List out advantages and disadvantages of hierarchical methods. (6)
18) Explain DBSCAN with its algorithm. (6)
19) Explain K means method for large databases. (4)
20) Explain hierarchical method for large databases. (6)
21) Explain quality and validity of cluster analysis methods (6)
This type of mining can be performed either at the (intra-page) document level or at the (inter-page) hyperlink level.
This can be used to classify web-pages.
This can be used to generate information such as the similarity & relationship between different
web-sites.
PageRank
PageRank is a metric for ranking hypertext documents based on their quality.
The key idea is that a page has a high rank if it is pointed to by many highly ranked pages
The PageRank of a page A is given by
PR(A) = (1 - d)/n + d * sum over all pages q that link to A of PR(q)/OutDegree(q)
Here, n = number of nodes in the graph,
OutDegree(q) = number of hyperlinks on page q, and
d = damping factor, which can be set between 0 and 1 and is usually set to 0.85.
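A small power-iteration sketch of this formula (the four-page link graph is made up; d = 0.85 as stated above):

```python
# PageRank: PR(A) = (1 - d)/n + d * sum over pages q linking to A of PR(q)/OutDegree(q)
def pagerank(links, d=0.85, iterations=50):
    n = len(links)
    ranks = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        new_ranks = {}
        for page in links:
            incoming = sum(ranks[q] / len(links[q])
                           for q in links if page in links[q])
            new_ranks[page] = (1 - d) / n + d * incoming
        ranks = new_ranks
    return ranks

# adjacency list: page -> pages it links to (OutDegree = length of the list)
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(links))   # C ends up with the highest rank: most pages point to it
```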
Clustering & Determining Similar Pages
For determining the collection of similar pages, we need to define the similarity measure
between the pages. There are 2 basic similarity functions:
1) Co-citation
For a pair of nodes p and q, the co-citation is the number of nodes that point to both p
and q.
2) Bibliographic coupling
For a pair of nodes p and q, the bibliographic coupling is equal to the number of nodes
that have links from both p and q.
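Both similarity functions are easy to compute from an adjacency list. A small sketch (the six-node graph is invented for illustration):

```python
# page -> set of pages it links to
links = {
    "p": {"a", "b", "c"},
    "q": {"b", "c", "d"},
    "a": {"p", "q"},
    "b": {"p"},
    "c": set(),
    "d": set(),
}

def co_citation(p, q, links):
    # number of nodes that point to both p and q
    return sum(1 for out in links.values() if p in out and q in out)

def bibliographic_coupling(p, q, links):
    # number of nodes that both p and q point to
    return len(links[p] & links[q])

print(co_citation("p", "q", links))             # only "a" links to both: 1
print(bibliographic_coupling("p", "q", links))  # shared out-links {"b", "c"}: 2
```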
Social Network Analysis
This can be used to measure the relative standing or importance of individuals in a network.
The basic idea is that if a web-page has a link to another web-page, then the former is,
in some sense, endorsing the importance of the latter.
Links in the network may have different weights, corresponding to the strength of
endorsement.
LSI
LSI stands for Latent Semantic Indexing.
LSI is an indexing and retrieval method to identify the relationships between the
terms and concepts contained in an unstructured collection of text.
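A minimal LSI sketch using a truncated SVD with numpy; the tiny term-by-document matrix and the choice of k = 2 latent concepts are purely illustrative:

```python
# LSI: factor the term-document matrix A with an SVD and keep only the k
# largest singular values, giving a low-rank "concept space" representation.
import numpy as np

# rows = terms, columns = documents (term frequencies)
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 3, 1],
    [0, 0, 1, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation of A

doc_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T  # documents in the latent concept space
print(np.round(A_k, 2))
print(np.round(doc_vectors, 2))
```

Queries and documents compared in this reduced space can match on related terms even when they share no exact words, which is the point of the "latent" in LSI.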