1.3.1 A data warehouse can be built using a top-down approach, a bottom-up approach, or a
combination of both.
The top-down approach starts with the overall design and planning. It is
useful in cases where the technology is mature and well known, and where
the business problems that must be solved are clear and well understood.
The bottom-up approach starts with experiments and prototypes. It is useful in the
early stages of business modeling and technology development, allowing an organization
to move forward at considerably less expense and to evaluate the technology's benefits
before making significant commitments.
In the combined approach, an organization can exploit the planned and strategic nature
of the top-down approach while retaining the rapid implementation and opportunistic
application of the bottom-up approach.
There are three kinds of data warehouse applications: information processing, analytical
processing, and data mining.
Information processing supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts, or graphs. A current trend in data warehouse information processing is
to construct low-cost, web-based access tools that are then integrated with web browsers.
High quality of data in data warehouses: Most data mining tools need to work on
integrated, consistent, and cleaned data, which requires costly data cleaning, data
integration, and data transformation as preprocessing steps. A data warehouse constructed
by such preprocessing serves as a valuable source of high-quality data for OLAP as well
as for data mining. Notice that data mining may serve as a valuable tool for data cleaning
and data integration as well.
Online selection of data mining functions: Users may not always know the specific
kinds of knowledge they want to mine. By integrating OLAP with various data mining
functions, multidimensional data mining provides users with the flexibility to select
desired data mining functions and swap data mining tasks dynamically.
Data warehouses contain huge volumes of data. OLAP servers demand that decision support
queries be answered in the order of seconds. Therefore, it is crucial for data warehouse systems
to support highly efficient cube computation techniques, access methods, and query processing
techniques. In this section, we present an overview of methods for the efficient implementation
of data warehouse systems.
At the core of multidimensional data analysis is the efficient computation of aggregations across
many sets of dimensions. In SQL terms, these aggregations are referred to as group-by’s. Each
group-by can be represented by a cuboid, where the set of group-by’s forms a lattice of cuboids
defining a data cube.
What is the total number of cuboids, or group-by's, that can be computed for this data cube?
Taking the three attributes city, item, and year as the dimensions for the data cube, and sales in
dollars as the measure, the total number of cuboids, or group-by's, that can be computed for this
data cube is 2^3 = 8. The possible group-by's are the following: {(city, item, year), (city, item),
(city, year), (item, year), (city), (item), (year), ()}, where () means that the group-by is empty
(i.e., the dimensions are not grouped). These group-by's form a lattice of cuboids for the data
cube.
For a cube with n dimensions, there are a total of 2^n cuboids, including the base cuboid. A
statement such as compute cube sales_cube would explicitly instruct the system to compute the
sales aggregate cuboids for all eight subsets of the set {city, item, year}, including the empty
subset.
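The lattice enumeration itself is easy to check in code. A minimal Python sketch (assuming only the three dimension names from the example above) lists all 2^3 group-by's:

```python
from itertools import combinations

# Enumerate every group-by (cuboid) of the sales cube's dimensions.
dimensions = ["city", "item", "year"]

cuboids = [subset
           for k in range(len(dimensions) + 1)
           for subset in combinations(dimensions, k)]

for c in cuboids:
    print(c if c else "()")          # the empty tuple is the apex cuboid ()

print("total cuboids:", len(cuboids))   # 2**3 = 8
```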
Online analytical processing may need to access different cuboids for different queries.
Therefore, it may seem like a good idea to compute in advance all or at least some of the cuboids
in a data cube. Precomputation leads to fast response times and avoids some redundant
computation.
A major challenge related to this precomputation, however, is that the required storage space
may explode if all the cuboids in a data cube are precomputed, especially when the cube has
many dimensions. The storage requirements are even more excessive when many of the
dimensions have associated concept hierarchies, each with multiple levels. This problem is
referred to as the curse of dimensionality.
There are three choices for data cube materialization given a base cuboid:
1. No materialization: Do not precompute any of the "nonbase" cuboids. This leads to
computing expensive multidimensional aggregates on the fly, which can be extremely slow.
2. Full materialization: Precompute all of the cuboids. The resulting lattice of computed
cuboids is referred to as the full cube. This choice typically requires huge amounts of memory
space in order to store all of the precomputed cuboids.
3. Partial materialization: Selectively compute a proper subset of the whole set of possible
cuboids. Alternatively, we may compute a subset of the cube, which contains only those cells
that satisfy some user-specified criterion, such as where the tuple count of each cell is above
some threshold. We will use the term subcube to refer to the latter case, where only some of the
cells may be precomputed for various cuboids. Partial materialization represents an interesting
trade-off between storage space and response time.
The bitmap indexing method is popular in OLAP products because it allows quick searching in
data cubes. The bitmap index is an alternative representation of the record ID (RID) list. In the
bitmap index for a given attribute, there is a distinct bit vector, Bv, for each value v in the
attribute’s domain. If a given attribute’s domain consists of n values, then n bits are needed for
each entry in the bitmap index (i.e., there are n bit vectors). If the attribute has the value v for a
given row in the data table, then the bit representing that value is set to 1 in the corresponding
row of the bitmap index. All other bits for that row are set to 0.
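A minimal sketch of this idea in Python (the table contents are hypothetical illustration data, not from the text):

```python
# Build a bitmap index for one attribute of a small table: one bit
# vector per distinct value v in the attribute's domain.
rows = ["Vancouver", "Chicago", "Vancouver", "Toronto", "Chicago"]

bitmap = {v: [1 if r == v else 0 for r in rows] for v in set(rows)}

for value, bits in sorted(bitmap.items()):
    print(value, bits)

# A predicate such as city = 'Chicago' now reduces to scanning one bit
# vector, and combined predicates become fast bitwise AND/OR operations.
```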
The join indexing method gained popularity from its use in relational database query
processing. Traditional indexing maps the value in a given column to a list of rows having that
value. In contrast, join indexing registers the joinable rows of two relations from a relational
database.
In data warehouses, a join index relates the values of the dimensions of a star schema to
rows in the fact table.
For example, given a fact table Sales and two dimensions, city and product, a join index
on city maintains, for each distinct city, a list of RIDs of the tuples recording the sales
in that city.
Join indices can span multiple dimensions.
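As a rough sketch (with a hypothetical three-row fact table), a join index on city can be pictured as a mapping from each distinct city value to the RIDs of the matching fact-table tuples:

```python
# Hypothetical fact table; each tuple carries a record ID (RID).
fact_table = [
    {"rid": "R1", "city": "Vancouver", "sales": 120},
    {"rid": "R2", "city": "Chicago",   "sales": 300},
    {"rid": "R3", "city": "Vancouver", "sales": 250},
]

# Join index on city: distinct city value -> list of fact-table RIDs.
join_index = {}
for row in fact_table:
    join_index.setdefault(row["city"], []).append(row["rid"])

print(join_index)   # {'Vancouver': ['R1', 'R3'], 'Chicago': ['R2']}
```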
ROLAP works directly with relational databases. The base data and
the dimension tables are stored as relational tables and new tables
are created to hold the aggregated information. It depends on a
specialized schema design.
This methodology relies on manipulating the data stored in the
relational database to give the appearance of traditional OLAP's
slicing and dicing functionality. In essence, each action of slicing
and dicing is equivalent to adding a "WHERE" clause in the SQL
statement.
ROLAP tools do not use pre-calculated data cubes but instead pose
the query to the standard relational database and its tables in order
to bring back the data required to answer the question.
ROLAP tools feature the ability to ask any question because the
methodology is not limited to the contents of a cube. ROLAP also
has the ability to drill down to the lowest level of detail in the
database.
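A minimal sketch of the WHERE-clause equivalence (with hypothetical rows and column names): a slice on year is simply a row filter over the base table.

```python
# Hypothetical base-table rows of the sales relation.
sales = [
    {"city": "Chicago", "item": "phone", "year": 2004, "dollars": 500},
    {"city": "Toronto", "item": "phone", "year": 2005, "dollars": 700},
]

# Slicing on year = 2004 corresponds to adding WHERE year = 2004
# to the SQL statement posed against the relational tables.
slice_2004 = [r for r in sales if r["year"] == 2004]
print(slice_2004)
```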
MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP. It stores
the data in an optimized multidimensional array rather than in a relational database.
MOLAP tools have a very fast response time and the ability to
quickly write back data into the data set.
Question Bank
Module-2
Commercial Viewpoint:
Lots of data is being collected and warehoused
Targeted marketing:
Preprocessing
The input data can be stored in a variety of formats (flat files, spreadsheets, or relational
tables) and may reside in a centralized data repository or be distributed across multiple
sites.
The purpose of preprocessing is to transform the raw input data into an appropriate
format for subsequent analysis.
Data preprocessing includes fusing data from multiple sources, cleaning data to remove
noise and duplicate observations, and selecting the records and features that are relevant to
the data mining task at hand.
Postprocessing:
Ensures that only valid and useful results are incorporated into the system.
An example is visualization, which allows analysts to explore the data and the data mining
results from a variety of viewpoints.
Statistical measures and hypothesis-testing methods can also be applied during postprocessing
to eliminate spurious data mining results.
Motivating Challenges
The following are some of the specific challenges that motivated the development of data
mining.
Scalability :
Because of advances in data generation and collection, data sets with sizes of gigabytes,
terabytes, or even petabytes are becoming common.
If data mining algorithms are to handle these massive data sets, then they must be scalable.
Scalability may also require the implementation of novel data structures to access individual
records in an efficient manner.
High Dimensionality:
It is now common to encounter data sets with hundreds or thousands of attributes.
Traditional data analysis techniques that were developed for low-dimensional data often do not
work well for such high-dimensional data.
Heterogeneous and Complex Data:
Recent years have also seen the emergence of more complex data objects, such as
• Collections of Web pages containing semi-structured text and hyperlinks;
• DNA data with sequential and three-dimensional structure;
• climate data that consists of time series measurements (temperature, pressure,
etc.) at various locations on the Earth's surface.
Techniques developed for mining such complex objects should take into consideration
relationships in the data, such as temporal and spatial autocorrelation, graph connectivity, and
parent-child relationships between the elements in semi-structured text.
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation/Anomaly Detection [Predictive]
Classification: Definition
Given a collection of records (the training set):
Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as
possible.
A test set is used to determine the accuracy of the model. Usually, the given data
set is divided into training and test sets, with training set used to build the model
and test set used to validate it.
Classification: Application 1
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone
product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided otherwise. This {buy,
don’t buy} decision forms the class attribute.
Classification: Application 2
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and information about the account holder as
attributes (when the customer buys, what the customer buys, how often the
customer pays on time, etc.).
Label past transactions as fraud or fair transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions on an account.
Classification: Application 3
Customer Attrition/Churn:
Goal: To predict whether a customer is likely to be lost to a competitor.
Approach:
Use detailed records of transactions with each of the past and present
customers to find attributes.
How often the customer calls, where he calls, what time of day
he calls most, his financial status, marital status, etc.
Label the customers as loyal or disloyal.
Find a model for loyalty.
Clustering: Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them,
find clusters such that
Data points in one cluster are more similar to one another.
Data points in separate clusters are less similar to one another.
Similarity Measures:
• Euclidean Distance if attributes are continuous.
Clustering: Application 1
Market Segmentation:
Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably
be selected as a market target to be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on their geographical and lifestyle related
information.
Find clusters of similar customers.
Measure the clustering quality by observing buying patterns of customers in same cluster
vs. those from different clusters.
Clustering: Application 2
Document Clustering
Goal: To find groups of documents that are similar to each other based on the important terms
appearing in them.
Approach:
To identify frequently occurring terms in each document. Form a similarity measure
based on the frequencies of different terms. Use it to cluster.
Gain: Information retrieval can utilize the clusters to relate a new document or search term to
clustered documents.
Association Rule Discovery: Definition
Given a set of records, each of which contains some number of items from a given collection,
produce dependency rules that will predict the occurrence of an item based on occurrences of
other items.
TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
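As a minimal sketch, the support and confidence of the two discovered rules can be computed directly from these five transactions (the helper functions are illustrative, not from the text):

```python
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    # Fraction of transactions containing every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(confidence({"Milk"}, {"Coke"}))            # 3/4 = 0.75
print(confidence({"Diaper", "Milk"}, {"Beer"}))  # 2/3 = 0.67
```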
Problems: Discuss whether or not each of the following activities is a data mining task.
(a) Dividing the customers of a company according to their gender.
No. This is a simple database query.
(b) Dividing the customers of a company according to their profitability.
No. This is an accounting calculation, followed by the application of a threshold. However,
predicting the profitability of a new customer would be data mining.
(c) Computing the total sales of a company.
No. Again, this is simple accounting.
(d) Sorting a student database based on student identification numbers.
No. Again, this is a simple database query.
(e) Predicting the outcomes of tossing a (fair) pair of dice.
No. Since the dice are fair, this is a probability calculation. If the dice were not fair, and we
needed to estimate the probabilities of each outcome from data, then this is more like the
problems considered by data mining. However, in this specific case, solutions to this problem
were developed by mathematicians a long time ago, and thus, we wouldn't consider it to be
data mining.
(f) Predicting the future stock price of a company using historical records.
Yes. We would attempt to create a model that can predict the continuous value of the stock price.
This is an example of the area of data mining known as predictive modeling. We could use
regression for this modeling, although researchers in many fields have developed a wide variety
of techniques for predicting time series.
(g) Monitoring the heart rate of a patient for abnormalities.
Yes. We would build a model of the normal behavior of heart rate and raise an alarm when an
unusual heart behavior occurred. This would involve the area of data mining known as anomaly
detection. This could also be considered as a classification problem if we had examples of both
normal and abnormal heart behavior.
(h) Monitoring seismic waves for earthquake activities.
Yes. In this case, we would build a model of different types of seismic wave behavior associated
with earthquake activities and raise an alarm when one of these different types of seismic activity
was observed. This is an example of the area of data mining known as classification.
(i) Extracting the frequencies of a sound wave.
No. This is signal processing.
What is Data?
Collection of data objects and their attributes
An attribute is a property or characteristic of an object
o Examples: eye color of a person, temperature, etc.
o Attribute is also known as variable, field, characteristic, or feature
A collection of attributes describes an object
o Object is also known as record, point, case, sample, entity, or instance
Attribute Values
Attribute values are numbers or symbols assigned to an attribute
Distinction between attributes and attribute values
o Example: height can be measured in feet or meters
Different attributes can be mapped to the same set of values
o Example: Attribute values for ID and age are integers
But properties of attribute values can be different
o ID has no limit but age has a maximum and minimum value
Types of Attributes
There are different types of attributes
Nominal
o Examples: ID numbers, eye color, zip codes
Ordinal
o Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades,
height in {tall, medium, short}
Interval
o Examples: calendar dates, temperatures in Celsius or Fahrenheit.
Ratio
o Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
Distinctness: = and ≠
Order: <, ≤, >, ≥
Addition: + and −
Multiplication: × and /
Nominal attributes possess distinctness; ordinal attributes additionally possess order;
interval attributes additionally possess meaningful differences (addition); and ratio
attributes possess all four properties.
Ordered Data
For some types of data, the attributes have relationships that involve order in time or space
Different types of ordered data are
Sequential Data: Also referred to as temporal data, can be thought of as an extension of record
data, where each record has a time associated with it.
Example: A retail transaction data set that also stores the time at which each transaction took place.
Sequence Data : Sequence data consists of a data set that is a sequence of individual entities,
such as a sequence of words or letters. It is quite similar to sequential data, except that there are
no time stamps; instead, there are positions in an ordered sequence.
Example: the genetic information of plants and animals can be represented in the form of
sequences of nucleotides that are known as genes.
Time Series Data : Time series data is a special type of sequential data in which each record is a
time series, i.e., a series of measurements taken over time.
Example: A financial data set might contain objects that are time series of the daily prices of
various stocks.
Example: Consider a time series of the average monthly temperature for a city during the years
1982 to 1994.
Spatial Data : Some objects have spatial attributes, such as positions or areas, as well as other
types of attributes.
Example: Weather data (precipitation, temperature, pressure) that is collected for a variety of
geographical locations.
Sparsity:
For some data sets, most attributes of an object have values of 0; in many cases fewer
than 1% of the entries are non-zero.
In practical terms, sparsity is an advantage because usually only the non-zero values need
to be stored and manipulated.
This results in significant savings with respect to computation time and storage.
Resolution:
It is frequently possible to obtain data at different levels of resolution, and often the
properties of the data are different at different resolutions.
For instance, the surface of the Earth seems very uneven at a resolution of meters, but is
relatively smooth at a resolution of tens of kilometers.
The patterns in the data also depend on the level of resolution.
If the resolution is too fine, a pattern may not be visible or may be buried in noise; if the
resolution is too coarse, the pattern may disappear.
Data Quality:
Data mining applications are often applied to data that was collected for another
purpose, or for future but unspecified applications.
For that reason, data mining cannot usually take advantage of the significant benefits of
"addressing quality issues at the source."
Data mining therefore focuses on (1) the detection and correction of data quality problems
(called data cleaning) and (2) the use of algorithms that can tolerate poor data quality.
Examples of data quality problems:
Missing Values:
Reasons for missing values
• Information is not collected (e.g., people decline to give their age and weight)
• Attributes may not be applicable to all cases (e.g., annual income is not applicable
to children)
Duplicate Data:
A data set may include data objects that are duplicates, or almost duplicates, of one another
Major issue when merging data from heterogeneous sources
Examples:
Same person with multiple email addresses
Data cleaning
Process of dealing with duplicate data issues
Data Preprocessing
Preprocessing steps should be applied to make the data more suitable for data mining
The most important ideas and approaches are
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Aggregation
Combining two or more attributes (or objects) into a single attribute (or object)
Purpose:
o Data reduction
o Reduce the number of attributes or objects
o Change of scale
Cities aggregated into regions, states, countries, etc.
o More "stable" data: aggregated data tends to have less variability
Statisticians sample because obtaining the entire set of data of interest is too expensive or
time consuming.
Sampling is used in data mining because processing the entire set of data of interest is too
expensive or time consuming.
The key principle for effective sampling is the following:
Using a sample will work almost as well as using the entire data set if the
sample is representative.
A sample is representative if it has approximately the same property (of
interest) as the original data set.
Types of Sampling
Simple Random Sampling
There is an equal probability of selecting any particular item
Sampling without replacement
As each item is selected, it is removed from the population
Sampling with replacement
Objects are not removed from the population as they are selected for the sample.
In sampling with replacement, the same object can be picked up more than once
Stratified sampling
Split the data into several partitions; then draw random samples from each
partition
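A minimal Python sketch of the three schemes, on a hypothetical population of 100 items:

```python
import random

population = list(range(100))

# Sampling without replacement: each item can be selected at most once.
without_replacement = random.sample(population, 10)

# Sampling with replacement: the same item may be picked more than once.
with_replacement = random.choices(population, k=10)

# Stratified sampling: partition the data, then sample from each partition.
strata = {"low": range(0, 50), "high": range(50, 100)}
stratified = [x for group in strata.values()
              for x in random.sample(list(group), 5)]

print(without_replacement, with_replacement, stratified, sep="\n")
```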
Dimensionality Reduction:
Purpose:
o Avoid curse of dimensionality
o Reduce amount of time and memory required by data mining algorithms
Feature Subset Selection
Another way to reduce dimensionality is to use only a subset of the features; this works
well when redundant or irrelevant features are present.
Redundant features
duplicate much or all of the information contained in one or more other
attributes
Example: purchase price of a product and the amount of sales tax paid
Irrelevant features
contain no information that is useful for the data mining task at hand
Example: students' ID numbers are often irrelevant to the task of predicting students'
GPA
Feature Creation
Create new attributes that can capture the important information in a data set much more
efficiently than the original attributes
Three general methodologies:
Feature Extraction
domain-specific
Mapping Data to New Space
Feature Construction
combining features
Attribute Transformation
A function that maps the entire set of values of a given attribute to a new set of
replacement values such that each old value can be identified with one of the new values.
Simple functional examples are x^k, log x, e^x, and |x|; standardization and normalization
are another common type.
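A minimal sketch of two common transformations, min-max normalization and z-score standardization, on hypothetical values:

```python
from statistics import mean, stdev

values = [12.0, 15.0, 20.0, 18.0, 30.0]

# Min-max normalization: map the values onto the new range [0, 1].
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Standardization (z-scores): resulting values have mean 0 and stdev 1.
mu, sigma = mean(values), stdev(values)
z_scores = [(v - mu) / sigma for v in values]

print(min_max)
print(z_scores)
```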
Similarity and Dissimilarity
Similarity between two objects is a numerical measure of how alike the two data objects are.
Similarities are higher for pairs of objects that are more alike.
Similarities are usually non-negative and are often between 0 (no similarity) and 1
(complete similarity).
Dissimilarity between two objects is a numerical measure of how different the two data
objects are.
Dissimilarities are lower for more similar pairs of objects.
The minimum dissimilarity is often 0; the upper limit varies.
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects
Euclidean Distance:
d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}
where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth
attributes (components) of data objects x and y.
Example:
Distance Matrix
Minkowski Distance
The Minkowski distance is a generalization of the Euclidean distance:
d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}
where r is a parameter, n is the number of dimensions (attributes), and x_k and y_k are,
respectively, the kth attributes (components) of data objects x and y.
The following are the three most common examples of Minkowski distances.
1) r = 1. City block (Manhattan, taxicab, L1 norm) distance. A common example of this is the
Hamming distance, which is just the number of bits that are different between two binary
vectors.
2) r = 2. Euclidean (L2 norm) distance.
3) r → ∞. Supremum (Lmax norm, L∞ norm) distance. This is the maximum difference between
any attribute of the objects.
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1    p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1  p2  p3  p4
p1    0   2   3   5
p2    2   0   1   3
p3    3   1   0   2
p4    5   3   2   0
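The three matrices above can be reproduced with a short Minkowski-distance sketch:

```python
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def minkowski(x, y, r):
    # r -> infinity gives the supremum (Lmax) distance.
    if r == float("inf"):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

for r in (1, 2, float("inf")):
    print("r =", r)
    for name, x in points.items():
        row = [round(minkowski(x, y, r), 3) for y in points.values()]
        print(name, row)
```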
If d(x, y) is the distance between two points x and y, then the following properties hold:
1. Positivity: d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 only if x = y.
2. Symmetry: d(x, y) = d(y, x) for all x and y.
3. Triangle Inequality: d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.
If s(x, y) is the similarity between points x and y, then the typical properties of
similarities are:
1. s(x, y) = 1 only if x = y.
2. s(x, y) = s(y, x) for all x and y (symmetry).
Similarity Measures for Binary Data
Let x and y be two objects that consist of n binary attributes. The comparison
of two such objects, i.e., two binary vectors, leads to the following four
quantities (frequencies):
f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1
Simple Matching Coefficient: SMC = (f11 + f00) / (f00 + f01 + f10 + f11)
Jaccard Coefficient: J = f11 / (f01 + f10 + f11)
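A minimal sketch computing SMC and Jaccard from the four frequencies, using the two binary vectors of Problem 2 below:

```python
x = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1]
y = [0, 1, 0, 0, 0, 1, 1, 0, 0, 0]

# Count the four frequencies over the n binary attributes.
f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))
f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))

smc = (f11 + f00) / (f00 + f01 + f10 + f11)   # 7/10 = 0.7
jaccard = f11 / (f01 + f10 + f11)             # 2/5  = 0.4
print(smc, jaccard)
```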
Cosine Similarity
If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)
where · indicates the vector dot product and ||d|| is the length (Euclidean norm) of vector d.
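A minimal sketch of the computation, using the term-frequency vectors from Question 21 later in this question bank:

```python
from math import sqrt

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(d1, d2))   # d1 . d2 = 5
norm1 = sqrt(sum(a * a for a in d1))       # ||d1|| = sqrt(42)
norm2 = sqrt(sum(b * b for b in d2))       # ||d2|| = sqrt(6)

print(dot / (norm1 * norm2))               # ~0.315
```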
Correlation
The correlation between two data objects that have binary or continuous variables
is a measure of the linear relationship between the attributes of the objects.
Pearson's correlation coefficient is defined as
corr(x, y) = covariance(x, y) / (stddev(x) × stddev(y))
and always lies between −1 and 1.
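A minimal sketch of the definition on hypothetical vectors (here x = −3y, so the correlation is exactly −1):

```python
from statistics import mean, stdev

x = [-3, 6, 0, 3, -6]
y = [1, -2, 0, -1, 2]

# Pearson correlation: covariance divided by the product of the
# standard deviations (sample versions, hence the n - 1 divisor).
mx, my = mean(x), mean(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
corr = cov / (stdev(x) * stdev(y))

print(corr)   # -1.0: a perfect negative linear relationship
```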
Problems:
1)Classify the following attributes as binary, discrete, or continuous. Also classify them as
qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more
than one interpretation, so briefly indicate your reasoning if you think there may be some
ambiguity.
(d) Angles as measured in degrees between 0° and 360°. Continuous, quantitative, ratio
(e) Bronze, Silver, and Gold medals as awarded at the Olympics. Discrete, qualitative, ordinal
(f) Height above sea level. Continuous, quantitative, interval/ratio (depends on whether sea level
is regarded as an arbitrary origin)
(h) ISBN numbers for books. (Look up the format on the Web.) Discrete, qualitative, nominal
(ISBN numbers do have order information, though)
(i) Ability to pass light in terms of the following values: opaque, translucent, transparent.
Discrete, qualitative, ordinal
(k) Distance from the center of campus. Continuous, quantitative, interval/ratio (depends)
(m) Coat check number. (When you attend an event, you can often give your coat to someone
who, in turn, gives you a number that you can use to claim your coat when you leave.) Discrete,
qualitative, nominal
2) Compute the Hamming distance and the Jaccard similarity between the following two binary
vectors:
x = 0101010001
y = 0100011000
Hamming distance = number of bits that differ = 3
Jaccard similarity = number of 1-1 matches / (number of bits − number of 0-0 matches) = 2 / 5 = 0.4
4) For the following vectors, x and y, calculate the indicated similarity or distance
measures.
(b) x = (0, 1, 0, 1), y = (1, 0, 1, 0): cosine, correlation, Euclidean, Jaccard
(e) x = (2, −7, 0, 2, 0, −3), y = (−1, 1, −1, 0, 0, −1): cosine, correlation
5. What is data mining? Explain the various data mining tasks with examples.
6. What are data and data attributes? Explain the types and properties of attributes.
11. What is data quality? What are the dimensions that assess data quality?
13. What is sampling? Explain simple random sampling vs. stratified sampling vs. progressive
sampling.
16. What are similarity and dissimilarity? Explain similarity and dissimilarity measures between
simple attributes.
17. Discuss the measures of proximity between objects that involve multiple attributes.
18. Explain the cosine similarity for calculating the similarity of two documents with an example.
19. Consider the following vectors. Find: a) Simple Matching Coefficient b) Jaccard Coefficient
c) Hamming Distance
20. Distinguish between:
21. For the following vectors, find: a) Cosine Similarity b) Correlation c) Jaccard Similarity
X: (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)  Y: (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
23. Discuss whether or not each of the following activities is a data mining task.
(f) Predicting the future stock price of a company using historical records.
24. Classify the following attributes as binary, discrete, or continuous. Also classify them as
qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more
than one interpretation, so briefly indicate your reasoning if you think there may be some
ambiguity.
(h) ISBN numbers for books. (Look up the format on the Web.)
(i) Ability to pass light in terms of the following values: opaque, translucent, transparent.