UNIT-2
Knowledge Discovery from Data (KDD) is closely related to data mining: some people treat data mining as a synonym for KDD, while others view data mining as an essential step in the process of knowledge discovery. Here is the list of steps involved in the knowledge discovery process −
• Data Cleaning − In this step, noise and inconsistent data are removed.
• Data Integration − In this step, multiple data sources are combined.
• Data Selection − In this step, data relevant to the analysis task are retrieved
from the database.
• Data Transformation − In this step, data is transformed or consolidated
into forms appropriate for mining by performing summary or aggregation
operations.
• Data Mining − In this step, intelligent methods are applied in order to
extract data patterns.
• Pattern Evaluation − In this step, data patterns are evaluated.
• Knowledge Presentation − In this step, knowledge is represented.
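The steps above can be illustrated with a small, hypothetical pipeline. The sketch below is not part of the original notes; the file names (sales.csv, customers.csv), the column names, and the use of k-means clustering are assumptions made only to show roughly how each KDD step might look in Python with pandas and scikit-learn.

```python
# A minimal sketch of the KDD steps; file names, columns, and the choice of
# k-means are illustrative assumptions, not part of the original notes.
import pandas as pd
from sklearn.cluster import KMeans

# Data Cleaning: remove noise (duplicates) and inconsistent (missing) values
sales = pd.read_csv("sales.csv").drop_duplicates().dropna()

# Data Integration: combine multiple data sources
customers = pd.read_csv("customers.csv")
data = sales.merge(customers, on="customer_id")

# Data Selection: keep only the attributes relevant to the analysis task
data = data[["customer_id", "age", "amount"]]

# Data Transformation: consolidate by aggregation (total spend per customer)
summary = data.groupby("customer_id").agg(age=("age", "first"),
                                          total_spend=("amount", "sum"))

# Data Mining: apply an intelligent method (here, clustering) to extract patterns
labels = KMeans(n_clusters=3, n_init=10).fit_predict(summary)

# Pattern Evaluation / Knowledge Presentation: inspect and report the clusters
summary["cluster"] = labels
print(summary.groupby("cluster").mean())
```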
1. Statistics:
It uses mathematical analysis to represent, model, and summarize empirical data or real-world observations.
Statistical analysis involves a collection of methods that can be applied to large amounts of data to draw conclusions and report trends.
2. Machine learning
Arthur Samuel defined machine learning as a field of study that gives computers the ability to learn without being explicitly programmed.
When new data are entered into the computer, machine learning algorithms allow the learned models to grow or change accordingly.
On the basis of the kind of data to be mined, there are two categories of
functions involved in Data Mining −
a) Descriptive
b) Classification and Prediction
a) Descriptive Function
The descriptive function deals with the general properties of data in the database.
Here is the list of descriptive functions −
1. Class/Concept Description
2. Mining of Frequent Patterns
3. Mining of Associations
4. Mining of Correlations
5. Mining of Clusters
1. Class/Concept Description
Class/Concept refers to the data to be associated with classes or concepts. For example, in a company, the classes of items for sale include computers and printers, and the concepts of customers include big spenders and budget spenders. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived by data characterization and data discrimination.
4. Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two item sets, to analyze whether they have a positive, negative, or no effect on each other.
5. Mining of Clusters
Cluster refers to a group of similar objects. Cluster analysis refers to forming groups of objects that are very similar to each other but highly different from the objects in other clusters.
b) Classification and Prediction
Classification is the process of finding a model that describes the data
classes or concepts. The purpose is to be able to use this model to predict the
class of objects whose class label is unknown. This derived model is based
on the analysis of sets of training data. The derived model can be presented in forms such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks.
6. Outlier Analysis − Outliers may be defined as the data objects that do not
comply with the general behavior or model of the data available.
Flat Files
• Flat files are defined as data files in text form or binary
form with a structure that can be easily extracted by
data mining algorithms.
• Data stored in flat files have no relationships or paths among themselves; for example, if a relational database is stored in flat files, there will be no relations between the tables.
• Flat files are described by a data dictionary. Eg: CSV file.
• Application: used in data warehousing to store data, used in carrying data to and from servers, etc.
Relational Databases
• A Relational database is defined as the collection of
data organized in tables with rows and columns.
• Physical schema in Relational databases is a schema
which defines the structure of tables.
• Logical schema in Relational databases is a schema
which defines the relationship among tables.
• Standard API of relational database is SQL.
• Application: Data Mining, ROLAP model, etc.
Data Warehouse
• A data warehouse is defined as a collection of data integrated from multiple sources that supports querying and decision making.
• There are three types of data warehouse: Enterprise Data Warehouse, Data Mart, and Virtual Warehouse.
1. Choice of Distance Metric: The choice of distance metric significantly impacts the
results of proximity calculations. Common distance metrics include Euclidean
distance, Manhattan distance, and cosine similarity. However, selecting an
inappropriate metric for the given data can lead to inaccurate proximity measures.
Different choices can lead to different proximity measures and ultimately affect the
analysis outcomes.
9. Boundary Effects: Proximity calculations near the boundaries of the dataset can be
problematic. Depending on the method used, the proximity of points near the edges of
the dataset may be underestimated, leading to biased results.
10. Temporal Dynamics: In dynamic datasets where objects or entities change over time,
maintaining accurate proximity calculations requires accounting for temporal
dynamics. Failing to consider temporal changes can lead to outdated or irrelevant
proximity measures.
1. Feature Construction:
o Interaction Features: Creating new features by combining existing features to
capture interactions or relationships between them.
o Derived Features: Generating derived features based on domain knowledge or
insights about the data. For example, calculating ratios, differences, or averages
of numerical variables.
o Temporal Features: Extracting time-related features such as day of the week,
month, season, or time since a specific event occurred.
2. Feature Aggregation:
o Group Statistics: Calculating summary statistics (e.g., mean, median, standard
deviation) of numerical features within groups defined by categorical variables.
o Temporal Aggregations: Aggregating temporal data into higher-level intervals
(e.g., hourly, daily, weekly) and computing statistics within each interval.
3. Feature Encoding:
o Target Encoding: Encoding categorical variables based on the target variable's
mean or frequency within each category.
o Frequency Encoding: Encoding categorical variables based on their frequency
or occurrence in the dataset.
o Binary Encoding: Converting categorical variables into binary representations
using techniques like one-hot encoding or binary hashing.
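As a rough illustration of these feature engineering steps, the following Python/pandas sketch applies feature construction, group aggregation, and frequency, target, and one-hot encoding to a small made-up sales table; the column names and values are assumptions for the example only.

```python
# A small illustrative sketch of feature construction, aggregation, and
# encoding in pandas. The DataFrame and its columns are invented for the example.
import pandas as pd

df = pd.DataFrame({
    "price":     [10.0, 25.0, 8.0, 25.0],
    "quantity":  [2, 1, 5, 3],
    "city":      ["Hyd", "Vij", "Hyd", "Hyd"],
    "sale_date": pd.to_datetime(["2024-01-05", "2024-01-06",
                                 "2024-02-10", "2024-02-11"]),
    "sold":      [1, 0, 1, 1],
})

# 1. Feature construction: interaction and temporal features
df["revenue"] = df["price"] * df["quantity"]          # interaction feature
df["day_of_week"] = df["sale_date"].dt.day_name()     # temporal feature
df["month"] = df["sale_date"].dt.month

# 2. Feature aggregation: group statistics within a categorical variable
df["city_mean_price"] = df.groupby("city")["price"].transform("mean")

# 3. Feature encoding
df["city_freq"] = df["city"].map(df["city"].value_counts(normalize=True))  # frequency encoding
df["city_target_mean"] = df.groupby("city")["sold"].transform("mean")      # target encoding
df = pd.get_dummies(df, columns=["city"])                                    # one-hot (binary) encoding

print(df.head())
```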
Attribute: Colors
Values: Black, Green, Brown, Red
2. Binary Attributes: Binary data has only two values/states, for example, yes or no, affected or unaffected, true or false.
i) Symmetric: Both values are equally important (Gender).
ii) Asymmetric: Both values are not equally important (Result).
An interval-scaled attribute is measured on a scale of equal-sized units, but it has no true reference point (zero point). Data on an interval scale can be added and subtracted but cannot be multiplied or divided. Consider temperature in degrees Centigrade: if the temperature of one day is twice that of another day, we cannot say that one day is twice as hot as the other.
i. A ratio-scaled attribute is a numeric attribute with a fixed zero-point. If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. The values are ordered, and we can also compute the difference between values; the mean, median, mode, quantile range, and five-number summary can be given.
5. Discrete: Discrete data have finite values; they can be numerical or categorical. These attributes have a finite or countably infinite set of values.
Example
Attribute: Profession
Values: Teacher, Businessman, Peon
Indeed, the difficulties associated with analyzing high-dimensional data are sometimes referred to as the curse of dimensionality. Because of this, an important motivation in preprocessing the data is dimensionality reduction.
Sparsity For some data sets, such as those with asymmetric features, most attributes of an object have values of 0; in many cases, fewer than 1% of the entries are non-zero.
• Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1].
• Dissimilarity measure
– Numerical measure of how different two data objects are.
– Lower when objects are more alike.
– Minimum dissimilarity is often 0.
– Upper limit varies.
• Proximity refers to a similarity or dissimilarity.
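As a small illustration of these proximity measures, the sketch below computes two dissimilarity measures (Euclidean and Manhattan distance) and one similarity measure (cosine similarity) for a pair of made-up data objects; the vectors themselves are arbitrary.

```python
# A minimal sketch (not from the notes) showing how proximity between two
# data objects can be quantified with common measures.
import math

x = [3.0, 4.0, 0.0]
y = [1.0, 2.0, 2.0]

# Dissimilarity: Euclidean and Manhattan distances (0 means identical objects)
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
manhattan = sum(abs(a - b) for a, b in zip(x, y))

# Similarity: cosine similarity; for non-negative data it falls in [0, 1]
dot = sum(a * b for a, b in zip(x, y))
norm_x = math.sqrt(sum(a * a for a in x))
norm_y = math.sqrt(sum(b * b for b in y))
cosine = dot / (norm_x * norm_y)

print(euclidean, manhattan, cosine)
```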
1. Sampling:
Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random sample (or subset) of the data. Suppose that a large data set D contains N tuples. A Simple Random Sample Without Replacement (SRSWOR) of size n is created by drawing n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely to be sampled.
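A brief sketch of SRSWOR in Python is shown below; the data set D is an invented list of tuples, and random.sample is used because it draws n distinct tuples with equal probability, i.e., without replacement.

```python
# Simple random sampling without replacement (SRSWOR): each of the N tuples
# in D has the same chance of being drawn, and no tuple is drawn twice.
# The data set D here is a made-up list of tuples.
import random

D = [("t%d" % i, random.randint(18, 65)) for i in range(1, 1001)]  # N = 1000 tuples
N = len(D)
n = 50                               # desired sample size, n < N

sample = random.sample(D, n)         # SRSWOR: draws n distinct tuples uniformly
print(len(sample), sample[:3])
```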
Record Data:
Much data mining work assumes that the data set is a collection of records (data objects),
each of which consists of a fixed set of data fields (attributes). For the most basic form of
record data, there is no explicit relationship among records or data fields, and every
record (object) has the same set of attributes. Record data is usually stored either in flat
files or in relational databases.
Transaction Data A transaction is a record that involves a set of items; for example, the set of products purchased by a customer during one shopping trip are the items. This type of data is called market basket data because the items in each record are the products in a person’s “market basket.”
The Data Matrix If the data objects in a collection of data all have the
same fixed set of numeric attributes, then the data objects can be thought of as
points (vectors) in a multidimensional space, where each dimension represents a
distinct attribute describing the object. A set of such data objects can be
interpreted as an m by n matrix, where there are m rows, one for each object,
and n columns, one for each attribute. This matrix is called a data matrix or a
pattern matrix.
The Sparse Data Matrix A sparse data matrix is a special case of a data matrix
in which the attributes are of the same type and are asymmetric; i.e., only non-
zero values are important. Transaction data is an example of a sparse data matrix
that has only 0–1 entries. Another common example is document data. If the
order of the terms (words) in a document is ignored, then a document can be
represented as a term vector, where each term is a component (attribute) of
the vector and the value of each component is the number of times the
corresponding term occurs in the document. This representation of a collection of
documents is often called a document-term matrix.
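The following short sketch builds a document-term matrix by hand for three invented documents: rows correspond to documents, columns to terms, and each entry counts term occurrences, so most entries are zero (a sparse data matrix).

```python
# Building a small document-term matrix: rows are documents, columns are
# terms, entries count how often each term occurs. The documents are invented.
from collections import Counter

docs = [
    "data mining finds patterns in data",
    "clustering groups similar objects",
    "association mining finds frequent patterns",
]

tokenized = [d.split() for d in docs]
terms = sorted(set(t for doc in tokenized for t in doc))   # fixed column order

matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[t] for t in terms])              # mostly zeros: sparse

print(terms)
for row in matrix:
    print(row)
```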
Ordered Data
For some types of data, the attributes have relationships that involve order in
time or space. Sequential Data Sequential data, also referred to as temporal data, can be thought of as an extension of record data, where each record has a time associated with it. Consider a retail transaction data set that also stores the time at which the transaction took place. This time information makes it possible to find patterns such as “candy sales peak before Halloween.” As an example of sequential transaction data, suppose there are five different times—t1, t2, t3, t4, and t5;
three different customers—C1, C2, and C3; and five different items—A, B, C,
D, and E. In the top table, each row corresponds to the items purchased at a
particular time by each customer. For instance, at time t3, customer C2
purchased items A and D. In the bottom table, the same information is displayed,
but each row corresponds to a particular customer. Each row contains
information on each transaction involving the customer, where a transaction is
considered to be a set of items and the time at which those items were purchased.
For example, customer C3 bought items A and C at time t2.
b) Z-Score Normalization
The values for an attribute, A, are normalized based on the mean (i.e., average) and standard deviation of A. A value, v_i, of A is normalized to v_i' by computing
v_i' = (v_i - μ_A) / σ_A
where μ_A and σ_A are the mean and standard deviation, respectively, of attribute A.
Example (z-score normalization). Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to
(73,600 - 54,000) / 16,000 = 1.225.
What is the need for dimensionality reduction? Explain any two techniques for dimensionality reduction.
Dimensionality Reduction:
In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced or “compressed” representation of the original data.
Dimension Reduction Types
➢ Lossless - If the original data can be reconstructed from the compressed data without any loss of information, the reduction is called lossless.
➢ Lossy - If only an approximation of the original data can be reconstructed from the compressed data, the reduction is called lossy.
Principal Component Analysis (PCA)
In the figure, Y1 and Y2 are the principal components for a given set of data originally
mapped to the axes X1 and X2. This information helps identify groups or patterns within the data. The axes are sorted such that the first axis shows the most variance among the data, the second axis shows the next highest variance, and so on.
• The size of the data can be reduced by eliminating the weaker components.
Advantage of PCA
• PCA is computationally inexpensive
• Multidimensional data of more than two dimensions can be
handled by reducing the problem to two dimensions.
• Principal components may be used as inputs to multiple regression and cluster
analysis.
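A minimal PCA sketch using scikit-learn is given below; the random data set, the choice of two components, and the library call are assumptions made for illustration, not part of the notes. It shows how the components are ordered by explained variance and how dropping the weaker components reduces the data.

```python
# A minimal PCA sketch: the principal components are new axes ordered by the
# variance they capture; weaker components are dropped to reduce the data.
# The random data set here is invented for illustration only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 objects, 5 attributes
X[:, 1] = X[:, 0] * 2 + rng.normal(scale=0.1, size=100)   # make two attributes correlated

pca = PCA(n_components=2)              # keep only the two strongest components
Y = pca.fit_transform(X)               # project the data onto Y1 and Y2

print(pca.explained_variance_ratio_)   # variance captured by each component
print(Y.shape)                         # reduced representation: (100, 2)
```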
Discuss data transformation in detail with suitable examples. Suppose that a hospital tested the age and body fat data for 18 randomly selected adults with the following results:
age 23 23 27 27 39 41 47 49 50
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2
age 52 54 54 56 57 58 58 60 61
%fat 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7
a. Normalize the two attributes based on z-score normalization.
Data transformation is a crucial step in data preprocessing that involves converting raw
data into a more suitable format for analysis, visualization, or modeling. One common
technique for data transformation is z-score normalization, also known as standardization.
Z-score normalization standardizes the data by subtracting the mean and dividing by the
standard deviation. This process transforms the data into a standard normal distribution
with a mean of 0 and a standard deviation of 1.
Let's go through the process of normalizing the age and body fat percentage data using z-
score normalization:
1. Calculate the Mean and Standard Deviation:
o Using the formulas above (with the population standard deviation), we first calculate the mean and standard deviation for age and body fat percentage.
o For age:
▪ Mean (μ_age) ≈ 46.44
▪ Standard Deviation (σ_age) ≈ 12.85
o For body fat percentage:
▪ Mean (μ_fat) ≈ 28.78
▪ Standard Deviation (σ_fat) ≈ 8.99
2. Normalize the Data:
o For each age value (age_i), calculate the z-score using the formula: z_age = (age_i - μ_age) / σ_age
o For each body fat percentage value (fat_i), calculate the z-score using the formula: z_fat = (fat_i - μ_fat) / σ_fat
o Repeat this calculation for all age and body fat percentage values in the dataset.
3. Normalized Data:
o After normalization, each attribute will have a mean of 0 and a standard deviation of 1. Once normalized, the data are on a standard normal scale, allowing easier comparison and analysis across different attributes.
o After performing the calculations, we obtain the normalized values for age and body fat percentage.
Age (z-scores): -1.83, -1.83, -1.51, -1.51, -0.58, -0.42, 0.04, 0.20, 0.28, 0.43, 0.59, 0.59, 0.74, 0.82, 0.90, 0.90, 1.06, 1.13
Body Fat Percentage (z-scores): -2.14, -0.25, -2.33, -1.22, 0.29, -0.32, -0.15, -0.18, 0.27, 0.65, 1.53, 0.00, 0.51, 0.16, 0.59, 0.46, 1.38, 0.77
These normalized values have a mean of 0 and a standard deviation of 1, making them
suitable for further analysis or modelling.
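For reference, the calculation above can be checked with a short Python snippet. It assumes the population standard deviation (dividing by N); dividing by N - 1 (the sample standard deviation) would give slightly different z-scores.

```python
# A short check of the z-score normalization above, using the population
# standard deviation (dividing by N).
import statistics

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
       34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]

def z_scores(values):
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)          # population standard deviation
    return mu, sigma, [round((v - mu) / sigma, 3) for v in values]

mu_age, sigma_age, z_age = z_scores(age)
mu_fat, sigma_fat, z_fat = z_scores(fat)

print(round(mu_age, 2), round(sigma_age, 2))   # about 46.44 and 12.85
print(round(mu_fat, 2), round(sigma_fat, 2))   # about 28.78 and 8.99
print(z_age)
print(z_fat)
```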
Addressing these issues requires a careful understanding of the problem domain, appropriate
algorithm selection, and consideration of the characteristics of the data involved.
Additionally, advancements in machine learning and data analysis techniques continue to
provide solutions and improvements in handling proximity-related challenges.