UNIT-2
Knowledge Discovery from Data (KDD) is closely related to data mining: some people treat data mining as a synonym for KDD, while others view data mining as an essential step in the process of knowledge discovery. Here is the list of steps involved in the knowledge discovery process −
• Data Cleaning − In this step, noise and inconsistent data are removed.
• Data Integration − In this step, multiple data sources are combined.
• Data Selection − In this step, data relevant to the analysis task are retrieved
from the database.
• Data Transformation − In this step, data is transformed or consolidated
into forms appropriate for mining by performing summary or aggregation
operations.
• Data Mining − In this step, intelligent methods are applied in order to
extract data patterns.
• Pattern Evaluation − In this step, data patterns are evaluated.
• Knowledge Presentation − In this step, knowledge is represented.
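The steps above can be illustrated with a small, hypothetical pipeline. The sketch below is not part of the original notes; the file names (sales.csv, customers.csv), the column names, and the use of k-means clustering are assumptions made only to show roughly how each KDD step might look in Python with pandas and scikit-learn.

```python
# A minimal sketch of the KDD steps; file names, columns, and the choice of
# k-means are illustrative assumptions, not part of the original notes.
import pandas as pd
from sklearn.cluster import KMeans

# Data Cleaning: remove noise (duplicates) and inconsistent (missing) values
sales = pd.read_csv("sales.csv").drop_duplicates().dropna()

# Data Integration: combine multiple data sources
customers = pd.read_csv("customers.csv")
data = sales.merge(customers, on="customer_id")

# Data Selection: keep only the attributes relevant to the analysis task
data = data[["customer_id", "age", "amount"]]

# Data Transformation: consolidate by aggregation (total spend per customer)
summary = data.groupby("customer_id").agg(age=("age", "first"),
                                          total_spend=("amount", "sum"))

# Data Mining: apply an intelligent method (here, clustering) to extract patterns
labels = KMeans(n_clusters=3, n_init=10).fit_predict(summary)

# Pattern Evaluation / Knowledge Presentation: inspect and report the clusters
summary["cluster"] = labels
print(summary.groupby("cluster").mean())
```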
1. Statistics:
It uses mathematical analysis to represent, model, and summarize empirical data or real-world observations.
Statistical analysis involves a collection of methods that can be applied to large amounts of data to draw conclusions and report trends.
2. Machine learning
Arthur Samuel defined machine learning as a field of study that gives computers the ability to learn without being explicitly programmed.
When new data are entered into the computer, machine learning algorithms allow the learned models to grow or change accordingly.
On the basis of the kind of data to be mined, there are two categories of
functions involved in Data Mining −
a) Descriptive
b) Classification and Prediction
a) Descriptive Function
The descriptive function deals with the general properties of data in the database.
Here is the list of descriptive functions −
1. Class/Concept Description
2. Mining of Frequent Patterns
3. Mining of Associations
4. Mining of Correlations
5. Mining of Clusters
1. Class/Concept Description
Class/Concept refers to the data to be associated with classes or concepts. For example, in a company, the classes of items for sale include computers and printers, and the concepts of customers include big spenders and budget spenders. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived by data characterization and data discrimination.
4. Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two item sets, to analyze whether they have a positive, negative, or no effect on each other.
5. Mining of Clusters
Cluster refers to a group of similar objects. Cluster analysis refers to forming groups of objects that are very similar to each other but highly different from the objects in other clusters.
b) Classification and Prediction
Classification is the process of finding a model that describes the data
classes or concepts. The purpose is to be able to use this model to predict the
class of objects whose class label is unknown. This derived model is based
on the analysis of sets of training data. The derived model can be presented in forms such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks.
6. Outlier Analysis − Outliers may be defined as the data objects that do not
comply with the general behavior or model of the data available.
Flat Files
• Flat files are defined as data files in text form or binary
form with a structure that can be easily extracted by
data mining algorithms.
• Data stored in flat files have no relationships or paths among themselves; for example, if a relational database is stored in flat files, there will be no relations between the tables.
• Flat files are described by a data dictionary. Eg: CSV file.
• Application: used in data warehousing to store data, used in carrying data to and from servers, etc.
Relational Databases
• A Relational database is defined as the collection of
data organized in tables with rows and columns.
• Physical schema in Relational databases is a schema
which defines the structure of tables.
• Logical schema in Relational databases is a schema
which defines the relationship among tables.
• Standard API of relational database is SQL.
• Application: Data Mining, ROLAP model, etc.
Data Warehouse
• A data warehouse is defined as a collection of data integrated from multiple sources that supports querying and decision making.
• There are three types of data warehouse: Enterprise Data Warehouse, Data Mart, and Virtual Warehouse.
1. Choice of Distance Metric: The choice of distance metric significantly impacts the
results of proximity calculations. Common distance metrics include Euclidean
distance, Manhattan distance, and cosine similarity. However, selecting an
inappropriate metric for the given data can lead to inaccurate proximity measures.
Different choices can lead to different proximity measures and ultimately affect the
analysis outcomes.
9. Boundary Effects: Proximity calculations near the boundaries of the dataset can be
problematic. Depending on the method used, the proximity of points near the edges of
the dataset may be underestimated, leading to biased results.
10. Temporal Dynamics: In dynamic datasets where objects or entities change over time,
maintaining accurate proximity calculations requires accounting for temporal
dynamics. Failing to consider temporal changes can lead to outdated or irrelevant
proximity measures.
1. Feature Construction:
o Interaction Features: Creating new features by combining existing features to
capture interactions or relationships between them.
o Derived Features: Generating derived features based on domain knowledge or
insights about the data. For example, calculating ratios, differences, or averages
of numerical variables.
o Temporal Features: Extracting time-related features such as day of the week,
month, season, or time since a specific event occurred.
2. Feature Aggregation:
o Group Statistics: Calculating summary statistics (e.g., mean, median, standard
deviation) of numerical features within groups defined by categorical variables.
o Temporal Aggregations: Aggregating temporal data into higher-level intervals
(e.g., hourly, daily, weekly) and computing statistics within each interval.
3. Feature Encoding:
o Target Encoding: Encoding categorical variables based on the target variable's
mean or frequency within each category.
o Frequency Encoding: Encoding categorical variables based on their frequency
or occurrence in the dataset.
o Binary Encoding: Converting categorical variables into binary representations
using techniques like one-hot encoding or binary hashing.
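As a rough illustration of these feature engineering steps, the following Python/pandas sketch applies feature construction, group aggregation, and frequency, target, and one-hot encoding to a small made-up sales table; the column names and values are assumptions for the example only.

```python
# A small illustrative sketch of feature construction, aggregation, and
# encoding in pandas. The DataFrame and its columns are invented for the example.
import pandas as pd

df = pd.DataFrame({
    "price":     [10.0, 25.0, 8.0, 25.0],
    "quantity":  [2, 1, 5, 3],
    "city":      ["Hyd", "Vij", "Hyd", "Hyd"],
    "sale_date": pd.to_datetime(["2024-01-05", "2024-01-06",
                                 "2024-02-10", "2024-02-11"]),
    "sold":      [1, 0, 1, 1],
})

# 1. Feature construction: interaction and temporal features
df["revenue"] = df["price"] * df["quantity"]          # interaction feature
df["day_of_week"] = df["sale_date"].dt.day_name()     # temporal feature
df["month"] = df["sale_date"].dt.month

# 2. Feature aggregation: group statistics within a categorical variable
df["city_mean_price"] = df.groupby("city")["price"].transform("mean")

# 3. Feature encoding
df["city_freq"] = df["city"].map(df["city"].value_counts(normalize=True))  # frequency encoding
df["city_target_mean"] = df.groupby("city")["sold"].transform("mean")      # target encoding
df = pd.get_dummies(df, columns=["city"])                                    # one-hot (binary) encoding

print(df.head())
```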
Attribute: Colors
Values: Black, Green, Brown, Red
2. Binary Attributes: Binary data has only two values/states, for example, yes or no, affected or unaffected, true or false.
i) Symmetric: Both values are equally important (Gender).
ii) Asymmetric: Both values are not equally important (Result).
An interval-scaled attribute is measured on a scale of equal-sized units, but it has no true reference point (zero point). Data on an interval scale can be added and subtracted but cannot be multiplied or divided. Consider temperature in degrees Centigrade: if the temperature of one day is twice that of another day, we cannot say that one day is twice as hot as the other.
i. A ratio-scaled attribute is a numeric attribute with a fixed zero-point. If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. The values are ordered, and we can also compute the difference between values; the mean, median, mode, quantile range, and five-number summary can be given.
5. Discrete: Discrete data have finite values; they can be numerical or categorical. These attributes have a finite or countably infinite set of values.
Example
Attribute: Profession
Values: Teacher, Businessman, Peon
Indeed, the difficulties associated with analyzing high-dimensional data are sometimes referred to as the curse of dimensionality. Because of this, an important motivation in preprocessing the data is dimensionality reduction.
Sparsity For some data sets, such as those with asymmetric features, most attributes of an object have values of 0; in many cases, fewer than 1% of the entries are non-zero.
• Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1].
• Dissimilarity measure
– Numerical measure of how different two data objects are.
– Lower when objects are more alike.
– Minimum dissimilarity is often 0.
– Upper limit varies.
• Proximity refers to a similarity or dissimilarity.
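As a small illustration of these proximity measures, the sketch below computes two dissimilarity measures (Euclidean and Manhattan distance) and one similarity measure (cosine similarity) for a pair of made-up data objects; the vectors themselves are arbitrary.

```python
# A minimal sketch (not from the notes) showing how proximity between two
# data objects can be quantified with common measures.
import math

x = [3.0, 4.0, 0.0]
y = [1.0, 2.0, 2.0]

# Dissimilarity: Euclidean and Manhattan distances (0 means identical objects)
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
manhattan = sum(abs(a - b) for a, b in zip(x, y))

# Similarity: cosine similarity; for non-negative data it falls in [0, 1]
dot = sum(a * b for a, b in zip(x, y))
norm_x = math.sqrt(sum(a * a for a in x))
norm_y = math.sqrt(sum(b * b for b in y))
cosine = dot / (norm_x * norm_y)

print(euclidean, manhattan, cosine)
```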
1. Sampling:
Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random sample (or subset) of the data. Suppose that a large data set D contains N tuples. A Simple Random Sample Without Replacement (SRSWOR) of size n is created by drawing n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely to be sampled.
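A brief sketch of SRSWOR in Python is shown below; the data set D is an invented list of tuples, and random.sample is used because it draws n distinct tuples with equal probability, i.e., without replacement.

```python
# Simple random sampling without replacement (SRSWOR): each of the N tuples
# in D has the same chance of being drawn, and no tuple is drawn twice.
# The data set D here is a made-up list of tuples.
import random

D = [("t%d" % i, random.randint(18, 65)) for i in range(1, 1001)]  # N = 1000 tuples
N = len(D)
n = 50                               # desired sample size, n < N

sample = random.sample(D, n)         # SRSWOR: draws n distinct tuples uniformly
print(len(sample), sample[:3])
```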
Record Data:
Much data mining work assumes that the data set is a collection of records (data objects),
each of which consists of a fixed set of data fields (attributes). For the most basic form of
record data, there is no explicit relationship among records or data fields, and every
record (object) has the same set of attributes. Record data is usually stored either in flat
files or in relational databases.
Transaction Data A transaction is a record that involves a set of items; for example, the set of products purchased by a customer during one shopping trip are the items. This type of data is called market basket data because the items in each record are the products in a person’s “market basket.”
The Data Matrix If the data objects in a collection of data all have the
same fixed set of numeric attributes, then the data objects can be thought of as
points (vectors) in a multidimensional space, where each dimension represents a
distinct attribute describing the object. A set of such data objects can be
interpreted as an m by n matrix, where there are m rows, one for each object,
and n columns, one for each attribute. This matrix is called a data matrix or a
pattern matrix.
The Sparse Data Matrix A sparse data matrix is a special case of a data matrix
in which the attributes are of the same type and are asymmetric; i.e., only non-
zero values are important. Transaction data is an example of a sparse data matrix
that has only 0–1 entries. Another common example is document data. If the
order of the terms (words) in a document is ignored, then a document can be
represented as a term vector, where each term is a component (attribute) of
the vector and the value of each component is the number of times the
corresponding term occurs in the document. This representation of a collection of
documents is often called a document-term matrix.
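The following short sketch builds a document-term matrix by hand for three invented documents: rows correspond to documents, columns to terms, and each entry counts term occurrences, so most entries are zero (a sparse data matrix).

```python
# Building a small document-term matrix: rows are documents, columns are
# terms, entries count how often each term occurs. The documents are invented.
from collections import Counter

docs = [
    "data mining finds patterns in data",
    "clustering groups similar objects",
    "association mining finds frequent patterns",
]

tokenized = [d.split() for d in docs]
terms = sorted(set(t for doc in tokenized for t in doc))   # fixed column order

matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[t] for t in terms])              # mostly zeros: sparse

print(terms)
for row in matrix:
    print(row)
```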
Ordered Data
For some types of data, the attributes have relationships that involve order in
time or space. Sequential Data Sequential data, also referred to as temporal data, can be thought of as an extension of record data, where each record has a time associated with it. Consider a retail transaction data set that also stores the time at which the transaction took place. This time information makes it possible to find patterns such as “candy sales peak before Halloween.” As an example of sequential transaction data, suppose there are five different times—t1, t2, t3, t4, and t5;
three different customers—C1, C2, and C3; and five different items—A, B, C,
D, and E. In the top table, each row corresponds to the items purchased at a
particular time by each customer. For instance, at time t3, customer C2
purchased items A and D. In the bottom table, the same information is displayed,
but each row corresponds to a particular customer. Each row contains
information on each transaction involving the customer, where a transaction is
considered to be a set of items and the time at which those items were purchased.
For example, customer C3 bought items A and C at time t2.
b) Z-Score Normalization
The values for an attribute, A, are normalized based on the mean (i.e., average) and standard deviation of A. A value, v_i, of A is normalized to v_i' by computing
v_i' = (v_i - μ_A) / σ_A
where μ_A and σ_A are the mean and standard deviation, respectively, of attribute A.
Example (z-score normalization). Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to
(73,600 - 54,000) / 16,000 = 1.225.
What is the need for dimensionality reduction? Explain any two techniques for dimensionality reduction.
Dimensionality Reduction:
In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced or “compressed” representation of the original data.
Dimension Reduction Types
➢ Lossless - If the original data can be reconstructed from the compressed data without any loss of information, the reduction is called lossless.
➢ Lossy - If only an approximation of the original data can be reconstructed from the compressed data, the reduction is called lossy.
Principal Component Analysis (PCA)
In the figure, Y1 and Y2 are the principal components for a given set of data originally
mapped to the axes X1 and X2. This information helps identify groups or patterns within the data. The axes are sorted such that the first axis shows the most variance among the data, the second axis shows the next highest variance, and so on.
• The size of the data can be reduced by eliminating the weaker components.
Advantage of PCA
• PCA is computationally inexpensive
• Multidimensional data of more than two dimensions can be
handled by reducing the problem to two dimensions.
• Principal components may be used as inputs to multiple regression and cluster
analysis.
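A minimal PCA sketch using scikit-learn is given below; the random data set, the choice of two components, and the library call are assumptions made for illustration, not part of the notes. It shows how the components are ordered by explained variance and how dropping the weaker components reduces the data.

```python
# A minimal PCA sketch: the principal components are new axes ordered by the
# variance they capture; weaker components are dropped to reduce the data.
# The random data set here is invented for illustration only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 objects, 5 attributes
X[:, 1] = X[:, 0] * 2 + rng.normal(scale=0.1, size=100)   # make two attributes correlated

pca = PCA(n_components=2)              # keep only the two strongest components
Y = pca.fit_transform(X)               # project the data onto Y1 and Y2

print(pca.explained_variance_ratio_)   # variance captured by each component
print(Y.shape)                         # reduced representation: (100, 2)
```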
Discuss data transformation in detail with suitable examples. Suppose that a hospital tested the age and body fat data for 18 randomly selected adults with the following results:
age 23 23 27 27 39 41 47 49 50
%fat 9.5 26.5 7.8 17.8 31.4 25.9 27.4 27.2 31.2
age 52 54 54 56 57 58 58 60 61
%fat 34.6 42.5 28.8 33.4 30.2 34.1 32.9 41.2 35.7
a. Normalize the two attributes based on z-score normalization.
Data transformation is a crucial step in data preprocessing that involves converting raw
data into a more suitable format for analysis, visualization, or modeling. One common
technique for data transformation is z-score normalization, also known as standardization.
Z-score normalization standardizes the data by subtracting the mean and dividing by the
standard deviation. This process transforms the data into a standard normal distribution
with a mean of 0 and a standard deviation of 1.
Let's go through the process of normalizing the age and body fat percentage data using z-
score normalization:
1. Calculate the Mean and Standard Deviation:
o Using the formulas above (with the population standard deviation), we first calculate the mean and standard deviation for age and body fat percentage.
o For age:
▪ Mean (μ_age) ≈ 46.44
▪ Standard Deviation (σ_age) ≈ 12.85
o For body fat percentage:
▪ Mean (μ_fat) ≈ 28.78
▪ Standard Deviation (σ_fat) ≈ 8.99
2. Normalize the Data:
o For each age value (age_i), calculate the z-score using the formula: z_age = (age_i - μ_age) / σ_age
o For each body fat percentage value (fat_i), calculate the z-score using the formula: z_fat = (fat_i - μ_fat) / σ_fat
o Repeat this calculation for all age and body fat percentage values in the dataset.
3. Normalized Data:
o After normalization, each attribute will have a mean of 0 and a standard deviation of 1. Once normalized, the data are on a standard normal scale, allowing easier comparison and analysis across different attributes.
o After performing the calculations, we obtain the normalized values for age and body fat percentage.
Age (z-scores): -1.83, -1.83, -1.51, -1.51, -0.58, -0.42, 0.04, 0.20, 0.28, 0.43, 0.59, 0.59, 0.74, 0.82, 0.90, 0.90, 1.06, 1.13
Body Fat Percentage (z-scores): -2.14, -0.25, -2.33, -1.22, 0.29, -0.32, -0.15, -0.18, 0.27, 0.65, 1.53, 0.00, 0.51, 0.16, 0.59, 0.46, 1.38, 0.77
These normalized values have a mean of 0 and a standard deviation of 1, making them
suitable for further analysis or modelling.
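For reference, the calculation above can be checked with a short Python snippet. It assumes the population standard deviation (dividing by N); dividing by N - 1 (the sample standard deviation) would give slightly different z-scores.

```python
# A short check of the z-score normalization above, using the population
# standard deviation (dividing by N).
import statistics

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
       34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]

def z_scores(values):
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)          # population standard deviation
    return mu, sigma, [round((v - mu) / sigma, 3) for v in values]

mu_age, sigma_age, z_age = z_scores(age)
mu_fat, sigma_fat, z_fat = z_scores(fat)

print(round(mu_age, 2), round(sigma_age, 2))   # about 46.44 and 12.85
print(round(mu_fat, 2), round(sigma_fat, 2))   # about 28.78 and 8.99
print(z_age)
print(z_fat)
```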
Addressing these issues requires a careful understanding of the problem domain, appropriate
algorithm selection, and consideration of the characteristics of the data involved.
Additionally, advancements in machine learning and data analysis techniques continue to
provide solutions and improvements in handling proximity-related challenges.