DATA MINING UNIT I NOTES
1. DATA MINING OVERVIEW
Data Mining: Data mining is a process of extracting and discovering patterns in large data sets.
Data mining is also known as Knowledge Discovery from Data.
The knowledge discovery from data (KDD) process is an iterative sequence of the
following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved)
4. Data transformation (where data are transformed and consolidated into forms appropriate for mining)
5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge)
7. Knowledge presentation (where visualization and knowledge-representation techniques are used to present the mined knowledge to users)
Steps 1 through 4 are different forms of data preprocessing, where data are prepared for
mining. The data mining step may interact with the user or a knowledge base. The interesting
patterns are presented to the user and may be stored as new knowledge in the knowledge base.
Data mining is the process of discovering interesting patterns and knowledge from large
amounts of data. The data sources can include databases, data warehouses, the Web, other
information repositories, or data that are streamed into the system dynamically.
Data mining helps organizations meet business goals such as:
Increasing revenue.
Understanding customer segments and preferences.
Acquiring new customers.
Improving cross-selling and up-selling.
Detecting fraud.
Identifying credit risks.
Monitoring operational performance.
Data mining means extracting or mining knowledge from huge amounts of data. It is
generally used where large amounts of data are stored and processed; for example, the
banking system uses data mining to analyze the huge volumes of transaction data it processes
constantly.
In data mining, hidden patterns are extracted from data across multiple categories and
turned into useful information. The data are typically assembled in a common area such as a
data warehouse, where data mining algorithms are applied to analyze them. The resulting
insights support effective decisions that cut costs and increase revenue.
2. KINDS OF DATA
Data mining can be performed on various kinds of data, such as relational databases, data
warehouses, transactional data, and other information repositories. For relational databases,
an entity-relationship (ER) data model is generally constructed; an ER data model defines the
database as a set of entities and their relationships.
3. DATA MINING FUNCTIONALITIES
Data mining functionalities are used to specify the kinds of patterns to be found in data
mining tasks. In general, such tasks can be classified into two categories: descriptive and
predictive. Descriptive mining tasks characterize properties of the data in a target data set.
Predictive mining tasks perform induction on the current data in order to make predictions.
Data mining functionalities, and the kinds of patterns they can discover, are described below.
i. Class/Concept Description: Characterization and Discrimination
For example, in the AllElectronics store, classes of items for sale include computers and
printers, and concepts of customers include bigSpenders and budgetSpenders. It can be useful to
describe individual classes and concepts in summarized, concise, and yet precise terms. Such
descriptions of a class or a concept are called class/concept descriptions. These descriptions
can be derived using (1) data characterization, by summarizing the data of the class under
study (often called the target class) in general terms; (2) data discrimination, by comparison
of the target class with one or a set of comparative classes (often called the contrasting
classes); or (3) both data characterization and discrimination.
The output of data characterization can be presented in various forms. Examples include pie
charts, bar charts, curves, multidimensional data cubes, and multidimensional tables, including
crosstabs. The resulting descriptions can also be presented as generalized relations or in rule form.
ii. Mining Frequent Patterns, Associations, and Correlations
There are many kinds of frequent patterns, including frequent itemsets, frequent
subsequences (also known as sequential patterns), and frequent substructures.
A frequent itemset typically refers to a set of items that often appear together in a
transactional data set—for example, milk and bread, which are frequently bought together in
grocery stores by many customers.
A frequently occurring subsequence, such as the pattern that customers tend to purchase
first a laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential
pattern.
A substructure can refer to different structural forms (e.g., graphs, trees, or lattices) that may
be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a
(frequent) structured pattern.
Mining frequent patterns leads to the discovery of interesting associations and correlations
within data.
Example : Association analysis. Suppose that, as a marketing manager at AllElectronics, you
want to know which items are frequently purchased together (i.e., within the same transaction).
An example of such a rule, mined from the AllElectronics transactional database, is
buys(X, "computer") ⇒ buys(X, "software") [support = 1%, confidence = 50%]
where X is a variable representing a customer. A confidence of 50% means that if a customer
buys a computer, there is a 50% chance that she will buy software as well. A 1% support means
that 1% of all the transactions under analysis show that computer and software are purchased
together.
iii. Classification and Regression for Predictive Analysis
Classification is the process of finding a model (or function) that describes and distinguishes
data classes or concepts. The model is derived based on the analysis of a set of training data (i.e.,
data objects for which the class labels are known). The model is used to predict the class label of
objects for which the class label is unknown.
“How is the derived model presented?” The derived model may be represented in various
forms, such as classification rules (i.e., IF-THEN rules), decision trees, mathematical formulae, or
neural networks. A decision tree is a flowchart-like tree structure, where each node denotes a
test on an attribute value, each branch represents an outcome of the test, and tree leaves
represent classes or class distributions.
Decision trees can easily be converted to classification rules. A neural network, when used
for classification, is typically a collection of neuron-like processing units with weighted
connections between the units.
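To make this concrete, here is a minimal classification sketch using scikit-learn (an assumption, not part of the original notes); the tiny customer data set, the feature names, and the class labels are invented for illustration. It fits a decision tree on labeled training data, converts the tree to IF-THEN rules, and predicts the class of an unseen object.

```python
# Fit a decision tree on labeled training data, print it as IF-THEN
# rules, and predict the class label of an unlabeled object.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 30], [42, 80], [35, 60], [23, 20], [51, 95], [33, 40]]  # [age, income]
y = ["budgetSpender", "bigSpender", "bigSpender",
     "budgetSpender", "bigSpender", "budgetSpender"]

model = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(model, feature_names=["age", "income"]))
print(model.predict([[40, 70]]))   # class label for an unseen object
```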
Regression analysis is a statistical methodology that is most often used for numeric
prediction, although other methods exist as well. Regression also encompasses the identification
of distribution trends based on the available data.
iv. Cluster Analysis
Unlike classification and regression, which analyze class-labeled (training) data sets,
clustering analyzes data objects without consulting class labels. In many cases, class labeled data
may simply not exist at the beginning. Clustering can be used to generate class labels for a group
of data. The objects are clustered or grouped based on the principle of maximizing the intraclass
similarity and minimizing the interclass similarity. That is, clusters of objects are formed so that
objects within a cluster have high similarity in comparison to one another, but are rather dissimilar
to objects in other clusters.
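A minimal clustering sketch, assuming scikit-learn is available; the six points and the choice of two clusters are made up for demonstration.

```python
# Group unlabeled objects so that intra-cluster similarity is high and
# inter-cluster similarity is low; the labels are generated, not given.
from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)           # generated cluster labels, e.g. [1 1 1 0 0 0]
print(km.cluster_centers_)  # one centroid per cluster
```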
v. Outlier Analysis
A data set may contain objects that do not comply with the general behavior or model of the
data. These data objects are outliers. Many data mining methods discard outliers as noise or
exceptions. However, in some applications (e.g., fraud detection) the rare events can be more
interesting than the more regularly occurring ones. The analysis of outlier data is referred to as
outlier analysis or anomaly mining.
A pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test
data with some degree of certainty, (3) potentially useful, and (4) novel. A pattern is also
interesting if it validates a hypothesis that the user sought to confirm. An interesting pattern
represents knowledge.
Several objective measures of pattern interestingness exist. These are based on the
structure of discovered patterns and the statistics underlying them. An objective measure for
association rules of the form X ⇒ Y is rule support, representing the percentage of transactions
from a transaction database that the given rule satisfies. This is taken to be the probability P(X
∪Y), where X ∪Y indicates that a transaction contains both X and Y, that is, the union of item sets
X and Y. Another objective measure for association rules is confidence, which assesses the degree
of certainty of the detected association. This is taken to be the conditional probability P(Y|X), that
is, the probability that a transaction containing X also contains Y. More formally, support and
confidence are defined as
support(X ⇒ Y) = P(X ∪ Y)
confidence(X ⇒ Y) = P(Y|X)
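As a sketch of how these two measures can be computed from raw transactions; the toy transaction list and the rule {computer} ⇒ {software} are invented for illustration.

```python
# Rule support and confidence over a toy transaction set.
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "printer"},
    {"milk", "bread"},
]
X, Y = {"computer"}, {"software"}

n_xy = sum(1 for t in transactions if X | Y <= t)  # transactions containing X and Y
n_x = sum(1 for t in transactions if X <= t)       # transactions containing X

print(n_xy / len(transactions))   # support    = P(X ∪ Y) = 0.5
print(n_xy / n_x)                 # confidence = P(Y|X)  ≈ 0.67
```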
4. TECHNOLOGIES USED
As a highly application-driven domain, data mining has incorporated many techniques from
other domains such as statistics, machine learning, pattern recognition, database and data
warehouse systems, information retrieval, visualization, algorithms, high-performance computing,
and many application domains.
i. Statistics
Statistics studies the collection, analysis, interpretation, and presentation of data. Data
mining has an inherent connection with statistics: a statistical model is a set of mathematical
functions that describe the behavior of the objects in a target class in terms of random
variables and their associated probability distributions.
ii. Machine Learning
Machine learning investigates how computers can learn (or improve their performance) based
on data. A main research area is for computer programs to automatically learn to recognize
complex patterns and make intelligent decisions based on data.
Supervised learning is basically a synonym for classification. The supervision in the learning
comes from the labeled examples in the training data set.
Unsupervised learning is essentially a synonym for clustering. The learning process is
unsupervised since the input examples are not class labeled. Typically, we may use clustering
to discover classes within the data.
Semi-supervised learning is a class of machine learning techniques that make use of both
labeled and unlabeled examples when learning a model. In one approach, labeled examples
are used to learn class models and unlabeled examples are used to refine the boundaries
between classes. For a two-class problem, we can think of the set of examples belonging to
one class as the positive examples and those belonging to the other class as the negative
examples (a small sketch of this idea appears after this list).
Active learning is a machine learning approach that lets users play an active role in the
learning process. An active learning approach can ask a user (e.g., a domain expert) to label
an example, which may be from a set of unlabeled examples or synthesized by the learning
program. The goal is to optimize the model quality by actively acquiring knowledge from
human users, given a constraint on how many examples they can be asked to label.
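A small sketch of the semi-supervised idea described above, assuming scikit-learn's SelfTrainingClassifier (which follows the convention of marking unlabeled examples with -1); the one-dimensional data set is invented for illustration.

```python
# Labeled examples seed the class models; confident predictions on the
# unlabeled examples are then used to refine the class boundary.
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.tree import DecisionTreeClassifier

X = [[0.0], [0.2], [0.4], [0.6], [0.8], [1.0]]
y = [0, -1, -1, -1, -1, 1]   # -1 marks an unlabeled example

clf = SelfTrainingClassifier(DecisionTreeClassifier()).fit(X, y)
print(clf.predict([[0.3], [0.7]]))
```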
iii. Database Systems and Data Warehouses
Database systems research focuses on the creation, maintenance, and use of databases for
organizations and end-users. Particularly, database systems researchers have established highly
recognized principles in data models, query languages, query processing and optimization
methods, data storage, and indexing and accessing methods. Database systems are often well
known for their high scalability in processing very large, relatively structured data sets.
5. DATA MINING APPLICATIONS
Data mining is widely used in diverse areas. There are a number of commercial data
mining system available today and yet there are many challenges in this field.
Data Mining Applications :
i. Financial Data Analysis :
The financial data in the banking and financial industry are generally reliable and of high
quality, which facilitates systematic data analysis and data mining.
Design and construction of data warehouses for multidimensional data analysis and
data mining.
Loan payment prediction and customer credit policy analysis.
Classification and clustering of customers for targeted marketing.
ii. Retail Industry :
Data mining has a great application in the retail industry because the industry collects large
amounts of data on sales, customer purchasing history, goods transportation, consumption, and
services.
Design and Construction of data warehouses based on the benefits of data mining.
Multidimensional analysis of sales, customers, products, time and region.
Customer Retention.
Product recommendation and cross-referencing of items.
iii. Telecommunication Industry :
The telecommunication industry is one of the fastest-growing industries, providing
various services such as fax, pager, cellular phone, Internet messenger, images, email, and web
data transmission.
Multidimensional Analysis of Telecommunication data.
Identification of unusual patterns.
Multidimensional association and sequential patterns analysis.
Mobile Telecommunication services.
iv. Biological Data Analysis :
Biological data mining is a very important part of Bioinformatics.
Semantic integration of heterogeneous databases.
Alignment, indexing, similarity search and comparative analysis.
Discovery of structural patterns and analysis of genetic networks.
Association and path analysis.
Visualization tools in genetic data analysis.
v. Other Scientific Applications :
Data mining applications in other scientific fields include:
Data Warehouses and Data Preprocessing.
Graph-based mining.
Visualization and domain specific knowledge.
vi. Intrusion Detection :
Intrusion refers to any kind of action that threatens integrity, confidentiality, or the availability
of network resources.
Development of data mining algorithm for intrusion detection.
Association and correlation analysis, aggregation to help select and build
discriminating attributes.
Analysis of Stream data.
Distributed data mining.
Visualization and query tools.
6. MAJOR ISSUES IN DATA MINING
Data mining is not an easy task: the algorithms used can get very complex, and data are
not always available in one place; they may need to be integrated from various
heterogeneous data sources.
These factors also create some issues. The major issues are −
i. Mining Methodology and User Interaction.
ii. Performance Issues.
iii. Diverse Data Types Issues.
7. DATA OBJECTS AND ATTRIBUTE TYPES
Data sets are made up of data objects, and a data object represents an entity. Attributes are
data fields that represent characteristics or features of a data object. The type of an attribute
can be:
i. Nominal
ii. Ordinal
iii. Interval
iv. Ratio
i. Nominal Attribute
The values of a nominal attribute are symbols or names of things.
– Each value represents some kind of category, code, or state.
Nominal attributes are also referred to as categorical attributes.
The values of nominal attributes do not have any meaningful order.
Example: The attribute marital_status can take on the values single, married,
divorced, and widowed.
Because nominal attribute values do not have any meaningful order and are not
quantitative:
o It makes no sense to find the mean (average) value or median (middle) value
for such an attribute.
o However, we can find the attribute’s most commonly occurring value (mode).
A binary attribute is a special nominal attribute with only two states: 0 or 1.
A binary attribute is symmetric if both of its states are equally valuable and carry the same
weight.
– Example: the attribute gender having the states male and female.
• A binary attribute is asymmetric if the outcomes of the states are not equally important.
Example: Positive and negative outcomes of a medical test for HIV.
ii. Ordinal Attribute
An ordinal attribute is an attribute with possible values that have a meaningful order
or ranking among them, but the magnitude between successive values is not known.
Example: An ordinal attribute drink_size corresponds to the size of drinks available at
a fast-food restaurant.
o This attribute has three possible values: small, medium, and large.
o The values have a meaningful sequence (which corresponds to increasing
drink size); however, we cannot tell from the values how much bigger, say,
a large is than a medium.
The central tendency of an ordinal attribute can be represented by its mode and its
median (middle value in an ordered sequence), but the mean cannot be defined.
iii. Interval Attribute
Interval attributes are measured on a scale of equal-size units. We can compare and
quantify the difference between values of interval attributes.
Example: A temperature attribute is an interval attribute.
– We can quantify the difference between values. For example, a temperature
of 20°C is five degrees higher than a temperature of 15°C.
– Temperatures in Celsius do not have a true zero-point; that is, 0°C does not
indicate “no temperature.”
o Although we can compute the difference between temperature values, we
cannot talk of one temperature value as being a multiple of another.
Without a true zero, we cannot say, for instance, that 10°C is twice as warm as 5°C.
That is, we cannot speak of the values in terms of ratios.
The central tendency of an interval attribute can be represented by its mode, its
median (middle value in an ordered sequence), and its mean.
iv. Ratio Attribute
A ratio attribute is a numeric attribute with an inherent zero-point. That is, if a
measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another
value. The values are ordered, and we can compute the difference between values, as well as
the mean, median, and mode. Examples include counts, monetary quantities, and temperature
in Kelvin.
8. BASIC STATISTICAL DESCRIPTIONS OF DATA
Basic statistical descriptions can be used to identify properties of the data and
highlight which data values should be treated as noise or outliers.
For data preprocessing tasks, we want to learn about data characteristics regarding
both central tendency and dispersion of the data.
Measures of central tendency include mean, median, mode, and midrange.
Measures of data dispersion include quartiles, interquartile range (IQR), and
variance.
These descriptive statistics are of great help in understanding the distribution of the
data.
Although the mean is the single most useful quantity for describing a data set, it is not
always the best way of measuring the center of the data.
o A major problem with the mean is its sensitivity to extreme (outlier) values.
o Even a small number of extreme values can corrupt the mean.
To offset the effect caused by a small number of extreme values, we can instead use the
trimmed mean, which is obtained after chopping off values at the high and low extremes.
Example: What are the central tendency measures (mean, median, mode) for the following
attributes?
attr1 = {2, 4, 4, 6, 8, 24}
mean = (2 + 4 + 4 + 6 + 8 + 24) / 6 = 8 (average of all values)
median = (4 + 6) / 2 = 5 (average of the two middle values)
mode = 4 (most frequent value)
attr2 = {2, 4, 7, 10, 12}
mean = (2 + 4 + 7 + 10 + 12) / 5 = 7 (average of all values)
median = 7 (middle value)
mode = none (all values have the same frequency)
Measures of dispersion:
– Interquartile range: IQR = Q3 − Q1
– Five-number summary: Minimum, Q1, Median, Q3, Maximum
– Variance and standard deviation (sample: s, population: σ): the variance of N
observations x1, ..., xN with mean μ is
σ² = (1/N) Σ (xi − μ)²
and the standard deviation is the square root of the variance. For a sample of n
observations with mean x̄, the sample variance is
s² = (1/(n − 1)) Σ (xi − x̄)²
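As a quick check, the central tendency values from the worked example above and the dispersion measures just defined can be computed with Python's standard statistics module (a minimal sketch; the data are the attr1 values from the example).

```python
# Verifying the worked example with Python's standard library.
import statistics as st

attr1 = [2, 4, 4, 6, 8, 24]

print(st.mean(attr1))      # 8    (average of all values)
print(st.median(attr1))    # 5.0  (average of the two middle values)
print(st.mode(attr1))      # 4    (most frequent value)

# Dispersion: population variance (1/N) * sum((xi - mu)^2) and its
# square root, the population standard deviation.
print(st.pvariance(attr1))
print(st.pstdev(attr1))
```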
i. Quartiles
Suppose that a set of observations for a numeric attribute X is sorted in increasing order.
Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal size consecutive sets.
The 100-quantiles are more commonly referred to as percentiles; they divide the data
distribution into 100 equal-sized consecutive sets.
Quartiles: The 4-quantiles are the three data points that split the data distribution into
four equal parts; each part represents one-fourth of the data distribution.
ii. Outliers
Outliers can be identified with the help of the interquartile range or standard-deviation measures.
– Suspected outliers are values falling at least 1.5 × IQR above the third quartile
or below the first quartile.
– Suspected outliers are values that fall outside of the range of μ–Nσ and μ+Nσ
where μ is mean and σ is standard deviation. N can be chosen as 2.5.
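Both rules can be sketched with the standard library alone; note that statistics.quantiles supports several quantile conventions, so the exact cut-offs may differ slightly from hand calculations (the data below reuse attr1 from the earlier example).

```python
# Flagging suspected outliers with the IQR rule and the mu +/- N*sigma rule.
import statistics as st

data = [2, 4, 4, 6, 8, 24]

# statistics.quantiles with n=4 returns the three quartiles [Q1, Q2, Q3].
q1, _, q3 = st.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print([x for x in data if x < low or x > high])   # IQR-rule suspects

# Values outside mu - N*sigma .. mu + N*sigma, with N chosen as 2.5.
mu, sigma = st.mean(data), st.pstdev(data)
print([x for x in data if abs(x - mu) > 2.5 * sigma])
```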
The normal distribution curve: (μ: mean, σ: standard deviation)
– From μ–σ to μ+σ: contains about 68% of the measurements
– From μ–2σ to μ+2σ: contains about 95% of the measurements
– From μ–3σ to μ+3σ: contains about 99.7% of the measurements
iii. Boxplot Analysis
• Five-number summary of a distribution:
Minimum, Q1, Median, Q3, Maximum
• Boxplots are a popular way of visualizing a distribution, and a boxplot incorporates
the five-number summary:
– The ends of the box are at the quartiles Q1 and Q3, so that the box length is
the interquartile range, IQR.
– The median is marked by a line within the box.
– Two lines outside the box extend to the smallest and largest observations
(outliers are excluded). Outliers are marked separately.
• If there are no outliers, lower extreme line is the smallest observation (Minimum) and
upper extreme line is the largest observation (Maximum).
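A minimal boxplot sketch, assuming matplotlib is available; the data reuse the attr1 example, and the output file name is arbitrary.

```python
# Boxplot: box from Q1 to Q3, median line inside, whiskers to the most
# extreme observations within 1.5*IQR, outliers drawn as separate points.
import matplotlib.pyplot as plt

data = [2, 4, 4, 6, 8, 24]
plt.boxplot(data)
plt.ylabel("value")
plt.savefig("boxplot.png")
```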
8.3 Graphic Display of Basic Statistical Description of the Data
These include quantile plots, quantile–quantile plots, histograms, and scatter plots. Such
graphs are helpful for the visual inspection of data, which is useful for data preprocessing. The
first three of these show univariate distributions (i.e., data for one attribute), while scatter plots
show bivariate distributions (i.e., involving two attributes).
i. Quantile Plots
Displays all of the data (allowing the user to assess both the overall behavior and
unusual occurrences)
• Plots quantile information
– For data xi sorted in increasing order, fi indicates that approximately 100 fi%
of the data are below or equal to the value xi.
ii. Quantile–Quantile (q-q) Plots
A quantile–quantile plot graphs the quantiles of one univariate distribution against the
corresponding quantiles of another. Such a plot often includes a straight line that represents
the case where, for each given quantile, the value (e.g., the unit price at each branch) is the
same in both distributions.
iii. Scatter Plot
A scatter plot is one of the most effective graphical methods for determining if there
appears to be a relationship, pattern, or trend between two numeric attributes.
– To construct a scatter plot, each pair of values is treated as a pair of coordinates in an
algebraic sense and plotted as points in the plane.
• The scatter plot is a useful method for providing a first look at bivariate data to see
clusters of points and outliers, or to explore the possibility of correlation relationships.
• Two attributes, X and Y, are correlated if one attribute implies the other.
• Correlations can be positive, negative, or null (uncorrelated).
iv. Histograms
• A histogram represents the frequencies of values of a variable bucketed into ranges.
• A histogram is similar to a bar chart, but it groups the values into continuous
ranges.
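The two plot types can be sketched with matplotlib as follows; the data values are invented purely for illustration.

```python
# Left: histogram bucketing values into continuous ranges.
# Right: scatter plot of (x, y) pairs to eyeball correlation.
import matplotlib.pyplot as plt

values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 7, 8, 8, 9]
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 5, 4, 6, 7, 8, 9]   # y roughly increases with x

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(values, bins=4)
ax1.set_title("Histogram")
ax2.scatter(x, y)
ax2.set_title("Scatter plot")
fig.savefig("plots.png")
```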
9. MEASURING DATA SIMILARITY AND DISSIMILARITY.
Similarity
• The similarity between two objects is a numerical measure of the degree to which
the two objects are alike.
• Similarities are higher for pairs of objects that are more alike.
• Similarities are usually non-negative and are often between 0 (no similarity) and 1
(complete similarity).
Dissimilarity
• The dissimilarity between two objects is a numerical measure of the degree to which
the two objects are different.
• Dissimilarities are lower for more similar pairs of objects.
• The term distance is used as a synonym for dissimilarity, although distance is
often used to refer to a special class of dissimilarities.
• Dissimilarities sometimes fall in the interval [0,1], but it is also common for them to
range from 0 to ∞.
Proximity refers to either a similarity or a dissimilarity.
ii. Dissimilarities Between Data Objects – Distance on Numeric Data: Euclidean Distance
The Euclidean distance between two data objects x and y is
d(x, y) = sqrt( (x1 − y1)² + (x2 − y2)² + ... + (xn − yn)² )
where n is the number of attributes and xk and yk are the kth attributes of data objects x and y.
• Normally the attributes are numeric.
• Standardization is necessary if the scales of the attributes differ.
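The formula translates directly into Python (standard library only; the two sample points are arbitrary).

```python
import math

def euclidean(x, y):
    """d(x, y) = sqrt(sum over k of (xk - yk)^2), for n numeric attributes."""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

print(euclidean([1, 2], [3, 5]))   # sqrt(4 + 9) = 3.605...
```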
iv. Cosine Similarity
Cosine similarity measures the similarity of two vectors x and y by the cosine of the angle
between them:
cos(x, y) = (x · y) / (||x|| ||y||)
where x · y is the dot product and ||x|| is the Euclidean norm of x.
Example: For the two document vectors
x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0) and y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1),
x · y = 25, ||x|| ≈ 6.48, ||y|| ≈ 4.12, so cos(x, y) ≈ 25 / (6.48 × 4.12) ≈ 0.94.
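The same computation in Python, checked against the worked example above (standard library only).

```python
import math

def cosine(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
y = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
print(round(cosine(x, y), 2))   # 0.94
```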