data mining unit I notes

The document provides a comprehensive overview of data mining, detailing its processes, functionalities, and applications across various industries. It outlines key steps in the knowledge discovery process, types of data that can be mined, and the technologies utilized in data mining. Additionally, it discusses the advantages of data mining, major issues faced, and its relevance in fields such as finance, retail, telecommunications, and bioinformatics.

UNIT – I

1. Data Mining Overview
2. Kinds of Data That Can Be Mined
3. Data Mining Functionalities and Kinds of Patterns That Can Be Mined
4. Technologies Used
5. Data Mining Applications
6. Major Issues in Data Mining
7. Data Objects and Attribute Types
8. Basic Statistical Descriptions of Data
9. Measuring Data Similarity and Dissimilarity

1. DATA MINING OVERVIEW

Data Mining: Data mining is the process of extracting and discovering patterns in large data sets. Data mining is also known as Knowledge Discovery from Data.

– Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.

Alternative names: Knowledge Discovery in Databases (KDD), knowledge extraction, data/pattern analysis, business intelligence, etc.

The knowledge discovery process (KDD) is shown in Figure as an iterative sequence of the
following steps:

 Data cleaning – to remove noise and inconsistent data
 Data integration – where multiple data sources may be combined
 Data selection – where data relevant to the analysis task are retrieved from the database
 Data transformation – where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations
 Data mining – an essential process where intelligent methods are applied to extract data patterns
 Pattern evaluation – to identify the truly interesting patterns representing knowledge based on interestingness measures
 Knowledge presentation – where visualization and knowledge representation techniques are used to present the mined knowledge to users

Steps 1 through 4 (data cleaning through data transformation) are different forms of data preprocessing, where data are prepared for mining. The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base.
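The preprocessing steps above can be sketched as a toy pipeline. The store names, field names, and records below are all invented for illustration; this is a minimal sketch of the cleaning, integration, selection, and transformation stages, not a real mining system.

```python
# A toy sketch of KDD preprocessing: cleaning, integration,
# selection, and transformation on hypothetical sales records.

store_a = [{"cust": "c1", "item": "computer", "price": 900},
           {"cust": "c2", "item": "printer", "price": None},   # noisy record
           {"cust": "c1", "item": "software", "price": 120}]
store_b = [{"cust": "c3", "item": "computer", "price": 950}]

# 1. Data cleaning: drop records with missing values.
cleaned = [r for r in store_a if r["price"] is not None]

# 2. Data integration: combine the two sources.
integrated = cleaned + store_b

# 3. Data selection: keep only the fields relevant to the task.
selected = [(r["item"], r["price"]) for r in integrated]

# 4. Data transformation: aggregate into total revenue per item.
revenue = {}
for item, price in selected:
    revenue[item] = revenue.get(item, 0) + price

print(revenue)   # {'computer': 1850, 'software': 120}
```

The data mining step proper would then run on the aggregated `revenue`-style summaries rather than on the raw records.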

Data mining is the process of discovering interesting patterns and knowledge from large
amounts of data. The data sources can include databases, data warehouses, the Web, other
information repositories, or data that are streamed into the system dynamically.

Advantages of Data Mining

 Increasing revenue.
 Understanding customer segments and preferences.
 Acquiring new customers.
 Improving cross-selling and up-selling.
 Detecting fraud.
 Identifying credit risks.
 Monitoring operational performance.

2. KINDS OF DATA CAN BE MINED

Data mining refers to extracting or mining knowledge from huge amounts of data. It is generally used wherever a large amount of data is stored and processed; for example, banking systems apply data mining to the large volumes of transaction data they process constantly.

In data mining, hidden patterns in the data are examined across multiple categories and turned into useful knowledge. The data is assembled in a common area, such as a data warehouse, for analysis, and data mining algorithms are applied to it. The resulting knowledge supports effective decisions that cut costs and increase revenue.

The kinds of data that can be mined include the following −

 Relational Databases − A database system, also called a database management system, consists of a set of interrelated data, called a database, and a set of software programs to manage and access the data.
A relational database is a set of tables, each of which is assigned a unique name. Each table includes a set of attributes (columns or fields) and generally stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values. A semantic data model, such as an entity-relationship (ER) data model, is often constructed for relational databases. An ER data model defines the database as a set of entities and their relationships.

 Transactional Databases − A transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction (such as items purchased in a store).
The transactional database may have additional tables associated with it, containing other information about the sale, such as the date of the transaction, the customer ID number, and the ID numbers of the salesperson and of the branch at which the sale occurred.

 Object-Relational Databases − Object-relational databases are constructed based on an object-relational data model. This model extends the relational model by providing a rich data type for handling complex objects and object orientation.
 Temporal Databases − A temporal database typically stores relational data that include time-related attributes. These attributes may involve several timestamps, each having its own semantics.
 Sequence Databases − A sequence database stores sequences of ordered events, with or without a concrete notion of time. Examples include customer shopping sequences, Web click streams, and biological sequences.
 Time-Series Databases − A time-series database stores sequences of values or events obtained over repeated measurements of time (e.g., hourly, daily, weekly). Examples include data collected from the stock exchange, inventory control, and the observation of natural phenomena (like temperature and wind).

3. DATA MINING FUNCTIONALITIES AND KINDS OF PATTERNS CAN BE MINED


There are a number of data mining functionalities. These include:
o characterization and discrimination;
o mining of frequent patterns, associations, and correlations;
o classification and regression;
o cluster analysis;
o outlier analysis.

Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. In general, such tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize properties of the data in a target data set. Predictive mining tasks perform induction on the current data in order to make predictions.

Data mining functionalities, and the kinds of patterns they can discover, are described below.
i. Class/Concept Description: Characterization and Discrimination
For example, in the AllElectronics store, classes of items for sale include computers and
printers, and concepts of customers include bigSpenders and budgetSpenders. It can be useful to
describe individual classes and concepts in summarized, concise, and yet precise terms. Such
descriptions of a class or a concept are called class/concept descriptions.

These descriptions can be derived using


(1) data characterization, by summarizing the data of the class under study (often called the
target class) in general terms, or

(2) data discrimination, by comparison of the target class with one or a set of comparative
classes (often called the contrasting classes), or

(3) both data characterization and discrimination

The output of data characterization can be presented in various forms. Examples include pie
charts, bar charts, curves, multidimensional data cubes, and multidimensional tables, including
crosstabs. The resulting descriptions can also be presented as generalized relations or in rule form

ii. Mining Frequent Patterns, Associations, and Correlations


Frequent patterns, as the name suggests, are patterns that occur frequently in data.

There are many kinds of frequent patterns, including frequent itemsets, frequent
subsequences (also known as sequential patterns), and frequent substructures.

A frequent itemset typically refers to a set of items that often appear together in a
transactional data set—for example, milk and bread, which are frequently bought together in
grocery stores by many customers.

A frequently occurring subsequence, such as the pattern that customers tend to purchase first a laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.

A substructure can refer to different structural forms (e.g., graphs, trees, or lattices) that may
be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a
(frequent) structured pattern.

Mining frequent patterns leads to the discovery of interesting associations and correlations
within data.

Example : Association analysis. Suppose that, as a marketing manager at AllElectronics, you
want to know which items are frequently purchased together (i.e., within the same transaction).
An example of such a rule, mined from the AllElectronics transactional database, is

buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]

iii. Classification and Regression for Predictive Analysis

Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts. The model is derived based on the analysis of a set of training data (i.e., data objects for which the class labels are known). The model is then used to predict the class label of objects for which the class label is unknown.

“How is the derived model presented?” The derived model may be represented in various
forms, such as classification rules (i.e., IF-THEN rules), decision trees, mathematical formulae, or
neural networks. A decision tree is a flowchart-like tree structure, where each node denotes a
test on an attribute value, each branch represents an outcome of the test, and tree leaves
represent classes or class distributions.

Decision trees can easily be converted to classification rules. A neural network, when used
for classification, is typically a collection of neuron-like processing units with weighted
connections between the units.

Regression analysis is a statistical methodology that is most often used for numeric
prediction, although other methods exist as well. Regression also encompasses the identification
of distribution trends based on the available data.
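The idea of learning a classification rule from training data can be sketched with a one-rule "decision stump" (a one-node decision tree). This is only a minimal illustration, not a full tree learner; the attribute income, the class labels, and all training values are invented.

```python
# A minimal decision stump: learn a single IF-THEN rule
# (one threshold on one numeric attribute) from labeled data.

train = [(25, "budgetSpender"), (30, "budgetSpender"),
         (60, "bigSpender"), (80, "bigSpender")]

def learn_stump(data):
    """Pick the candidate threshold that classifies the most examples correctly."""
    best = None
    for thr, _ in data:
        correct = sum((x >= thr) == (y == "bigSpender") for x, y in data)
        if best is None or correct > best[1]:
            best = (thr, correct)
    return best[0]

threshold = learn_stump(train)

def predict(income):
    # The learned rule: IF income >= threshold THEN bigSpender.
    return "bigSpender" if income >= threshold else "budgetSpender"

print(threshold)     # 60
print(predict(70))   # bigSpender
print(predict(20))   # budgetSpender
```

The stump is exactly an IF-THEN classification rule of the kind mentioned above; a real decision tree stacks many such tests into a flowchart-like structure.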

iv. Cluster Analysis

Unlike classification and regression, which analyze class-labeled (training) data sets,
clustering analyzes data objects without consulting class labels. In many cases, class labeled data
may simply not exist at the beginning. Clustering can be used to generate class labels for a group
of data. The objects are clustered or grouped based on the principle of maximizing the intraclass
similarity and minimizing the interclass similarity. That is, clusters of objects are formed so that
objects within a cluster have high similarity in comparison to one another, but are rather dissimilar
to objects in other clusters.
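The grouping principle above can be sketched with a tiny one-dimensional k-means: points are repeatedly assigned to their nearest centroid (maximizing within-cluster similarity) and centroids are moved to the mean of their cluster. The points and starting centroids are invented for illustration.

```python
# A minimal 1-D k-means sketch on unlabeled points.

def kmeans_1d(points, centroids, iters=10):
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1, 2, 3, 20, 21, 22]
centroids, clusters = kmeans_1d(points, centroids=[0, 10])
print(centroids)   # [2.0, 21.0]
print(clusters)    # [[1, 2, 3], [20, 21, 22]]
```

Note that no class labels are used anywhere: the two groups emerge purely from the similarity of the values.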

v. Outlier Analysis

A data set may contain objects that do not comply with the general behavior or model of the
data. These data objects are outliers. Many data mining methods discard outliers as noise or
exceptions. However, in some applications (e.g., fraud detection) the rare events can be more
interesting than the more regularly occurring ones. The analysis of outlier data is referred to as
outlier analysis or anomaly mining.

vi. Are All Patterns Interesting?

A pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test
data with some degree of certainty, (3) potentially useful, and (4) novel. A pattern is also
interesting if it validates a hypothesis that the user sought to confirm. An interesting pattern
represents knowledge.

Several objective measures of pattern interestingness exist. These are based on the
structure of discovered patterns and the statistics underlying them. An objective measure for association rules of the form X ⇒ Y is rule support, representing the percentage of transactions
from a transaction database that the given rule satisfies. This is taken to be the probability P(X
∪Y), where X ∪Y indicates that a transaction contains both X and Y, that is, the union of item sets
X and Y. Another objective measure for association rules is confidence, which assesses the degree
of certainty of the detected association. This is taken to be the conditional probability P(Y|X), that
is, the probability that a transaction containing X also contains Y. More formally, support and
confidence are defined as

support(X ⇒ Y) = P(X ∪Y),

confidence(X ⇒ Y) = P(Y|X)
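The two measures defined above can be computed directly from a transaction database. The four transactions below are invented for illustration; support counts the fraction of transactions containing X ∪ Y, and confidence is the conditional probability P(Y|X).

```python
# Support and confidence of an association rule X => Y
# over a toy transaction database.

transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"printer", "software"},
    {"computer", "software", "printer"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    # P(Y | X) = support(X ∪ Y) / support(X)
    return support(x | y) / support(x)

X, Y = {"computer"}, {"software"}
print(support(X | Y))     # 0.5  -> support(X => Y) = P(X ∪ Y)
print(confidence(X, Y))   # 0.666... -> P(Y | X)
```

Here "computer" and "software" co-occur in 2 of 4 transactions (support 50%), and of the 3 transactions containing "computer", 2 also contain "software" (confidence ≈ 67%).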

4. TECHNOLOGIES USED

As a highly application-driven domain, data mining has incorporated many techniques from other domains such as statistics, machine learning, pattern recognition, database and data warehouse systems, information retrieval, visualization, algorithms, high-performance computing, and many application domains.

i. Statistics

Statistics studies the collection, analysis, interpretation or explanation, and presentation of data. Data mining has an inherent connection with statistics. A statistical model is a set of mathematical functions that describe the behavior of the objects in a target class in terms of random variables and their associated probability distributions.

ii. Machine Learning
Machine learning investigates how computers can learn (or improve their performance) based
on data. A main research area is for computer programs to automatically learn to recognize
complex patterns and make intelligent decisions based on data.

Supervised learning is basically a synonym for classification. The supervision in the learning
comes from the labeled examples in the training data set.
Unsupervised learning is essentially a synonym for clustering. The learning process is
unsupervised since the input examples are not class labeled. Typically, we may use clustering
to discover classes within the data.
Semi-supervised learning is a class of machine learning techniques that make use of both
labeled and unlabeled examples when learning a model. In one approach, labeled examples
are used to learn class models and unlabeled examples are used to refine the boundaries
between classes. For a two-class problem, we can think of the set of examples belonging to
one class as the positive examples and those belonging to the other class as the negative
examples.
Active learning is a machine learning approach that lets users play an active role in the learning process. An active learning approach can ask a user (e.g., a domain expert) to label an example, which may be from a set of unlabeled examples or synthesized by the learning program. The goal is to optimize the model quality by actively acquiring knowledge from human users, given a constraint on how many examples they can be asked to label.

iii. Database Systems and Data Warehouses

Database systems research focuses on the creation, maintenance, and use of databases for
organizations and end-users. Particularly, database systems researchers have established highly
recognized principles in data models, query languages, query processing and optimization
methods, data storage, and indexing and accessing methods. Database systems are often well
known for their high scalability in processing very large, relatively structured data sets.

iv. Information Retrieval

Information retrieval (IR) is the science of searching for documents or for information in documents. Documents can be text or multimedia, and may reside on the Web. The differences between traditional information retrieval and database systems are twofold: information retrieval assumes that (1) the data under search are unstructured, and (2) the queries are formed mainly by keywords, which do not have complex structures (unlike SQL queries in database systems).

5. DATA MINING APPLICATIONS
Data mining is widely used in diverse areas. There are a number of commercial data
mining system available today and yet there are many challenges in this field.
Data Mining Applications :
i. Financial Data Analysis :
 The financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining.
 Design and construction of data warehouses for multidimensional data analysis and
data mining.
 Loan payment prediction and customer credit policy analysis.
 Classification and clustering of customers for targeted marketing.
ii. Retail Industry :
Data mining has great application in the retail industry because the industry collects large amounts of data on sales, customer purchasing history, goods transportation, consumption, and services.
 Design and Construction of data warehouses based on the benefits of data mining.
 Multidimensional analysis of sales, customers, products, time and region.
 Customer Retention.
 Product recommendation and cross-referencing of items.
iii. Telecommunication Industry :
The telecommunication industry is one of the fastest-growing industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, and web data transmission.
 Multidimensional Analysis of Telecommunication data.
 Identification of unusual patterns.
 Multidimensional association and sequential patterns analysis.
 Mobile Telecommunication services.
iv. Biological Data Analysis :
Biological data mining is a very important part of Bioinformatics.
 Semantic integration of heterogeneous databases.
 Alignment, indexing, similarity search and comparative analysis.
 Discovery of structural patterns and analysis of genetic networks.
 Association and path analysis.
 Visualization tools in genetic data analysis.

v. Other Scientific Applications :
Following are the applications of data mining in the field of Scientific Applications −
 Data Warehouses and Data Preprocessing.
 Graph-based mining.
 Visualization and domain specific knowledge.
vi. Intrusion Detection :
Intrusion refers to any kind of action that threatens integrity, confidentiality, or the availability
of network resources.
 Development of data mining algorithm for intrusion detection.
 Association and correlation analysis, aggregation to help select and build
discriminating attributes.
 Analysis of Stream data.
 Distributed data mining.
 Visualization and query tools.

6. MAJOR ISSUES IN DATA MINING
Data mining is not an easy task, as the algorithms used can get very complex and data
is not always available at one place. It needs to be integrated from various
heterogeneous data sources.
These factors also create some issues. The major issues are −
i. Mining Methodology and User Interaction.
ii. Performance Issues.
iii. Diverse Data Types Issues.

i. Mining Methodology and User Interaction.

It refers to the following kinds of issues −

 Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge.
 Interactive mining of knowledge at multiple levels of abstraction − The data mining
process needs to be interactive because it allows users to focus the search for patterns.
 Incorporation of background knowledge − Background knowledge may be used to express
the discovered patterns not only in concise terms but at multiple levels of abstraction.
 Data mining query languages and ad hoc data mining − Data Mining Query language that
allows the user to describe ad hoc mining tasks, should be integrated with a data
warehouse query language.
 Presentation and visualization of data mining results − Once the patterns are discovered
it needs to be expressed in high level languages, and visual representations.
 Handling noisy or incomplete data − The data cleaning methods are required to handle
the noise and incomplete objects while mining the data regularities.
 Pattern evaluation − The patterns discovered may be uninteresting because they represent common knowledge or lack novelty; interestingness measures are needed to evaluate them.

ii. Performance Issues


 Efficiency and scalability of data mining algorithms − In order to effectively extract the
information from huge amount of data in databases, data mining algorithm must be
efficient and scalable.
 Parallel, distributed, and incremental mining algorithms − The factors such as huge size of
databases, wide distribution of data, and complexity of data mining methods motivate
the development of parallel and distributed data mining algorithms.

iii. Diverse Data Types Issues :


 Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
 Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured.

7. DATA OBJECTS AND ATTRIBUTE TYPES

 Data sets are made up of data objects.
 A data object represents an entity.
o Also called sample, example, instance, data point, object, tuple.
 Data objects are described by attributes.
 An attribute is a property or characteristic of a data object.
o Examples: eye color of a person, temperature, etc.
o An attribute is also known as a variable, field, characteristic, or feature.
 A collection of attributes describes an object.
 Attribute values are numbers or symbols assigned to an attribute.
Attributes
An attribute (or dimension, feature, variable) is a data field representing a characteristic or feature of a data object.
– E.g., customer_ID, name, address
• Attribute values are numbers or symbols assigned to an attribute.
• Distinction between attributes and attribute values:
– The same attribute can be mapped to different attribute values.
• Example: height can be measured in feet or meters.
– Different attributes can be mapped to the same set of values.
• Example: attribute values for ID and age are both integers, but the properties of the values differ; ID has no limit, while age has a maximum and minimum value.
There are four main types of attributes. They are

i. Nominal
ii. Ordinal
iii. Interval
iv. Ratio.

i. Nominal Attribute
 The values of a nominal attribute are symbols or names of things.
– Each value represents some kind of category, code, or state,
 Nominal attributes are also referred to as categorical attributes.
 The values of nominal attributes do not have any meaningful order.

 Example: The attribute marital_status can take on the values single, married, divorced, and widowed.
 Because nominal attribute values have no meaningful order and are not quantitative:
o It makes no sense to find the mean (average) value or median (middle) value for such an attribute.
o However, we can find the attribute’s most commonly occurring value (mode).
A binary attribute is a special nominal attribute with only two states: 0 or 1.
• A binary attribute is symmetric if both of its states are equally valuable and carry the same weight. Example: the attribute gender having the states male and female.
• A binary attribute is asymmetric if the outcomes of the states are not equally important. Example: positive and negative outcomes of a medical test for HIV.
ii. Ordinal Attribute
 An ordinal attribute is an attribute with possible values that have a meaningful order
or ranking among them, but the magnitude between successive values is not known.
 Example: An ordinal attribute drink_size corresponds to the size of drinks available at
a fast-food restaurant.
o This attribute has three possible values: small, medium, and large.
o The values have a meaningful sequence (which corresponds to increasing drink size); however, we cannot tell from the values how much bigger, say, a large is than a medium.
 The central tendency of an ordinal attribute can be represented by its mode and its
median (middle value in an ordered sequence), but the mean cannot be defined
iii. Interval Attribute
Interval attributes are measured on a scale of equal-size units. We can compare and quantify the difference between values of interval attributes.
 Example: A temperature attribute is an interval attribute.
– We can quantify the difference between values. For example, a temperature of 20°C is five degrees higher than a temperature of 15°C.
– Temperatures in Celsius do not have a true zero-point; that is, 0°C does not indicate “no temperature.”
o Although we can compute the difference between temperature values, we cannot talk of one temperature value as being a multiple of another.
 Without a true zero, we cannot say, for instance, that 10°C is twice as warm as 5°C. That is, we cannot speak of the values in terms of ratios.
 The central tendency of an interval attribute can be represented by its mode, its median (middle value in an ordered sequence), and its mean.

iv. Ratio Attribute


A ratio attribute is a numeric attribute with an inherent zero-point.
 Example: A number_of_words attribute is a ratio attribute.
o If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value.
 The central tendency of a ratio attribute can be represented by its mode, its median (middle value in an ordered sequence), and its mean.
The type of an attribute depends on which of the following properties it possesses:
– Distinctness: = ≠
– Order: < >
– Addition: + −
– Multiplication: × ÷
• Nominal attribute: distinctness
• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & addition
• Ratio attribute: all 4 properties
Nominal and ordinal attributes are collectively referred to as categorical or qualitative attributes.
– Qualitative attributes, such as employee ID, lack most of the properties of numbers.
– Even if they are represented by numbers (e.g., integers), they should be treated more like symbols.
– The mean of the values does not have any meaning.
Interval and ratio attributes are collectively referred to as quantitative or numeric attributes.
– Quantitative attributes are represented by numbers and have most of the properties of numbers.
– Note that quantitative attributes can be integer-valued or continuous.
– Numeric operations such as mean and standard deviation are meaningful.

8. BASIC STATISTICAL DESCRIPTIONS OF DATA
 Basic statistical descriptions can be used to identify properties of the data and
highlight which data values should be treated as noise or outliers.
 For data preprocessing tasks, we want to learn about data characteristics regarding
both central tendency and dispersion of the data.
 Measures of central tendency include mean, median, mode, and midrange.
 Measures of data dispersion include quartiles, interquartile range (IQR), and
variance.
 These descriptive statistics are of great help in understanding the distribution of the
data.

8.1 Measuring Central Tendency

i. Measuring Central Tendency : Mean


The most common and most effective numerical measure of the “center” of a set of data is the arithmetic mean.

 Although the mean is the single most useful quantity for describing a data set, it is not always the best way of measuring the center of the data.
o A major problem with the mean is its sensitivity to extreme (outlier) values.
o Even a small number of extreme values can corrupt the mean.
 To offset the effect caused by a small number of extreme values, we can instead use the trimmed mean.
o A trimmed mean is obtained after chopping off values at the high and low extremes.
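A minimal trimmed-mean sketch: sort the data and drop a fraction of the values at each extreme before averaging. The trim fraction of 20% and the data values are chosen for illustration.

```python
# A minimal trimmed mean: chop off a fraction of the sorted
# values at each end, then average what remains.

def trimmed_mean(data, trim=0.2):
    data = sorted(data)
    k = int(len(data) * trim)              # values to drop per end
    kept = data[k:len(data) - k] if k else data
    return sum(kept) / len(kept)

data = [2, 4, 4, 6, 8, 24]                 # 24 is an extreme value
print(sum(data) / len(data))               # 8.0  (ordinary mean, pulled up by 24)
print(trimmed_mean(data, trim=0.2))        # 5.5  (mean of [4, 4, 6, 8])
```

Dropping the single smallest and largest values here moves the estimate of the center from 8.0 down to 5.5, illustrating how trimming blunts the effect of the outlier 24.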

ii. Measuring the central tendency : Median


 Another measure of the center of data is the median.
 Suppose that a given data set of N distinct values is sorted in numerical order.
o If N is odd, the median is the middle value of the ordered set;
o If N is even, the median is the average of the middle two values.
 In probability and statistics, the median generally applies to numeric data; however,
we may extend the concept to ordinal data.
o Suppose that a given data set of N values for an attribute X is sorted in increasing order.
o If N is odd, then the median is the middle value of the ordered set.
o If N is even, then the median may not be unique: it is the two middlemost values and any value in between; by convention, we take their average.

iii. Measuring the central tendency : Mode


 Another measure of central tendency is the mode.
 The mode for a set of data is the value that occurs most frequently in the set.
o It is possible for the greatest frequency to correspond to several different
values, which results in more than one mode.
o Data sets with one, two, or three modes: called unimodal, bimodal, and
trimodal.
o At the other extreme, if each data value occurs only once, then there is no
mode.
 Central tendency measures for numeric attributes: mean, median, mode.
 Central tendency measures for categorical attributes:
o Nominal attributes: mode.
o Ordinal attributes: mode, median.

Example: What are the central tendency measures (mean, median, mode) for the following attributes?

attr1 = {2, 4, 4, 6, 8, 24}
mean = (2+4+4+6+8+24)/6 = 8 (average of all values)
median = (4+6)/2 = 5 (average of the two middle values)
mode = 4 (most frequent value)

attr2 = {2, 4, 7, 10, 12}
mean = (2+4+7+10+12)/5 = 7 (average of all values)
median = 7 (middle value)
mode = none (every value occurs with the same frequency)
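The worked example above can be checked with Python's standard `statistics` module:

```python
# Checking the central-tendency example with the statistics module.
import statistics

attr1 = [2, 4, 4, 6, 8, 24]
print(statistics.mean(attr1))       # 8
print(statistics.median(attr1))     # 5.0  (average of middle values 4 and 6)
print(statistics.mode(attr1))       # 4

attr2 = [2, 4, 7, 10, 12]
print(statistics.mean(attr2))       # 7
print(statistics.median(attr2))     # 7
# Every value in attr2 occurs once, so there is no meaningful mode;
# statistics.multimode returns all values tied for the top frequency.
print(statistics.multimode(attr2))  # [2, 4, 7, 10, 12]
```

Note that `statistics.mode` would arbitrarily return the first value for attr2, which is why `multimode` is the safer check when no value dominates.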

8.2 Measuring the Dispersion of the data


• The degree to which numerical data tend to spread is called the dispersion, or variance, of the data.
The most common measures of data dispersion:
• Range: difference between the largest and smallest values.
• Interquartile range (IQR): the range of the middle 50% of the data.
– Quartiles: Q1 (25th percentile), Q3 (75th percentile)
– IQR = Q3 − Q1
– Five-number summary: Minimum, Q1, Median, Q3, Maximum
• Variance and standard deviation (sample: s, population: σ). The variance of N observations x1, x2, ..., xN with mean μ is
σ² = (1/N) Σᵢ (xᵢ − μ)²,
and the standard deviation σ is the square root of the variance.
i. Quartiles
 Suppose that set of observations for numeric attribute X is sorted in increasing order.
 Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal size consecutive sets.
 The 100-quantiles are more commonly referred to as percentiles; they divide the data
distribution into 100 equal-sized consecutive sets.
 Quartiles: The 4-quantiles are the three data points that split the data distribution into four equal parts; each part represents one-fourth of the data distribution.

ii. Outliers
Outliers can be identified by the help of interquartile range or standard deviation measures.
– Suspected outliers are values falling at least 1.5xIQR above the third quartile
or below the first quartile.
– Suspected outliers are values that fall outside of the range of μ–Nσ and μ+Nσ
where μ is mean and σ is standard deviation. N can be chosen as 2.5.

The normal distribution curve: (μ: mean, σ: standard deviation)
– From μ–σ to μ+σ: contains about 68% of the measurements
– From μ–2σ to μ+2σ: contains about 95% of it
– From μ–3σ to μ+3σ: contains about 99.7% of it
iii. Boxplot Analysis
• Five-number summary of a distribution:
Minimum, Q1, Median, Q3, Maximum
• Boxplots are a popular way of visualizing a distribution and a boxplot incorporates
five-number summary:
– The ends of the box are at the quartiles Q1 and Q3, so that the box length is
the interquartile range, IQR.
– The median is marked by a line within the box.
– Two lines outside the box extend to the smallest and largest observations
(outliers are excluded). Outliers are marked separately.
• If there are no outliers, the lower whisker extends to the smallest observation
(Minimum) and the upper whisker to the largest observation (Maximum).

Example: Consider the following two attribute values:

attr1: {2,3,4,5,6,7,8,9} attr2: {1,5,9,10,11,12,18,30}
Which attribute has the biggest standard deviation? (Do not compute the standard deviations.)
attr2, since its values are more spread out around the mean.
Give the interquartile ranges of the attribute values.
attr1: Q1 = (3+4)/2 = 3.5, Q3 = (7+8)/2 = 7.5, IQR = 7.5 − 3.5 = 4
attr2: Q1 = (5+9)/2 = 7, Q3 = (12+18)/2 = 15, IQR = 15 − 7 = 8
Are there any outliers (w.r.t. the IQR) in these datasets?
Yes: 30 in attr2, since 30 > Q3 + 1.5×IQR = 15 + 1.5×8 = 27.
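The quartile and outlier computations above can be reproduced with a short sketch. Note that these notes compute quartiles as medians of the lower and upper halves (Tukey's hinges); `quartiles` and `iqr_outliers` below are illustrative helper names, not standard library functions:

```python
from statistics import median

def quartiles(data):
    # Q1 = median of the lower half, Q3 = median of the upper half,
    # matching the hand computation above (Tukey's hinges).
    s = sorted(data)
    half = len(s) // 2
    return median(s[:half]), median(s[len(s) - half:])

def iqr_outliers(data):
    # Flag values more than 1.5 * IQR below Q1 or above Q3.
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    return [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

attr2 = [1, 5, 9, 10, 11, 12, 18, 30]
print(quartiles(attr2))     # Q1 = 7, Q3 = 15
print(iqr_outliers(attr2))  # [30], since 30 > 15 + 1.5 * 8 = 27
```

Other quartile conventions (e.g., interpolated percentiles) can give slightly different Q1/Q3 values on small datasets.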
iv. Variance and Distribution
Variance and standard deviation are measures of data dispersion. They indicate how
spread out a data distribution is. A low standard deviation means that the data observations
tend to be very close to the mean, while a high standard deviation indicates that the data are
spread out over a large range of values.
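As a concrete check, the statistics module provides both population versions (pvariance, pstdev; divide by N) and sample versions (variance, stdev; divide by n − 1); the data list below is just an illustrative example with mean 5:

```python
from statistics import pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean = 5, sum of squared deviations = 32
print(pvariance(data))  # population variance: 32 / 8 = 4
print(pstdev(data))     # population standard deviation: 2.0
print(variance(data))   # sample variance: 32 / 7, about 4.57
print(stdev(data))      # sample standard deviation, about 2.14
```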

8.3 Graphic Display of Basic Statistical Description of the Data
These include quantile plots, quantile–quantile plots, histograms, and scatter plots. Such
graphs are helpful for the visual inspection of data, which is useful for data preprocessing. The
first three of these show univariate distributions (i.e., data for one attribute), while scatter plots
show bivariate distributions (i.e., involving two attributes).
i. Quantile Plots
Displays all of the data (allowing the user to assess both the overall behavior and
unusual occurrences)
• Plots quantile information
– For data xi sorted in increasing order, fi indicates that approximately
100·fi% of the data are below or equal to the value xi
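One common convention sets fi = (i − 0.5)/N for the i-th smallest value; the helper below is an illustrative sketch (the name `quantile_plot_points` is not a library function) of computing the (fi, xi) pairs a quantile plot displays:

```python
def quantile_plot_points(data):
    # Pair each sorted value x_i with f_i = (i - 0.5) / N, so that roughly
    # 100 * f_i percent of the data lie at or below x_i.
    s = sorted(data)
    n = len(s)
    return [((i - 0.5) / n, x) for i, x in enumerate(s, start=1)]

print(quantile_plot_points([4, 1, 3, 2]))
# [(0.125, 1), (0.375, 2), (0.625, 3), (0.875, 4)]
```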

ii. Quantile-Quantile (Q-Q) Plot


Graphs the quantiles of one univariate distribution against the corresponding quantiles
of another. View: is there a shift in going from one distribution to the other?
The example shows the unit price of items sold at Branch 1 vs. Branch 2 for each
quantile: unit prices of items sold at Branch 1 tend to be lower than those at Branch 2.

A straight line represents the case where, for each given quantile, the unit price at
each branch is the same.
iii. Scatter Plot
A scatter plot is one of the most effective graphical methods for determining if there
appears to be a relationship, pattern, or trend between two numeric attributes.
– To construct a scatter plot, each pair of values is treated as a pair of coordinates in an
algebraic sense and plotted as points in the plane.
• The scatter plot is a useful method for providing a first look at bivariate data to see
clusters of points and outliers, or to explore the possibility of correlation relationships.
• Two attributes, X and Y, are correlated if one attribute implies the other.
• Correlations can be positive, negative, or null (uncorrelated).

iv. Histograms
• A histogram represents the frequencies of values of a variable bucketed into ranges.
• A histogram is similar to a bar chart, but the difference is that it groups the
values into continuous ranges.
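Without plotting, the bucketing step a histogram performs can be sketched as follows (`histogram_counts` and `bin_width` are illustrative names; equal-width ranges are assumed):

```python
from collections import Counter

def histogram_counts(values, bin_width):
    # Map each value to the lower edge of its range [k*w, (k+1)*w)
    # and count occurrences per range -- the frequencies a histogram shows.
    return Counter((v // bin_width) * bin_width for v in values)

print(histogram_counts([1, 2, 3, 11, 12, 25], 10))
# Counter({0: 3, 10: 2, 20: 1})
```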

9. MEASURING DATA SIMILARITY AND DISSIMILARITY.
Similarity
• The similarity between two objects is a numerical measure of the degree to which
the two objects are alike.
• Similarities are higher for pairs of objects that are more alike.
• Similarities are usually non-negative and are often between 0 (no similarity) and 1
(complete similarity).
Dissimilarity
• The dissimilarity between two objects is a numerical measure of the degree to which
the two objects are different.
• Dissimilarities are lower for more similar pairs of objects.
• The term distance is used as a synonym for dissimilarity, although distance
is often used to refer to a special class of dissimilarities.
• Dissimilarities sometimes fall in the interval [0,1], but it is also common for them to
range from 0 to ∞.
Proximity refers to either a similarity or a dissimilarity.

i. Similarity / Dissimilarity for Simple Attributes


• The proximity of objects with a number of attributes is typically defined by
combining the proximities of individual attributes.
• Consider objects described by one nominal attribute.
– What would it mean for two such objects to be similar?

• p and q are the attribute values for two data objects

ii. Dissimilarities between Data Objects - Distance on Numeric Data: Euclidean Distance

d(x, y) = sqrt( (x1 − y1)² + (x2 − y2)² + … + (xn − yn)² )

where n is the number of attributes and xk and yk are the kth attributes of data objects x and y.
• Normally the attributes are numeric.
• Standardization is necessary if the scales of the attributes differ.
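The Euclidean distance can be sketched directly from its definition (this assumes x and y have the same number of attributes):

```python
from math import sqrt

def euclidean(x, y):
    # d(x, y) = sqrt(sum over k of (x_k - y_k)^2)
    return sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

print(euclidean([1, 2], [4, 6]))  # 5.0 (a 3-4-5 right triangle)
```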

iii. Distance on Numeric Data: Minkowski Distance

d(x, y) = ( |x1 − y1|^p + |x2 − y2|^p + … + |xn − yn|^p )^(1/p), where p ≥ 1

• p = 1 gives the Manhattan (city-block) distance.
• p = 2 gives the Euclidean distance.
• As p → ∞, it approaches the supremum (Chebyshev) distance, the maximum of |xk − yk| over k.
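The Minkowski distance generalizes both Manhattan (p = 1) and Euclidean (p = 2) distance; a minimal sketch:

```python
def minkowski(x, y, p):
    # d(x, y) = (sum over k of |x_k - y_k|^p)^(1/p), for p >= 1
    return sum(abs(xk - yk) ** p for xk, yk in zip(x, y)) ** (1 / p)

print(minkowski([1, 2], [4, 6], 1))  # 7.0 -> Manhattan distance
print(minkowski([1, 2], [4, 6], 2))  # 5.0 -> Euclidean distance
```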
iv. Cosine Similarity

cos(x, y) = (x · y) / (‖x‖ ‖y‖)

where x · y is the dot product of the two vectors and ‖x‖ is the Euclidean norm (length)
of x. Cosine similarity measures the angle between the vectors, ignoring their magnitudes,
and is widely used to compare documents represented as term-frequency vectors.
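A minimal sketch of the cosine similarity computation (the function name `cosine_similarity` is illustrative):

```python
from math import sqrt

def cosine_similarity(x, y):
    # cos(x, y) = dot(x, y) / (||x|| * ||y||)
    dot = sum(xk * yk for xk, yk in zip(x, y))
    norm_x = sqrt(sum(xk ** 2 for xk in x))
    norm_y = sqrt(sum(yk ** 2 for yk in y))
    return dot / (norm_x * norm_y)

print(cosine_similarity([1, 0], [0, 1]))  # 0.0: orthogonal vectors
print(cosine_similarity([1, 2], [2, 4]))  # ~1.0: vectors in the same direction
```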
