Data Mining
The motivation for data mining is to analyze, classify, cluster, and characterize data.
Object-Relational DBMS
Conceptually, the object-relational data model inherits the essential concepts of object-oriented databases, where, in general terms, each entity is considered as an object. Data and code relating to an object are encapsulated into a single unit. Each object has associated with it a set of variables that describe the object, a set of messages through which it communicates, and a set of methods that implement those messages.
Spatial databases contain spatial-related information. Examples include geographic (map) databases, very large-scale integration (VLSI) or computer-aided design databases, and medical and satellite image databases.
Text databases are databases that contain word descriptions for objects. These word descriptions are usually not simple keywords but rather long sentences or paragraphs, such as product specifications, error or bug reports, warning messages, summary reports, notes, or other documents.
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
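As a concrete illustration, here is a minimal Python sketch of the example above, computing both smoothing variants for the same price data:

# Equal-frequency binning with smoothing by bin means and by bin
# boundaries, reproducing the price example above.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: replace each value with its bin's mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value with the closer of
# the bin's minimum and maximum.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]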
Detecting that a particular data set requires cleaning is called discrepancy detection, and it can be done using knowledge about the data together with metadata.
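As a small illustration, with hypothetical attributes and valid ranges standing in for real metadata, discrepancy detection can be as simple as checking each value against its attribute's known domain:

# Metadata here is an assumed valid range per attribute; values
# falling outside their range are flagged as discrepancies.
metadata = {"age": (0, 120), "price": (0.0, 10_000.0)}

records = [{"age": 34, "price": 24.0},
           {"age": -1, "price": 24.0},      # out-of-range age
           {"age": 35, "price": 99_999.0}]  # out-of-range price

for i, rec in enumerate(records):
    for attr, (lo, hi) in metadata.items():
        if not lo <= rec[attr] <= hi:
            print(f"record {i}: {attr}={rec[attr]} outside [{lo}, {hi}]")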
The data to be mined is generally very large, so it is reduced in the following ways:
Data cube aggregation
Attribute subset selection
Dimensionality reduction
Numerosity reduction
Discretization and concept hierarchy generation
By using the data cube, quarterly sales can be aggregated into yearly sales; the large volume of quarterly data is thus reduced to yearly data.
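A minimal sketch of this idea, assuming the quarterly sales live in a pandas DataFrame with made-up figures:

import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2023, 2024, 2024, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 300, 412, 390, 601],
})

# Aggregating away the quarter dimension reduces 8 rows to 2.
yearly = sales.groupby("year", as_index=False)["amount"].sum()
print(yearly)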
Wavelet Transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X′, of wavelet coefficients. The two vectors are of the same length. When applying this technique to data reduction, we consider each tuple as an n-dimensional data vector, that is, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes.
How can this technique be useful for data reduction if the wavelet
transformed data are of the same length as the original data? The
usefulness lies in the fact that the wavelet transformed data can be
truncated. A compressed approximation of the data can be retained by
storing only a small fraction of the strongest of the wavelet coefficients.
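A minimal sketch of this idea, using a single level of the Haar wavelet (one of the simplest DWTs) in NumPy and keeping only the k strongest coefficients:

import numpy as np

x = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])

# One Haar level: pairwise averages (approximation coefficients) and
# pairwise differences (detail coefficients), scaled by 1/sqrt(2).
approx = (x[0::2] + x[1::2]) / np.sqrt(2)
detail = (x[0::2] - x[1::2]) / np.sqrt(2)
coeffs = np.concatenate([approx, detail])

# Keep only the k strongest coefficients; zero out the rest.
k = 4
keep = np.argsort(np.abs(coeffs))[-k:]   # indices of the k largest
truncated = np.zeros_like(coeffs)
truncated[keep] = coeffs[keep]
print(truncated)  # a compressed approximation of x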
Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. This can be done in the following ways:
Binning - The value range is divided into a number of bins (intervals), and each value is replaced by a label for its bin.
Histogram Analysis - A histogram partitions the values of the attribute into disjoint buckets (ranges).
Entropy-Based Discretization - The method selects the value of A that minimizes the expected entropy as a split-point, and recursively partitions the resulting intervals to arrive at a hierarchical discretization (see the sketch after this list).
Interval Merging by χ² Analysis - In contrast with the top-down splitting above, ChiMerge employs a bottom-up approach: it finds the best neighboring intervals and then merges these to form larger intervals, recursively.
Cluster Analysis - A clustering algorithm can be applied to discretize a
numerical attribute, A, by partitioning the values of A into clusters
or groups.
Discretization by Intuitive Partitioning - For example, annual salaries broken into ranges like ($50,000, $60,000] are often more desirable than ranges like ($51,263.98, $60,872.34] obtained by, say, some sophisticated clustering analysis.
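The entropy-based method above can be sketched as follows; the attribute values, class labels, and data are illustrative assumptions, not an example from the text:

from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(v) / n) * log2(labels.count(v) / n)
                for v in set(labels))

values = [22, 25, 31, 35, 40, 46, 52, 60]   # values of attribute A, sorted
labels = ["no", "no", "no", "yes", "yes", "yes", "yes", "no"]  # class per tuple

n = len(values)
# Try each boundary; keep the split minimizing the weighted entropy
# of the two resulting intervals.
best = min(range(1, n),
           key=lambda i: (i * entropy(labels[:i])
                          + (n - i) * entropy(labels[i:])) / n)
print("best split-point: A <", values[best])   # -> A < 35

Each resulting interval would then be partitioned recursively in the same way, yielding the hierarchical discretization.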
Star schema: The most common modeling paradigm is the star schema, in
which the data warehouse contains (1) a large central table (fact table)
containing the bulk of the data, with no redundancy, and (2) a set of
smaller attendant tables (dimension tables), one for each dimension. The
schema graph resembles a starburst, with the dimension tables displayed
in a radial pattern around the central fact table.
Snowflake schema - A variant of the star schema in which some dimension tables are normalized, splitting the data into additional tables. For example, the item dimension table now contains the attributes item key, item name, brand, type, and supplier key, where supplier key is linked to the supplier dimension table, containing supplier key and supplier type information. Similarly, the single dimension table for location in the star schema can be normalized into two new tables: location and city.
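A minimal sketch of the star-schema idea using pandas, with hypothetical keys and columns: one large fact table references small dimension tables, which are joined in at query time.

import pandas as pd

item = pd.DataFrame({"item_key": [1, 2],
                     "item_name": ["bread", "milk"],
                     "brand": ["B1", "B2"]})
location = pd.DataFrame({"location_key": [10, 11],
                         "city": ["Vancouver", "Chicago"]})
sales_fact = pd.DataFrame({"item_key": [1, 1, 2],
                           "location_key": [10, 11, 10],
                           "dollars_sold": [120.0, 80.0, 65.0]})

# Star join: the fact table radiates out to its dimension tables.
result = (sales_fact
          .merge(item, on="item_key")
          .merge(location, on="location_key"))
print(result[["item_name", "city", "dollars_sold"]])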
Roll-up: The roll-up operation (also called the drill-up operation by some vendors) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. For example, a roll-up can climb a location hierarchy defined as the total order street < city < province or state < country.
Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions. For example, a drill-down can be performed on the central cube by stepping down a concept hierarchy for time defined as day < month < quarter < year.
Slice and dice: The slice operation performs a selection on one
dimension of the given cube, resulting in a subcube. The dice operation
defines a subcube by performing a selection on two or more dimensions.
Pivot (rotate): Pivot (also called rotate) is a visualization operation that
rotates the data axes in view in order to provide an alternative
presentation of the data.
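The following small pandas illustration, with made-up sales data, mimics roll-up, slice, and pivot as reshaping operations:

import pandas as pd

cube = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2"],
    "location": ["Vancouver", "Chicago", "Vancouver", "Chicago"],
    "item":     ["bread", "bread", "milk", "milk"],
    "sales":    [605, 825, 680, 952],
})

# Roll-up: aggregate away the item dimension.
rollup = cube.groupby(["time", "location"])["sales"].sum()

# Slice: select a single value on one dimension -> a subcube.
slice_q1 = cube[cube["time"] == "Q1"]

# Pivot: rotate the axes for an alternative presentation.
pivoted = cube.pivot_table(index="location", columns="time",
                           values="sales", aggfunc="sum")
print(rollup, slice_q1, pivoted, sep="\n\n")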
The above table can be represented as a bar chart and a pie chart.
[Figure: bar chart and pie chart of the transaction data, showing counts for the individual items (beer, bread, cola, diapers, milk, eggs) and for two-item sets such as {beer, bread}, {beer, diapers}, {beer, milk}, {bread, diapers}, {bread, milk}, and {diapers, milk}.]
Eg:
Equal-frequency binning: the quantitative attribute (e.g., age) is partitioned into intervals holding approximately the same number of tuples, yielding rule predicates of the form age(X, "interval").
Clustering-based binning:
Consider the following tuples; we can group/cluster them as below.
age(X, "34") ∧ income(X, "31K...40K") ⇒ buys(X, "HDTV")
age(X, "35") ∧ income(X, "31K...40K") ⇒ buys(X, "HDTV")
age(X, "34") ∧ income(X, "41K...50K") ⇒ buys(X, "HDTV")
age(X, "35") ∧ income(X, "41K...50K") ⇒ buys(X, "HDTV")
Then we can form a 2-D grid and cluster them to get the HDTV purchase zone.
lift(game, video) = P({game, video}) / (P({game}) × P({video}))
A lift value less than 1 indicates that the two itemsets are negatively correlated, a value greater than 1 indicates positive correlation, and a value equal to 1 indicates independence.
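A short worked example of this formula, using hypothetical transaction counts:

n_total = 10_000        # total transactions (assumed figures)
n_game = 6_000          # transactions containing a computer game
n_video = 7_500         # transactions containing a video
n_both = 4_000          # transactions containing both

lift = (n_both / n_total) / ((n_game / n_total) * (n_video / n_total))
print(round(lift, 2))   # 0.89 -> less than 1, a negative correlation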
A classification model can be represented as a set of IF-THEN rules, for example:
R1: IF age = youth AND student = yes THEN buys computer = yes
The IF-part (or left-hand side) of a rule is known as the rule antecedent or precondition. The THEN-part (or right-hand side) is the rule consequent.
R1 can also be written as
R1: (age = youth) ∧ (student = yes) ⇒ (buys computer = yes).
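As a minimal sketch, rule R1 can be rendered as an executable predicate; the dictionary-based tuple representation is an illustrative assumption:

def r1(tuple_):
    """IF age = youth AND student = yes THEN buys_computer = yes."""
    if tuple_["age"] == "youth" and tuple_["student"] == "yes":
        return "yes"
    return None  # rule does not fire (antecedent not satisfied)

print(r1({"age": "youth", "student": "yes"}))   # yes
print(r1({"age": "senior", "student": "yes"}))  # None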
Partitioning Methods
Given D, a data set of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster.
THE K-MEANS METHOD
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the objects for each cluster;
(5) until no change;
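A straightforward NumPy rendering of this pseudocode; a sketch that assumes no cluster ever becomes empty:

import numpy as np

def k_means(D, k, rng=np.random.default_rng(0)):
    # (1) arbitrarily choose k objects from D as the initial centers
    centers = D[rng.choice(len(D), size=k, replace=False)]
    assign = np.full(len(D), -1)
    while True:                                   # (2) repeat
        # (3) (re)assign each object to the most similar center
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):    # (5) until no change
            return centers, assign
        assign = new_assign
        # (4) update the cluster means (assumes no cluster is empty)
        centers = np.array([D[assign == j].mean(axis=0) for j in range(k)])

D = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5], [1.0, 0.5]])
centers, labels = k_means(D, k=2)
print(centers)
print(labels)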