Data Mining
The motivation for data mining is to analyze, classify, cluster, and characterize data.
Object-Relational DBMS
Conceptually, the object-relational data model inherits the essential concepts of object-oriented databases, where, in general terms, each entity is considered as an object. Data and code relating to an object are encapsulated into a single unit. Each object has associated with it a set of variables that describe the object, a set of messages through which it communicates, and a set of methods that implement those messages.
Spatial databases contain spatial-related information. Examples include geographic (map) databases, very large-scale integration (VLSI) or computer-aided design databases, and medical and satellite image databases.
Text databases are databases that contain word descriptions for objects. These word descriptions are usually not simple keywords but rather long sentences or paragraphs, such as product specifications, error or bug reports, warning messages, summary reports, notes, or other documents.
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
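As a concrete illustration, here is a minimal Python sketch of the example above, computing both smoothing variants for the same price data:

# Equal-frequency binning with smoothing by bin means and by bin
# boundaries, reproducing the price example above.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: replace each value with its bin's mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value with the closer of
# the bin's minimum and maximum.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]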
Detecting that a particular data set requires cleaning is called discrepancy detection, and it can be done using knowledge about the data together with metadata.
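As a small illustration, with hypothetical attributes and valid ranges standing in for real metadata, discrepancy detection can be as simple as checking each value against its attribute's known domain:

# Metadata here is an assumed valid range per attribute; values
# falling outside their range are flagged as discrepancies.
metadata = {"age": (0, 120), "price": (0.0, 10_000.0)}

records = [{"age": 34, "price": 24.0},
           {"age": -1, "price": 24.0},      # out-of-range age
           {"age": 35, "price": 99_999.0}]  # out-of-range price

for i, rec in enumerate(records):
    for attr, (lo, hi) in metadata.items():
        if not lo <= rec[attr] <= hi:
            print(f"record {i}: {attr}={rec[attr]} outside [{lo}, {hi}]")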
The data to be mined is generally very large, so it is reduced in the following ways:
Data cube aggregation
Attribute subset selection
Dimensionality reduction
Numerosity reduction
Discretization and concept hierarchy generation
By using the data cube, quarterly sales can be aggregated into yearly sales; the large volume of quarterly data is thus reduced to yearly data.
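A minimal sketch of this idea, assuming the quarterly sales live in a pandas DataFrame with made-up figures:

import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2023, 2024, 2024, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 300, 412, 390, 601],
})

# Aggregating away the quarter dimension reduces 8 rows to 2.
yearly = sales.groupby("year", as_index=False)["amount"].sum()
print(yearly)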
Wavelet Transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X′, of wavelet coefficients. The two vectors are of the same length. When applying this technique to data reduction, we consider each tuple as an n-dimensional data vector, that is, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes.
How can this technique be useful for data reduction if the wavelet
transformed data are of the same length as the original data? The
usefulness lies in the fact that the wavelet transformed data can be
truncated. A compressed approximation of the data can be retained by
storing only a small fraction of the strongest of the wavelet coefficients.
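A minimal sketch of this idea, using a single level of the Haar wavelet (one of the simplest DWTs) in NumPy and keeping only the k strongest coefficients:

import numpy as np

x = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])

# One Haar level: pairwise averages (approximation coefficients) and
# pairwise differences (detail coefficients), scaled by 1/sqrt(2).
approx = (x[0::2] + x[1::2]) / np.sqrt(2)
detail = (x[0::2] - x[1::2]) / np.sqrt(2)
coeffs = np.concatenate([approx, detail])

# Keep only the k strongest coefficients; zero out the rest.
k = 4
keep = np.argsort(np.abs(coeffs))[-k:]   # indices of the k largest
truncated = np.zeros_like(coeffs)
truncated[keep] = coeffs[keep]
print(truncated)  # a compressed approximation of x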
Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. This can be done in the following ways:
Binning - The value range is divided into a number of bins (intervals), and each value is replaced by a label for its bin.
Histogram Analysis - A histogram partitions the values of the attribute into disjoint buckets (ranges).
Entropy-Based Discretization - The method selects the value of A that minimizes the expected entropy as a split-point, and recursively partitions the resulting intervals to arrive at a hierarchical discretization (see the sketch after this list).
Interval Merging by χ² Analysis - In contrast with the top-down splitting above, ChiMerge employs a bottom-up approach: it finds the best neighboring intervals and then merges these to form larger intervals, recursively.
Cluster Analysis - A clustering algorithm can be applied to discretize a
numerical attribute, A, by partitioning the values of A into clusters
or groups.
Discretization by Intuitive Partitioning - For example, annual salaries broken into ranges like ($50,000, $60,000] are often more desirable than ranges like ($51,263.98, $60,872.34] obtained by, say, some sophisticated clustering analysis.
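The entropy-based method above can be sketched as follows; the attribute values, class labels, and data are illustrative assumptions, not an example from the text:

from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(v) / n) * log2(labels.count(v) / n)
                for v in set(labels))

values = [22, 25, 31, 35, 40, 46, 52, 60]   # values of attribute A, sorted
labels = ["no", "no", "no", "yes", "yes", "yes", "yes", "no"]  # class per tuple

n = len(values)
# Try each boundary; keep the split minimizing the weighted entropy
# of the two resulting intervals.
best = min(range(1, n),
           key=lambda i: (i * entropy(labels[:i])
                          + (n - i) * entropy(labels[i:])) / n)
print("best split-point: A <", values[best])   # -> A < 35

Each resulting interval would then be partitioned recursively in the same way, yielding the hierarchical discretization.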
Star schema: The most common modeling paradigm is the star schema, in
which the data warehouse contains (1) a large central table (fact table)
containing the bulk of the data, with no redundancy, and (2) a set of
smaller attendant tables (dimension tables), one for each dimension. The
schema graph resembles a starburst, with the dimension tables displayed
in a radial pattern around the central fact table.
Snowflake schema - A variant of the star schema in which some dimension tables are normalized, splitting the data into additional tables. For example, the item dimension table now contains the attributes item key, item name, brand, type, and supplier key, where supplier key is linked to the supplier dimension table, containing supplier key and supplier type information. Similarly, the single dimension table for location in the star schema can be normalized into two new tables: location and city.
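A minimal sketch of the star-schema idea using pandas, with hypothetical keys and columns: one large fact table references small dimension tables, which are joined in at query time.

import pandas as pd

item = pd.DataFrame({"item_key": [1, 2],
                     "item_name": ["bread", "milk"],
                     "brand": ["B1", "B2"]})
location = pd.DataFrame({"location_key": [10, 11],
                         "city": ["Vancouver", "Chicago"]})
sales_fact = pd.DataFrame({"item_key": [1, 1, 2],
                           "location_key": [10, 11, 10],
                           "dollars_sold": [120.0, 80.0, 65.0]})

# Star join: the fact table radiates out to its dimension tables.
result = (sales_fact
          .merge(item, on="item_key")
          .merge(location, on="location_key"))
print(result[["item_name", "city", "dollars_sold"]])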
Roll-up: The roll-up operation (also called the drill-up operation by some vendors) performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. For example, a roll-up can climb a location hierarchy defined as the total order street < city < province or state < country.
Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions. For example, a drill-down can be performed on the central cube by stepping down a concept hierarchy for time defined as day < month < quarter < year.
Slice and dice: The slice operation performs a selection on one
dimension of the given cube, resulting in a subcube. The dice operation
defines a subcube by performing a selection on two or more dimensions.
Pivot (rotate): Pivot (also called rotate) is a visualization operation that
rotates the data axes in view in order to provide an alternative
presentation of the data.
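The following small pandas illustration, with made-up sales data, mimics roll-up, slice, and pivot as reshaping operations:

import pandas as pd

cube = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2"],
    "location": ["Vancouver", "Chicago", "Vancouver", "Chicago"],
    "item":     ["bread", "bread", "milk", "milk"],
    "sales":    [605, 825, 680, 952],
})

# Roll-up: aggregate away the item dimension.
rollup = cube.groupby(["time", "location"])["sales"].sum()

# Slice: select a single value on one dimension -> a subcube.
slice_q1 = cube[cube["time"] == "Q1"]

# Pivot: rotate the axes for an alternative presentation.
pivoted = cube.pivot_table(index="location", columns="time",
                           values="sales", aggfunc="sum")
print(rollup, slice_q1, pivoted, sep="\n\n")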
The above table can be represented as a bar chart and a pie chart.
[Figure: bar chart and pie chart of the transaction data, showing counts for the individual items (beer, bread, cola, diapers, milk, eggs) and for two-item sets such as {beer, bread}, {beer, diapers}, {beer, milk}, {bread, diapers}, {bread, milk}, and {diapers, milk}.]
Eg:
Equal-frequency binning: the quantitative attribute (e.g., age) is partitioned into intervals holding approximately the same number of tuples, yielding rule predicates of the form age(X, "interval").
Clustering-based binning:
Consider the following tuples; we can group/cluster them as below.
age(X, "34") ∧ income(X, "31K...40K") ⇒ buys(X, "HDTV")
age(X, "35") ∧ income(X, "31K...40K") ⇒ buys(X, "HDTV")
age(X, "34") ∧ income(X, "41K...50K") ⇒ buys(X, "HDTV")
age(X, "35") ∧ income(X, "41K...50K") ⇒ buys(X, "HDTV")
Then we can form a 2-D grid and cluster them to get the HDTV purchase zone.
lift(game, video) = P({game, video}) / (P({game}) × P({video}))
A lift value less than 1 indicates that the two itemsets are negatively correlated, a value greater than 1 indicates positive correlation, and a value equal to 1 indicates independence.
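A short worked example of this formula, using hypothetical transaction counts:

n_total = 10_000        # total transactions (assumed figures)
n_game = 6_000          # transactions containing a computer game
n_video = 7_500         # transactions containing a video
n_both = 4_000          # transactions containing both

lift = (n_both / n_total) / ((n_game / n_total) * (n_video / n_total))
print(round(lift, 2))   # 0.89 -> less than 1, a negative correlation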
A classification model can be represented as a set of IF-THEN rules, for example:
R1: IF age = youth AND student = yes THEN buys computer = yes
The IF-part (or left-hand side) of a rule is known as the rule antecedent or precondition. The THEN-part (or right-hand side) is the rule consequent.
R1 can also be written as
R1: (age = youth) ∧ (student = yes) ⇒ (buys computer = yes).
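As a minimal sketch, rule R1 can be rendered as an executable predicate; the dictionary-based tuple representation is an illustrative assumption:

def r1(tuple_):
    """IF age = youth AND student = yes THEN buys_computer = yes."""
    if tuple_["age"] == "youth" and tuple_["student"] == "yes":
        return "yes"
    return None  # rule does not fire (antecedent not satisfied)

print(r1({"age": "youth", "student": "yes"}))   # yes
print(r1({"age": "senior", "student": "yes"}))  # None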
Partitioning Methods
Given D, a data set of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster.
THE K-MEANS METHOD
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the objects for each cluster;
(5) until no change;
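A straightforward NumPy rendering of this pseudocode; a sketch that assumes no cluster ever becomes empty:

import numpy as np

def k_means(D, k, rng=np.random.default_rng(0)):
    # (1) arbitrarily choose k objects from D as the initial centers
    centers = D[rng.choice(len(D), size=k, replace=False)]
    assign = np.full(len(D), -1)
    while True:                                   # (2) repeat
        # (3) (re)assign each object to the most similar center
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):    # (5) until no change
            return centers, assign
        assign = new_assign
        # (4) update the cluster means (assumes no cluster is empty)
        centers = np.array([D[assign == j].mean(axis=0) for j in range(k)])

D = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5], [1.0, 0.5]])
centers, labels = k_means(D, k=2)
print(centers)
print(labels)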