
Data Generalization


Data Generalization:

In general, data generalization summarizes data by replacing relatively low-level
values (e.g., numeric values for an attribute age) with higher-level concepts (e.g., young,
middle-aged, and senior), or by reducing the number of dimensions to summarize data
in a concept space involving fewer dimensions (e.g., removing birth date and telephone
number when summarizing the behavior of a group of students). Methods for efficient and
flexible generalization of large data sets can be categorized according to two approaches:

1. The data cube (OLAP) approach
2. The attribute-oriented induction (AOI) approach
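
The two kinds of generalization described in the opening paragraph can be sketched in a few lines. This is only an illustration: the age cut-offs (40 and 60), the `generalize_row` helper, and the sample student record are assumptions, since the text names the concepts but not their boundaries.

```python
# Assumed cut-offs: the source names the concepts but not where they begin.
AGE_BOUNDS = [(40, "young"), (60, "middle-aged")]

def generalize_age(age):
    """Replace a numeric age with a higher-level concept."""
    for upper, concept in AGE_BOUNDS:
        if age < upper:
            return concept
    return "senior"

def generalize_row(row, drop=("birth_date", "telephone")):
    """Drop attributes that do not help describe the group, then lift age."""
    out = {k: v for k, v in row.items() if k not in drop}
    out["age"] = generalize_age(out["age"])
    return out

# Hypothetical record, for illustration only.
student = {"major": "CS", "age": 21,
           "birth_date": "2003-05-01", "telephone": "555-0101"}
print(generalize_row(student))  # {'major': 'CS', 'age': 'young'}
```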

Data Cube Approach:
• In the data cube approach to data generalization, the data for analysis are
stored in a multidimensional database, or data cube.
• The data cube approach materializes the data cube by first identifying the
expensive computations required for frequently processed queries.
• These operations typically involve aggregate functions, such as count(), sum(),
average(), and max().
• These computations are performed, and their results are stored in the data
cube.
• These materialized views can then be used for decision support,
knowledge discovery, and many other applications.
• A set of attributes may form a hierarchy or a lattice structure, defining a
data cube dimension.
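
As a rough sketch of what materializing a data cube means, the following precomputes count, sum, and max for every combination of dimensions, so that later queries become lookups. The function names (`materialize_cube`, `average`) and the sales records are hypothetical, and a real OLAP system would normally materialize only selected combinations rather than all of them.

```python
from collections import defaultdict
from itertools import combinations

def materialize_cube(rows, dimensions, measure):
    """Precompute count, sum, and max for every subset of the dimensions."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            cells = defaultdict(lambda: {"count": 0, "sum": 0, "max": None})
            for row in rows:
                cell = cells[tuple(row[d] for d in dims)]
                value = row[measure]
                cell["count"] += 1
                cell["sum"] += value
                cell["max"] = value if cell["max"] is None else max(cell["max"], value)
            cube[dims] = dict(cells)
    return cube

def average(cell):
    """average() is derived from the stored sum and count."""
    return cell["sum"] / cell["count"]

# Hypothetical base data, for illustration only.
sales = [
    {"city": "Delhi", "year": 2023, "amount": 100},
    {"city": "Delhi", "year": 2024, "amount": 150},
    {"city": "Pune", "year": 2024, "amount": 50},
]
cube = materialize_cube(sales, ("city", "year"), "amount")

# Queries are now lookups rather than scans of the base data.
print(cube[("city",)][("Delhi",)])  # {'count': 2, 'sum': 250, 'max': 150}
print(cube[()][()]["sum"])          # 300
```

Storing sum and count rather than the average itself keeps each cell easy to combine with others when rolling up to a coarser level.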

Advantages:
• Since many aggregate functions need to be computed repeatedly in data
analysis, storing pre-computed results in a multidimensional data
cube can ensure fast response times.
• It offers flexible views of the data from different angles and at different levels
of abstraction.
• It is an efficient implementation of data generalization.

Disadvantages:
• It cannot answer some important questions that concept
description can, such as which dimensions should be used in the
description and what levels the generalization process should
reach.
• It lacks intelligent analysis.

The Attribute-Oriented Induction (AOI) Approach:

• The attribute-oriented induction (AOI) approach to data generalization and
summarization-based characterization was first proposed in 1989, a few years
prior to the introduction of the data cube approach.
• The data cube approach can be considered a data warehouse-based,
pre-computation-oriented, materialized-view approach. It performs offline
aggregation before an OLAP or data mining query is submitted for processing.
• On the other hand, the AOI approach, at least in its initial proposal, is a relational
database query-oriented, generalization-based, online data analysis technique.
• The general idea of attribute-oriented induction is to first collect the
task-relevant data using a relational database query and then perform generalization
based on an examination of the number of distinct values of each attribute in the
relevant set of data.

Generalization is performed by either:

I. Attribute removal
II. Attribute generalization (also known as concept hierarchy ascension)

Aggregation is performed by merging identical generalized tuples and
accumulating their respective counts. The resulting generalized relation can
be mapped into different forms for presentation to the user, such as charts or
rules.
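
The merging step can be illustrated with a counter over already-generalized tuples. The tuples below and the `merge_tuples` helper are made up for the example.

```python
from collections import Counter

# Hypothetical tuples after generalization: (major group, age group).
generalized = [
    ("Science", "young"),
    ("Science", "young"),
    ("Arts", "young"),
    ("Science", "middle-aged"),
    ("Science", "young"),
]

def merge_tuples(tuples):
    """Collapse identical tuples into (tuple, count) pairs, most frequent first."""
    return Counter(tuples).most_common()

for values, count in merge_tuples(generalized):
    print(values, count)
```

Each distinct tuple now appears once, and its count records how many original tuples it stands for.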

Steps in AOI Approach:


1. Data focusing should be performed prior to attribute-oriented induction. This step
corresponds to the specification of the task-relevant data.
2. The working relation is scanned once to collect statistics on the number of distinct
values per attribute.
3. The data mining query is transformed into a relational query.
4. Now that the data are ready for attribute-oriented induction, attribute removal is
performed: the generalization threshold for each attribute helps identify
attributes with a large number of distinct values as candidates for removal.
5. Attribute generalization is performed.
6. Aggregation is performed by merging identical generalized tuples and accumulating
their respective counts. This reduces the size of the generalized data set.
7. The generalized relation is presented.
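
The steps above can be strung together in a minimal sketch. Everything concrete here is an assumption made for illustration: the `working` relation stands in for the result of the relational query, `HIERARCHY` is a one-level concept hierarchy, and a single `THRESHOLD` is used for all attributes. The sketch climbs only one level and does not re-check the threshold afterwards.

```python
from collections import Counter

# Hypothetical working relation, as if already selected by a relational query.
working = [
    {"name": "Ann", "major": "Physics", "city": "Pune"},
    {"name": "Bob", "major": "Chemistry", "city": "Pune"},
    {"name": "Cy", "major": "History", "city": "Delhi"},
    {"name": "Di", "major": "Physics", "city": "Delhi"},
]
# Assumed hierarchy; attributes without an entry have no generalization operator.
HIERARCHY = {"major": {"Physics": "Science", "Chemistry": "Science", "History": "Arts"}}
THRESHOLD = 2  # assumed generalization threshold, shared by all attributes

def induce(rows, hierarchy, threshold):
    attrs = list(rows[0])
    # Scan once to count distinct values per attribute.
    distinct = {a: len({r[a] for r in rows}) for a in attrs}
    # Attribute removal: too many distinct values and no operator to generalize them.
    kept = [a for a in attrs if distinct[a] <= threshold or a in hierarchy]

    # Attribute generalization: climb one level where the threshold is exceeded.
    def lift(attr, value):
        if distinct[attr] > threshold:
            return hierarchy[attr].get(value, value)
        return value

    generalized = [tuple(lift(a, r[a]) for a in kept) for r in rows]
    # Aggregation: merge identical tuples and accumulate counts.
    return kept, Counter(generalized)

kept, relation = induce(working, HIERARCHY, THRESHOLD)
print(kept)  # ['major', 'city']
for values, count in relation.most_common():
    print(values, count)
```

Here name is removed (four distinct values, no hierarchy), major is generalized, and city is kept as it is.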

Attribute Removal:
Attribute removal is based on the following rules:

1. If there is a large set of distinct values for an attribute of the initial working
relation, but there is no generalization operator on the attribute, then that attribute
should be removed, because it cannot be generalized and preserving it would imply
keeping a large number of disjuncts, which contradicts the goal of generating concise
rules…
• This rule corresponds to the generalization rule known as dropping
conditions in the machine learning literature on learning from examples.
2. If the higher-level concepts of an attribute are expressed in terms of other
attributes, then the attribute should be removed from the working relation.
• For example, suppose that the attribute in question is street, whose higher-level
concepts are represented by the attributes (city, province or state, country).
The removal of street is equivalent to the application of a generalization
operator.
• This corresponds to the generalization rule known as climbing
generalization trees.
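
The two removal rules can be written as a small predicate. The function `should_remove`, its parameters, and the sample counts are hypothetical. The sketch also assumes that the second rule, like the first, only applies to attributes with many distinct values, which the text does not state explicitly.

```python
def should_remove(attr, distinct_counts, threshold, operators, expressed_by):
    """Return True if either attribute-removal rule applies to attr.

    distinct_counts: attribute -> number of distinct values in the working relation
    operators:       attributes that have a generalization operator
    expressed_by:    attribute -> attributes that carry its higher-level concepts
    """
    if distinct_counts[attr] <= threshold:
        return False  # few distinct values: nothing to remove (assumed for both rules)
    # Rule 1: many distinct values and no way to generalize them.
    if attr not in operators:
        return True
    # Rule 2: higher-level concepts already appear as other attributes.
    return any(h in distinct_counts for h in expressed_by.get(attr, ()))

# Hypothetical statistics, for illustration only.
counts = {"name": 500, "street": 400, "city": 30, "country": 3}
operators = {"street", "city"}
expressed_by = {"street": ("city", "province_or_state", "country")}

for attr in counts:
    print(attr, should_remove(attr, counts, 5, operators, expressed_by))
```

With these inputs, name is removed by the first rule and street by the second, while city and country stay in the working relation.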

Attribute Generalization:

Attribute generalization is based on the following rule:

1. If there is a large set of distinct values for an attribute in the initial working
relation, and there exists a set of generalization operators on the attribute, then a
generalization operator should be selected and applied to the attribute.
2. This rule is based on the following reasoning:
• Using a generalization operator to generalize an attribute value within a tuple,
or rule, in the working relation makes the rule cover more of the original
data tuples, thus generalizing the concept it represents.
• This corresponds to the generalization rule known as climbing generalization
trees in learning from examples.
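
Climbing a generalization tree can be sketched as repeatedly replacing each value with its parent until few enough distinct values remain. The `PARENT` mapping, the `climb` function, and the threshold are assumptions for illustration. Lifting every value one level at a time also assumes a balanced tree.

```python
# Hypothetical concept hierarchy, stored as child -> parent.
PARENT = {
    "Physics": "Science", "Chemistry": "Science", "Biology": "Science",
    "History": "Arts", "Music": "Arts",
    "Science": "ANY", "Arts": "ANY",
}

def climb(values, parent, threshold):
    """Replace each value by its parent until at most `threshold` distinct values remain."""
    current = list(values)
    while len(set(current)) > threshold:
        lifted = [parent.get(v, v) for v in current]
        if lifted == current:  # top of the tree reached; no operator left to apply
            break
        current = lifted
    return current

majors = ["Physics", "Chemistry", "History", "Music", "Biology"]
print(climb(majors, PARENT, threshold=2))
# ['Science', 'Science', 'Arts', 'Arts', 'Science']
```

Each generalized value now covers several of the original values, which is the sense in which the rule covers more of the original data tuples.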
