Chapter 3: Data Preprocessing

What is Data Preprocessing?

Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.

Why is Data Preprocessing Important?

Preprocessing of data is done mainly to check data quality. The quality can be assessed on the following factors:

 Accuracy: To check whether the data entered is correct or not.
Inaccurate data means having incorrect attribute values. Consider the following items_sold relation.
In Table 3.3, the items_sold relation of the ATTRONICS company has a tuple for transaction T2 with Item_ID IT24 and quantity -1012. In this tuple, the quantity of items sold seems incorrect, which may be due to a typing error or some garbage entry during automatic transmission of the data. There are many reasons that may be responsible for inaccurate data:
 The data collection instruments used may be faulty.
 There may have been human or computer errors occurring at data entry.
 Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (e.g., by choosing the default value “January 1" displayed for birthday). This is known as disguised missing data.
 Errors in data transmission can also occur.
 There may be technology limitations such as limited buffer size for coordinating synchronised data transfer and consumption.

 Completeness: To check whether the data is available or not recorded.

Consider the instance of the branch relation of the ATTRONICS company.

Table 3.4: Branch Relation

In Table 3.4, the occupation of customer C03 is not available, and the age of customer C04 is also missing. Incomplete data can occur for a number of reasons:
 Attributes of interest may not always be available, such as customer information for sales transaction data.
 Other data may not be included simply because they were not considered important at the time of entry.
 Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions.
 Data that were inconsistent with other recorded data may have been deleted.
 Furthermore, the recording of the data history or modifications may have been overlooked.
 Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.

 Consistency: To check whether the same data kept in different places matches.

Incorrect and redundant data may result from inconsistencies in naming conventions or data codes, or from inconsistent formats for input fields (e.g., date). Duplicate tuples also require data cleaning.

 Timeliness: The data should be updated correctly.

Timeliness also affects data quality. Failure to follow the schedule of record submission may occur for many reasons, such as:
 Numerous corrections and adjustments arise at the time of record submission.
 Technical errors occur during data uploading.
 The responsible person is unavailable.
For a period of time following each month, the data stored in the database are incomplete. However, once all of the data are received, they are correct. The fact that the month-end data are not updated in a timely fashion has a negative impact on the data quality.

 Believability: The data should be trustworthy.

 Interpretability: The data should be understandable.

Suppose that a database, at one point, had several errors, all of which have since been corrected. The past errors, however, had caused many problems for sales department users, and so they no longer trust the data. The data also use many accounting codes, which the sales department does not know how to interpret. Even though the database is now accurate, complete, consistent, and timely, users may regard it as of low quality due to poor believability and interpretability.
Major Tasks in Data Preprocessing:

1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation

Data cleaning:

Data cleaning is the process of removing incorrect, incomplete and inaccurate data from datasets; it also replaces missing values. Some of the techniques used in data cleaning are described below.

Handling missing values:

 Standard values like “Not Available” or “NA” can be used to replace the missing values.
 Missing values can also be filled in manually, but this is not recommended when the dataset is big.
 The attribute's mean value can be used to replace the missing value when the data is normally distributed, whereas in the case of a non-normal distribution the median value of the attribute can be used.
 When using regression or decision tree algorithms, the missing value can be replaced by the most probable value.
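As a brief illustration of the mean/median and constant-label replacements above, here is a minimal sketch assuming a small pandas DataFrame with hypothetical "income" and "occupation" columns:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        "income": [30000, 45000, np.nan, 52000, np.nan, 41000],
        "occupation": ["teacher", None, "farmer", "dentist", None, "programmer"],
    })

    # Mean replacement: suitable when the attribute is roughly normally distributed.
    df["income"] = df["income"].fillna(df["income"].mean())

    # Median would be preferred for a skewed (non-normal) distribution:
    # df["income"] = df["income"].fillna(df["income"].median())

    # A standard label such as "Not Available" for a categorical attribute.
    df["occupation"] = df["occupation"].fillna("Not Available")

    print(df)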

Noisy data:

Noise generally means random error or unnecessary data points. Here are some of the methods to handle noisy data.

 Binning: This method is used to smooth or handle noisy data. First, the data is sorted, and then the sorted values are separated and stored in the form of bins. There are three methods for smoothing the data in a bin.
 Smoothing by bin mean: In this method, the values in the bin are replaced by the mean value of the bin.
 Smoothing by bin median: In this method, the values in the bin are replaced by the median value of the bin.
 Smoothing by bin boundary: In this method, the minimum and maximum values of the bin are taken as the bin boundaries, and each value is replaced by the closest boundary value.

 Binning Methods for Data Smoothing
 The binning method can be used for smoothing the data. Real-world data is often full of noise. Data smoothing is a data pre-processing technique that uses an algorithm to remove the noise from the data set. This allows important patterns to stand out.

 Unsorted data for price in dollars

 Before sorting: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34
 First of all, sort the data.
 After sorting: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34

 Smoothing the data by equal-frequency bins

 Bin 1: 8, 9, 15, 16
 Bin 2: 21, 21, 24, 26
 Bin 3: 27, 30, 30, 34

 Smoothing by bin means

 For Bin 1:
 (8 + 9 + 15 + 16) / 4 = 12
 (4 is the number of values in the bin: 8, 9, 15, 16)
 Bin 1 = 12, 12, 12, 12

 For Bin 2:
 (21 + 21 + 24 + 26) / 4 = 23
 Bin 2 = 23, 23, 23, 23

 For Bin 3:
 (27 + 30 + 30 + 34) / 4 = 30.25 ≈ 30
 Bin 3 = 30, 30, 30, 30

 How to smooth data by bin boundaries?

 Pick the minimum and maximum value of each bin. The minimum becomes the left boundary and the maximum becomes the right boundary.
 Now, what happens to the middle values?
 Each middle value moves to the boundary value that is the smaller distance away.
Unsorted data for price in dollars:
 Before sorting: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34
 First of all, sort the data.
 After sorting: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
 Smoothing the data by equal-frequency bins
 Bin 1: 8, 9, 15, 16
 Bin 2: 21, 21, 24, 26
 Bin 3: 27, 30, 30, 34
 Smoothed data after bin boundaries
 Before bin boundary: Bin 1: 8, 9, 15, 16
 Here, 8 is the minimum value and 16 is the maximum value. 9 is nearer to 8, so 9 will be treated as 8. 15 is nearer to 16 and farther from 8, so 15 will be treated as 16.
 After bin boundary: Bin 1: 8, 8, 16, 16

 Before bin boundary: Bin 2: 21, 21, 24, 26
 After bin boundary: Bin 2: 21, 21, 26, 26

 Before bin boundary: Bin 3: 27, 30, 30, 34
 After bin boundary: Bin 3: 27, 27, 27, 34
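The worked example above can be reproduced with a short script. This is a minimal sketch in Python (the price list and the bin size of 4 follow the example; the variable names are illustrative only):

    import numpy as np

    prices = [8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34]
    prices.sort()                                                # sort the data first
    bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]   # equal-frequency bins of size 4

    # Smoothing by bin means: every value becomes the (rounded) mean of its bin.
    by_means = [[int(round(np.mean(b)))] * len(b) for b in bins]

    # Smoothing by bin boundaries: every value becomes the closer of min(bin) / max(bin).
    by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

    print(by_means)   # [[12, 12, 12, 12], [23, 23, 23, 23], [30, 30, 30, 30]]
    print(by_bounds)  # [[8, 8, 16, 16], [21, 21, 26, 26], [27, 27, 27, 34]]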

 Advantages (Pros) of data smoothing

 Data smoothing makes important hidden patterns in the data set easier to see and understand.
 Data smoothing can be used to help predict trends. Prediction is very helpful for making the right decisions at the right time.

 Regression: This is used to smooth the data and helps to handle data when unnecessary data points are present. For analysis purposes, regression helps to decide which variables are suitable for our analysis.
 Clustering: This is used for finding outliers and also for grouping the data. Clustering is generally used in unsupervised learning.

Data integration:

Data integration is the process of combining data from multiple sources into a single dataset. It is one of the main components of data management. There are some problems to be considered during data integration.

 Schema integration: Integrates metadata (a set of data that describes other data) from different sources.
 Entity identification problem: Identifying entities from multiple databases. For example, the system or the user should know that student_id from one database and student_name from another database belong to the same entity.
 Detecting and resolving data value conflicts: The data taken from different databases may differ while merging; attribute values from one database may differ from those in another database. For example, the date format may differ, like “MM/DD/YYYY” or “DD/MM/YYYY”.
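A small hedged sketch of the last point, assuming two hypothetical tables that store the same date in "MM/DD/YYYY" and "DD/MM/YYYY" formats; the table and column names are invented for illustration:

    import pandas as pd

    students_a = pd.DataFrame({"student_id": [1, 2],
                               "enrolled":   ["01/31/2020", "02/15/2020"]})   # MM/DD/YYYY
    students_b = pd.DataFrame({"student_id": [1, 2],
                               "student_name": ["Asha", "Ravi"],
                               "enrolled":   ["31/01/2020", "15/02/2020"]})   # DD/MM/YYYY

    # Convert both sources to one consistent datetime representation first.
    students_a["enrolled"] = pd.to_datetime(students_a["enrolled"], format="%m/%d/%Y")
    students_b["enrolled"] = pd.to_datetime(students_b["enrolled"], format="%d/%m/%Y")

    # Now the two tables can be combined on the shared keys without conflicts.
    merged = pd.merge(students_a, students_b, on=["student_id", "enrolled"], how="inner")
    print(merged)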

Data reduction:

This process helps reduce the volume of the data, which makes analysis easier while producing the same or almost the same result. The reduction also helps to save storage space. Some of the techniques used in data reduction are dimensionality reduction, numerosity reduction and data compression.

 Dimensionality reduction: This process is necessary for real-world applications because the data size is big. In this process, the number of random variables or attributes is reduced so that the dimensionality of the data set decreases. Attributes are combined and merged without losing the original characteristics of the data. This also reduces storage space and computation time. When the data is highly dimensional, a problem called the “Curse of Dimensionality” occurs.
 Numerosity reduction: In this method, the representation of the data is made smaller by reducing the volume. There will not be any loss of data in this reduction.
 Data compression: The compressed representation of data is called data compression. This compression can be lossless or lossy. When there is no loss of information during compression, it is called lossless compression, whereas lossy compression reduces the information but removes only unnecessary information.

Data Transformation:

The change made to the format or the structure of the data is called data transformation. This step can be simple or complex based on the requirements. Some methods of data transformation are listed below.

 Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps in identifying the important features of the dataset. By smoothing, we can detect even small changes that help in prediction.
 Aggregation: In this method, the data is stored and presented in the form of a summary. Data from multiple sources is integrated for data analysis. This is an important step, since the accuracy of the analysis depends on the quantity and quality of the data. When the quality and the quantity of the data are good, the results are more relevant.
 Discretization: The continuous data is split into intervals. Discretization reduces the data size. For example, rather than specifying the exact class time, we can set an interval like (3 pm-5 pm, 6 pm-8 pm).
 Normalization: It is the method of scaling the data so that it can be represented in a smaller range, for example from -1.0 to 1.0.

3.1 DATA OBJECTS AND ATTRIBUTE TYPES

Data are the basic units of information that are collected through observation. Data are a set of values of qualitative or quantitative variables about one or more persons or objects, while a datum (singular of data) is a single value of a single variable.
So, data is information that has been translated into a form that is efficient for movement or processing.
For preprocessing of data as well as for exploratory data analytics, it is important to know the type of data that needs to be dealt with. Knowing the type of data helps in choosing the right statistical measure, the appropriate data visualization and so on.
There are mainly two types of data, namely Categorical data and Numerical data.
Categorical data is non-numeric and consists of text that can be coded as numeric. However, these numbers do not represent any fixed mathematical notation or meaning for the text and are simply assigned as labels or codes. Categorical data can be of two types: Nominal data is used to label variables without providing any quantitative value, and Ordinal data is used to label variables that need to follow some order.
Numerical data is numeric and usually follows an order of values. These quantitative data represent fixed values and can be of two types: Interval data follows numeric scales in which the order and exact differences between the values are considered, and Ratio data also follows numeric scales and has an equal and definitive ratio between data values. A dataset is a collection of data objects and their attributes.

Data Attributes
An attribute is a property or characteristic of an object. A data attribute is a single-value descriptor for a data object. For example, the eye color of a person, the name of a student, etc.
An attribute is also known as a variable, field, characteristic, or feature.
The distribution of data involving one attribute (or variable) is called univariate. A bivariate distribution involves two attributes, and so on.

Data Objects
A data object is a collection of attributes which describes an object. Data objects can also be referred to as samples, examples, instances, cases, entities, data points or objects. If the data objects are stored in a database, they are data tuples. That is, the rows of a database correspond to the data objects, and the columns correspond to the attributes (see Table 3.1).
Consider the case study of a company named ATTRONICS, which is described by the relation tables: customer, item, employee, and branch.

The headers of the tables described here are as follows:

customer (Cust_ID, Name, Address, Age, Occupation, Annual_Income, Credit_Information, Category)

item (Item_ID, Brand, Category, Type, Price, Place_Made, Supplier, Cost)

employee (Emp_ID, Name, Category, Group, Salary, Commission)

branch (Br_ID, Name, Address)

purchases (Trans_ID, Cust_ID, Emp_ID, Date, Time, Method_Paid, Amount)

items_sold (Trans_ID, Item_ID, Qty)

works_at (Emp_ID, Br_ID)

The relation customer consists of a set of attributes describing the customer information, including a unique customer identity number (Cust_ID), Name, Address, Age, Occupation, Annual_Income, Credit_Information and Category.

Similarly, each of the relations item, employee and branch (see Table 3.1) consists of a set of attributes describing the properties of these entities. Tuples/rows in a table are known as data objects.

What is an Attribute?

An attribute can be defined as a field for storing data that represents a characteristic of a data object. The attribute is a property of the object and represents a feature of the object. For example, hair color is an attribute of a person. Similarly, roll number and marks are attributes of a student. An attribute vector is a set of attributes used to describe a given object.

Types of attributes
We need to differentiate between different types of attributes during data preprocessing. Firstly, we need to differentiate between qualitative and quantitative attributes.
Quantitative data is anything that can be counted or measured; it refers to numerical data. Qualitative data is descriptive, referring to things that can be observed but not measured, such as colors or emotions.
1. Qualitative attributes, such as Nominal, Ordinal, and Binary attributes.
2. Quantitative attributes, such as Discrete and Continuous attributes.
There are different types of attributes; some of them are described below.
Example of attributes
In this example, RollNo, Name, and Result are attributes of the object named student.
RollNo | Name  | Result
1      | Ali   | Pass
2      | Akram | Fail

Types Of attributes

 Binary
 Nominal
 Ordinal Attributes
 Numeric
o Interval-scaled
o Ratio-scaled

Nominal Attributes

Nominal data is in alphabetical (textual) form, not integer form. Nominal attributes are qualitative attributes.
Suppose that Hair color and Marital status are two attributes describing person objects. In our application, possible values for Hair color are black, brown, blond, red, auburn, grey, and white. Marital status can take on the values single, married, divorced, and widowed. Both Hair color and Marital status are nominal attributes. Occupation is another example, with the values teacher, dentist, programmer, farmer, and so on.
Although we said that the values of a nominal attribute
are symbols or “names of things”, it is possible to
represent such symbols or “names” with numbers.
With Hair color, for instance, we can assign a code of 0
for black, 1 for brown, and so on. Customer ID is
another example of a nominal attribute whose possible
values are all numeric. However, in such cases, the
numbers are not intended to be used quantitatively.
That is, mathematical operations on values of nominal
attributes are not meaningful. It makes no sense to
subtract one customer ID number from another,
unlike, say, subtracting an age value from another.
(Age is a numeric attribute). Even though a nominal
attribute may have integers as values, it is not
considered a numeric attribute because the integers
are not meant to be used quantitatively.
Because nominal attribute values do not have any
meaningful order about them and are not quantitative,
it makes no sense to find the mean (average) value or
median (middle) value for such an attribute, given a
set of objects.
Examples of Nominal attributes
In this example, States and Colors are attributes, and New, Pending, Working, Complete, Finish and Black, Brown, White, Red are their values.

Attribute        | Value
Categorical data | Lecturer, Assistant Professor, Professor
States           | New, Pending, Working, Complete, Finish
Colors           | Black, Brown, White, Red

Binary Attributes

Binary data have only two values/states. A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0 typically means that the attribute is absent, and 1 means that it is present. Binary attributes are referred to as Boolean if the two states correspond to true and false. Binary attributes are qualitative attributes.
Examples of Binary Attributes

Attribute       | Value
Corona detected | Yes, No
Result          | Pass, Fail

Binary attributes are of two types:

1. Symmetric binary
2. Asymmetric binary

Examples of Symmetric data

Both values are equally important. For example, if we have open admission to our university, then it does not matter whether an applicant is male or female.
Example:

Attribute | Value
Gender    | Male, Female

Examples of Asymmetric data

Both values are not equally important. For example, "Corona detected" is more important than "Corona not detected". If a patient has COVID-19 and we ignore it, it can lead to death, but if a person does not have COVID-19 and we ignore it, then there is no special issue or risk.
Example:

Attribute       | Value
Corona detected | Yes, No
Result          | Pass, Fail

Ordinal Attributes

All values have a meaningful order. For example, Grade A means highest marks, B means marks are less than A, C means marks are less than grades A and B, and so on. Ordinal attributes are qualitative attributes.
Examples of Ordinal Attributes

Attribute             | Value
Grade                 | A, B, C, D, F
BPS (Basic pay scale) | 16, 17, 18

Numeric Attributes
A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real values. Numeric attributes can be interval-scaled or ratio-scaled.
Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-sized units. The values of interval-scaled attributes have order and can be positive, 0, or negative. Thus, in addition to providing a ranking of values, such attributes allow us to compare and quantify the difference between values.
Example: Interval-scaled attributes.
Temperature is an interval-scaled attribute. Suppose that we have the outdoor temperature value for a number of different days, where each day is an object. By ordering the values, we obtain a ranking of the objects with respect to temperature. In addition, we can quantify the difference between values. For example, a temperature of 20°C is 5 degrees higher than a temperature of 15°C. Calendar dates are another example. For instance, the years 2002 and 2010 are 8 years apart.
Because interval-scaled attributes are numeric, we can compute their mean value, in addition to the median and mode measures of central tendency.

Ratio-Scaled Attributes
A ratio-scaled attribute is a numeric attribute with an
inherent zero-point. That is, if a measurement is ratio-
scaled, we can speak of a value as being a multiple (or
ratio) of another value. In addition, the values are
ordered, and we can also compute the difference
between values. The mean, median, and mode can be
computed as well.
Example: Ratio-scaled attributes. Unlike temperatures in Celsius and Fahrenheit, the Kelvin (K) temperature scale has what is considered a true zero-point (0 K = −273.15°C): it is the point at
kinetic energy. Other examples of ratio-scaled
attributes include Count attributes such as Years of
experience (where the objects are employees, for
example) and Number of words (where the objects are
documents). Additional examples include attributes to
measure weight, height, latitude and longitude
coordinates (e.g., when clustering houses), and
monetary quantities (e.g., you are 100 times richer
with $100 than with $1).

Discrete Attributes
Discrete data have a finite set of values. They can be in numerical form and can also be in categorical form. Discrete attributes are quantitative attributes. A discrete attribute has a finite or countably infinite set of values, which may or may not be represented as integers. The attributes Hair color, Smoker, Medical test, and Drink size each have a finite number of values, and so are discrete. Note that discrete attributes may have numeric values, such as 0 and 1 for binary attributes, or the values 0 to 110 for the
attribute Age. An attribute is countably infinite if the
set of possible values is infinite, but the values can be
put in a one-to-one correspondence with natural
numbers. For example, the attribute customer ID is
countably infinite. The number of customers can grow
to infinity, but in reality, the actual set of values is
countable (where the values can be put in one-to-one
correspondence with the set of integers). Zip codes are
another example.

Examples of Discrete Data

Attribute   | Value
Profession  | Teacher, Business Man, Peon, etc.
Postal Code | 42200, 42300, etc.

Continuous Attributes
Continuous data technically have an infinite number of possible values. Continuous data are of float type; there can be many numbers between 1 and 2. Continuous attributes are quantitative attributes.
Example of Continuous Attributes

Attribute | Value
Height    | 5.4..., 6.5..., etc.
Weight    | 50.09..., etc.

What Is Data Wrangling?

Data wrangling is the process of gathering, collecting, and transforming raw data into another format for better understanding, decision-making, access, and analysis in less time. Data wrangling is also known as data munging.

Importance Of Data Wrangling

Data wrangling is a very important step in a data science project. The example below explains its importance:
A book-selling website wants to show the top-selling books of different domains according to user preference. For example, if a new user searches for motivational books, the site wants to show the motivational books which sell the most or have a high rating, etc. But on the website, there is plenty of raw data from different users. Here the concept of data munging or data wrangling is used. Data wrangling is not done by the system itself; it is done by data scientists. The data scientist will wrangle the data in such a way that it surfaces the motivational books that are sold the most, have high ratings, or are frequently bought together with other books, etc. On that basis, the new user can make a choice. This illustrates the importance of data wrangling.

Data wrangling deals with the following functionalities:

1. Data exploration: In this process, the data is studied, analyzed, and understood by visualizing representations of the data.
2. Dealing with missing values: Most large datasets contain missing (NaN) values; they need to be taken care of by replacing them with the mean, mode, or most frequent value of the column, or simply by dropping the rows having a NaN value.
3. Reshaping data: In this process, data is manipulated according to the requirements, where new data can be added or pre-existing data can be modified.
4. Filtering data: Sometimes datasets contain unwanted rows or columns which need to be removed or filtered out.
5. Other: After dealing with the raw dataset using the above functionalities, we get an efficient dataset as per our requirements, and it can then be used for the required purpose, such as data analysis, machine learning, data visualization, model training, etc.
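A brief sketch touching each of the functionalities above, using a hypothetical book-sales DataFrame (all column names are invented for illustration):

    import pandas as pd
    import numpy as np

    books = pd.DataFrame({
        "title":  ["Book A", "Book B", "Book C", "Book D"],
        "domain": ["motivational", "fiction", "motivational", "fiction"],
        "rating": [4.5, np.nan, 4.8, 3.9],
        "copies_sold": [1200, 300, np.nan, 450],
    })

    print(books.describe())                                            # 1. explore the data
    books["rating"] = books["rating"].fillna(books["rating"].mean())   # 2. handle missing values
    books["copies_sold"] = books["copies_sold"].fillna(0)
    top = books.sort_values("copies_sold", ascending=False)            # 3. reshape/reorder
    motivational = top[top["domain"] == "motivational"]                # 4. filter rows
    print(motivational)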

3.3.1 Data Cleaning

 Real-world data tend to be incomplete, noisy, and inconsistent. This dirty data can cause errors during data analysis.
 Data cleaning is done to handle irrelevant or missing data. Data cleaning is also known as data cleansing or scrubbing. Data is cleaned by filling in the missing values, smoothing any noisy data, identifying and removing outliers, and resolving any inconsistencies.
 Data cleaning is the process of correcting or removing incorrect, incomplete or duplicate data within a dataset.

3.3.1.1 Missing Values

 The raw data that is collected for analysis usually contains several types of errors that need to be prepared and processed before data analysis.
 Some values in the data may not be filled in for various reasons and hence are considered missing. If some of the tuples in a database have no recorded value for several attributes, then it becomes difficult to proceed with the data. For example, in Table 3.4 the occupation of customer C03 is not available, and the age of customer C04 is missing. In some cases, missing data arises because the data was never gathered in the first place for some entities.
 The data analyst needs to take an appropriate decision for handling such data.

In general, there can be three cases of missing data, as explained below:
 Missing Completely At Random (MCAR), which occurs due to someone forgetting to fill in the value or having lost the information.
 Missing At Random (MAR), which occurs due to someone purposely not filling in the data, mainly due to privacy issues.
 Missing Not At Random (MNAR), which occurs when the data may simply not be available.

The analyst can take the following actions for handling such missing values:

1. Ignore the Tuple:

This is usually done when the class label is missing (assuming the mining task involves classification).
This method is not very effective unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably. By ignoring the tuple, we do not make use of the remaining attributes' values in the tuple. Such data could have been useful to the task at hand.

2. Fill in the Missing Value Manually:

In general, this approach is time consuming and may not be feasible given a large data set with many missing values.

3. Use a Global Constant to Fill in the Missing Value:

Replace all missing attribute values by the same constant, such as a label like "Unknown" or −∞. If missing values are replaced by, say, "Unknown," then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of "Unknown." Hence, although this method is simple, it is not foolproof.

4. Use a Measure of Central Tendency for the Attribute (e.g., the Mean or Median) to Fill in the Missing Value:
 For this, a particular column is selected for which the central value (say, the median) is found. All the NaN values of that particular column are then replaced with this central value.
 Instead of the median, the mean or mode value can also be used. Replacing NaN values with the mean, mode or median is considered a statistical approach to handling missing values.

5. Use the Attribute Mean or Median for all Samples belonging to the same Class as the given Tuple:
 For all the customers who belong to the same class, their missing attribute value can be replaced by the mean of that class only.

6. Use the Most Probable Value to Fill in the Missing Value:
 This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction, so prediction algorithms can be utilised to find missing values.
 For example, the income of a customer can be predicted by training a decision tree with the help of the remaining customer data, and the value for the missing attribute can be identified.
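A hedged sketch of methods 5 and 6, assuming a hypothetical customer table: the class-wise mean fill uses a pandas groupby, and the most-probable-value fill uses a scikit-learn decision tree. This is one possible implementation, not the only one:

    import pandas as pd
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    customers = pd.DataFrame({
        "category": ["Gold", "Gold", "Silver", "Gold", "Silver"],
        "age":      [34, 36, 28, np.nan, 30],
        "income":   [70000, 72000, 40000, 71000, 42000],
    })

    # Method 5: replace the missing age by the mean age of the same customer class.
    customers["age_class_mean"] = customers["age"].fillna(
        customers.groupby("category")["age"].transform("mean"))

    # Method 6: predict the missing age from the remaining attributes (here, income).
    known = customers[customers["age"].notna()]
    model = DecisionTreeRegressor().fit(known[["income"]], known["age"])
    missing = customers["age"].isna()
    customers.loc[missing, "age"] = model.predict(customers.loc[missing, ["income"]])
    print(customers)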

Noisy Data
 Noisy data contains errors or outliers. For example, in stored employee details, all values of the age attribute are within the range 22-45 years, whereas one record has the age attribute value 80.
 There are times when the data is not missing, but it is corrupted for some reason. This is, in some ways, a bigger problem than missing data.
 Data corruption may be a result of faulty data collection instruments, data entry problems, or technology limitations.

Table 3.5: items_sold Relation

 In the items_sold relation of the ATTRONICS company, transaction T2 has an incorrect value for item quantity, which is treated as noisy data.
 Just as there is no single technique to take care of missing data, there is no one way to remove noise or smooth out the noisiness in the data.
 The following topics explain some causes of noisy data:

Duplicate Entries:
 Duplicate entries in a dataset are a big problem; before starting analysis, such duplication should be identified and handled properly.
 In those cases, we usually want to compact the duplicates into one entry, adding an additional column that indicates how many identical entries there were. In other cases, the duplication is purely a result of how the data was generated.
 For example, the data might be derived by selecting several columns from a larger dataset, and there are no duplicates if we count the other columns.
 Data duplication can also occur when you are trying to group data from various sources. This is a common issue for organizations that use webpage scraping tools to accumulate data from various websites.
 Duplicate entries cause many problems: they lead to data redundancy, inconsistencies and degraded data quality, which impact the outcomes of data analysis.
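A small sketch, with invented product data, of compacting duplicate rows into one entry plus an occurrence count, as described above:

    import pandas as pd

    scraped = pd.DataFrame({
        "item_id": ["IT01", "IT01", "IT02", "IT03", "IT03", "IT03"],
        "brand":   ["Alpha", "Alpha", "Beta", "Gamma", "Gamma", "Gamma"],
    })

    # Compact duplicates into one row and record how many times each entry occurred.
    deduped = (scraped.groupby(["item_id", "brand"], as_index=False)
                      .size()
                      .rename(columns={"size": "n_occurrences"}))
    print(deduped)

    # Or simply drop exact duplicates when the count is not needed:
    unique_rows = scraped.drop_duplicates()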

Multiple Entries for a Single Entity:

 In real-world databases, each entity logically corresponds to one row in the dataset, but some entities are repeated multiple times with different data.
 The most common cause of this is that some of the entries are out of date, and only one row is currently correct.
 Another case where there can be multiple entries is if, for some reason, the same entity was occasionally processed twice by whatever gathered the data.

NULLs:
 If the value of an attribute is not known, then it is considered NULL.
 NULLs can arise because the data collection process failed in some way.
 When it comes time to do analytics, NULLs cannot be processed by many algorithms.
 In these cases, it is often necessary to replace the missing values with some reasonable proxy.

What we see most often is that the value is guessed from other data fields, or the mean of all the non-null values is simply used. For example, if the mean of the age attribute for all Gold category customers is 35, then for customer C04 the NULL value of age can be replaced with 35. In some cases, NULL values arise because that data was never collected.

Huge Outliers:
 An outlier is a data point that differs significantly from other
observations.
 They are extreme values that deviate from other observations in the data; they may indicate variability in a measurement, experimental errors or a novelty.

Most common causes of outliers in a data set:

1. Data entry errors (human errors).
2. Measurement errors (instrument errors).
3. Experimental errors (data extraction or experiment planning/execution errors).
4. Intentional (dummy outliers made to test detection methods).
5. Data processing errors (data manipulation or unintended data set mutations).
6. Sampling errors (extracting or mixing data from wrong or various sources).
7. Natural (not an error; novelties in data).
 Sometimes, a massive outlier in the data is there because there
was truly an unusual event. How to deal with that depends on
the context. Sometimes, the outliers should be filtered out of
the dataset.
 For example, we are usually interested in predicting page views
by humans. A huge spike in recorded traffic is likely to come
from a bot attack, rather than any activities of humans.
 In other cases, outliers just mean missing data. Some storage
systems don't allow the explicit concept of a NULL value, so
there is some predetermined value that signifies missing data.
If many entries have identical, seemingly arbitrary values, then
this might be what's happening.
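One common way to flag huge outliers is the interquartile range (IQR) rule; the sketch below applies it to hypothetical hourly page-view counts (the 1.5 x IQR threshold is a conventional choice, not the only one):

    import numpy as np

    page_views = np.array([120, 135, 128, 150, 142, 138, 9000, 131, 145])

    q1, q3 = np.percentile(page_views, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    outliers = page_views[(page_views < lower) | (page_views > upper)]
    print(outliers)          # [9000] -- likely bot traffic rather than human activity

    # Depending on the context, such points may be filtered out before analysis:
    filtered = page_views[(page_views >= lower) & (page_views <= upper)]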

Out-of-Date Data:
 In many databases, every row has a timestamp for when it was
entered. When an entry is updated, it is not replaced in the
dataset; instead, a new row is put in that has an up-to-date
timestamp.
 For this reason, many datasets include entries that are no
longer accurate and only useful if you are trying to reconstruct
the history of the database.

Artificial Entries:
 Many industrial datasets have artificial entries that have been
deliberately inserted into the real data.
 This is usually done for purposes of testing the software
systems that process the data.

Irregular Spacings:
 Many datasets include measurements taken at regular spacings. For example, you could have the traffic to a website every hour, or the temperature of a physical object measured at every inch.
 Most of the algorithms that process data such as this assume that the data points are equally spaced, which presents a major problem when they are irregular. If the data is from sensors measuring something such as temperature, then typically we have to use interpolation techniques (interpolation is the process of using known data values to estimate unknown data values) to generate new values at a set of equally spaced points.
 A special case of irregular spacings happens when two entries have identical timestamps but different numbers. This usually happens because the timestamps are only recorded to finite precision.
 If two measurements happen within the same minute, and time is only recorded up to the minute, then their timestamps will be identical.
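A short sketch of this idea, assuming hypothetical temperature-sensor readings with irregular timestamps: the series is resampled onto a regular 5-minute grid and the gaps are filled by linear interpolation.

    import pandas as pd

    readings = pd.Series(
        [21.0, 21.4, 22.1, 23.0],
        index=pd.to_datetime(["2023-01-01 00:00", "2023-01-01 00:07",
                              "2023-01-01 00:21", "2023-01-01 00:30"]),
    )

    # Resample to a regular 5-minute grid, then fill the gaps by time-based interpolation.
    regular = readings.resample("5min").mean().interpolate(method="time")
    print(regular)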
Formatting Issues:
• Various formatting issues are explained below:

Formatting Is Irregular between Different Tables/Columns


 This happens a lot, typically because of how the data was
stored in the first place.
 It is an especially big issue when joinable/groupable keys are
irregularly formatted between different datasets.

Extra Whitespaces:
 A white space is blank space within text. Appropriate use of white space increases readability and focuses the readers' attention.
 For example, within a text, white spaces split big chunks of text into small paragraphs, which makes them easy to understand.
 A string with and without blank spaces is not the same: "ABC" != " ABC". These two ABCs are not equal, but the difference is so small that you often don't notice it.
 Without the quotes enclosing the strings, you would hardly notice that ABC != ABC. But computer programs are strict in their interpretation, and if these values are a merging key, we would receive an empty result.
 Blank strings, spaces, and tabs are sometimes treated as empty values represented as NaN, which can lead to unexpected results.
 Also, even though white spaces are almost invisible, pile millions of them into a file and they will take some space; they may overflow the size limit of your database column, leading to an error.

Irregular Capitalization and Inconsistent Delimiters:

 A dataset may have problems with irregular capitalization of text data. A dataset will usually have a single delimiter, but sometimes different tables will use different ones.
 The most commonly used delimiters are commas, tabs and pipes (the vertical line "|").

Irregular NULL Format:

There are a number of different ways that missing entries are encoded in CSV files, and they should all be interpreted as NULLs when the data is read in.
Some popular examples are the empty string "", "NA", and "NULL". Occasionally, you will see others such as "unavailable" or "unknown" as well.
Invalid Characters:
 Some data files will randomly have invalid bytes in the middle of them.
 Some programs will throw an error if we try to open anything that isn't valid text. In these cases, we may have to filter out the invalid bytes.

Weird or Incompatible Date and Times:


 Dates and times are among the most frequently used types of data field. Some of the date formats we will see are as follows:
June 1, 2020
JUN 1, '20
2020-06-01
 There is an important way in which dates and times differ from other formatting issues.
 Most of the time we have two different ways of expressing the same information, and a perfect translation is possible from one to the other. But with dates and times, the information content itself can be different.
 For example, we might have just the date, or there could also
be a time associated with it. If there is a time, does it go out to
the minute, hour, second, or something else?
What about time zones?
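A brief sketch showing how the three date formats listed above could be parsed into one consistent datetime representation with pandas (the format strings are assumptions about how each string is laid out; time zones would need separate handling):

    import pandas as pd

    raw_dates = ["June 1, 2020", "JUN 1, '20", "2020-06-01"]
    formats   = ["%B %d, %Y",    "%b %d, '%y", "%Y-%m-%d"]

    parsed = [pd.to_datetime(d, format=fmt) for d, fmt in zip(raw_dates, formats)]
    print(parsed)   # all three become Timestamp('2020-06-01 00:00:00')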

Data Transformation
 Data transformation is the process of converting raw data into a format or structure that is more suitable for data analysis.
 Data transformation is a data preprocessing technique that transforms the data into alternate forms appropriate for mining; it converts raw data into a single, easy-to-read format to facilitate analysis.
 Data transformation is the process of changing the format, structure, or values of data. The choice of data transformation technique depends on how the data will later be used for analysis.
 For example, changing date and time formats is related to data format transformation.
 Renaming, moving, and combining columns in a database are related to structural transformation of data.
 Transformation of data values means transforming the values into a range that is easier to analyze. This is done because the values for different attributes are often found on very different scales.
 For example, for a company, age values for employees can be within the range 20-55 years, whereas salary values for employees can be within the range Rs. 10,000-Rs. 1,00,000.
 This indicates that one column in a dataset can be weighted more heavily than others due to the varying range of values. In such cases, applying statistical measures for data analysis across this dataset may lead to unnatural or incorrect results.
 Data transformation is hence required to solve this issue before applying any analysis to the data.
 Various data transformation techniques are used during data preprocessing. The choice of data transformation technique depends on how the data will later be used for analysis.
 Some of these important standard data preprocessing techniques are Rescaling, Normalizing, Binarizing, Standardizing, Label Encoding and One Hot Encoding.

Benefits of Data Transformation:

1. Data is transformed to make it better organized. Transformed data may be easier for both humans and computers to use.
2. Properly formatted and validated data improves data quality and protects applications from potential landmines such as null values, unexpected duplicates, incorrect indexing, and incompatible formats.
3. Data transformation facilitates compatibility between applications, systems, and types of data. Data used for multiple purposes may need to be transformed in different ways. Many strategies are available for data transformation in data preprocessing.

Some of the strategies for data transformation include the following:

1. Rescaling:
 Rescaling means transforming the data so that it fits within a specific scale, like 0-100 or 0-1. Rescaling of data allows scaling all data values to lie between a specified minimum and maximum value (say, between 0 and 1).
 When the data encompasses attributes with varying scales, many statistical or machine learning techniques prefer rescaling the attributes to fall within a given scale.
 Scaling variables helps to compare different variables on an equal footing.
2. Normalizing:
 The measurement unit used can affect the data analysis. For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results.
 In general, expressing an attribute in smaller units will lead to a larger range for that attribute, and thus tend to give such an attribute greater effect or "weight."
 To help avoid dependence on the choice of measurement units, the data should be normalized.
 Normalization scales the attribute data so as to fall within a smaller range, such as 0.0 to 1.0 or -1.0 to 1.0.
 Normalization ensures that the attribute values used in computations are not affected by trivial variations like height, width, scaling factors, orientations, etc.
 Normalizing the data attempts to give all attributes an equal weight.

3. Binarizing:
 It is the process of converting data to either 0 or 1 based on a threshold value.
 All the data values above the threshold value are marked 1, whereas all the data values equal to or below the threshold value are marked 0.
 Data binarizing is done prior to data analysis in many cases, such as dealing with crisp values for the handling of probabilities and adding new meaningful features to the dataset.

4. Standardizing:
 Standardization is also called mean removal. It is the process of transforming attributes having a Gaussian (normal) distribution with differing mean and standard deviation values into a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.
 In other words, standardization is another scaling technique where the values are centered around the mean with a unit standard deviation.
 This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.
 Standardization of data is done prior to data analysis in many cases, such as linear discriminant analysis and linear regression.
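A compact sketch of rescaling, binarizing and standardizing, assuming scikit-learn is available; the salary values and the threshold of 50,000 are invented for illustration:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, Binarizer, StandardScaler

    salaries = np.array([[10000.0], [25000.0], [55000.0], [100000.0]])

    rescaled     = MinMaxScaler(feature_range=(0, 1)).fit_transform(salaries)   # values in [0, 1]
    binarized    = Binarizer(threshold=50000).fit_transform(salaries)           # 1 above threshold, else 0
    standardized = StandardScaler().fit_transform(salaries)                     # mean 0, std 1

    print(rescaled.ravel())
    print(binarized.ravel())
    print(standardized.ravel())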

5. Label Encoding:
 The label encoding process is used to convert textual labels into
numeric form in order to prepare it to be used in a machine
readable form.
 The labels are assigned a value of 0 to (n-1) where n is the
number of distinct values for a particular categorical feature.
 The numeric values are repeated for the same label of that
attribute. For instance, let us consider the feature 'gender'
having two values - male and female.
 Using label encoding, each gender value will be marked with
unique numerical values starting with 0. Thus males will be
marked 0, females will be marked 1.

6. One Hot Encoding:

 One hot encoding refers to splitting a column which contains categorical data into many columns, depending on the number of categories present in that column.
 Each new column contains "0" or "1" corresponding to which column the value has been placed in. Many data science algorithms cannot operate on label data directly; they require all input variables and output variables to be numeric.
 Categorical data must therefore be converted to a numerical form before proceeding with data analysis.
 One hot encoding is used for categorical variables where no ordinal relationship exists among the variable's values.
 For example, consider the variable named "color". It may have the values red, green, blue, etc., which have no specific order. In other words, the different categories of color (red, green, blue, etc.) do not have any specific order.
 As a first step, each unique category value is assigned an integer value. For example, "red" is 1, "green" is 2, and "blue" is 3.
 But assigning a numerical value creates a problem, because the integer values have a natural ordered relationship between each other.
 Here we do not want to assign any order to the color categories. In this case, a one-hot encoding can be applied to the integer representation. This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value.
For example,

Red | Green | Blue
1   | 0     | 0
0   | 1     | 0
0   | 0     | 1

 In the "color" variable example, there are 3 categories and therefore 3 binary variables are needed.
 A "1" value is placed in the binary variable for the color and "0" values for the other colors. This encoding method is very useful for encoding categorical variables where the order of the variable's values does not matter.
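A short sketch contrasting label encoding and one hot encoding on the color example, assuming pandas and scikit-learn; note that LabelEncoder assigns integers alphabetically, so the exact codes may differ from the 1/2/3 example above:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

    # Label encoding: each distinct value gets an integer 0..n-1 (the order is artificial).
    colors["color_label"] = LabelEncoder().fit_transform(colors["color"])

    # One hot encoding: one binary indicator column per category, with no implied order.
    one_hot = pd.get_dummies(colors["color"], prefix="color", dtype=int)
    print(pd.concat([colors, one_hot], axis=1))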
Data Reduction
 When data is collected from different data sources for analysis, it results in a huge amount of data. It is difficult for a data analyst to deal with this large volume of data.
 It is even difficult to run complex queries on the huge amount of data, as it takes a long time and sometimes it even becomes impossible to track the desired data.
 Data reduction is an essential and important phase in data preprocessing that is carried out to remove the unimportant or unwanted features from a dataset.
 Data reduction techniques obtain a reduced representation of the data while minimising the loss of information content.
 The data reduction process reduces the volume of the original data and represents it in a much smaller volume. Data reduction techniques ensure the integrity of the data while reducing it.
 Data reduction is a preprocessing technique which helps in obtaining a reduced representation of the data set (i.e., a data set having a much smaller volume of data) from the available data set.
 Strategies for data reduction include dimensionality reduction, data cube aggregation and numerosity reduction.

Dimensionality Reduction:
 Dimensionality reduction is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains nearly all (ideally, all) of the information while reducing the width of the data.
 Working with a high-dimensional space can be undesirable for many reasons: raw data is often sparse, and it results in high computational cost. Dimensionality reduction is common in fields that deal with large numbers of instances and columns.
 It can be divided into two main components: feature selection (also known as attribute subset selection) and feature extraction.

Type 1: Feature Selection
Feature selection is the process of deciding which variables (features, characteristics, categories, etc.) are most important to your analysis. These features will be used to train ML models. It is important to remember that the more features you choose to use, the longer the training process and, sometimes, the less accurate your results, because some feature characteristics may overlap or be less present in the data.
Feature selection is the process of extracting a subset of features from the original set of all features of a dataset to obtain a smaller subset that can be used to model a given problem.
A few of the standard techniques used for feature selection are:

Univariate Selection method works by inspecting each feature and


then finding the best feature based on statistical tests. It also
analyses the capability of these features in accordance with the
response variable.

Recursive Feature Elimination method works by performing a greedy search to acquire the best feature subset from a given dataset. This is done in an iterative process by determining the best or the worst feature at each iteration.
Heuristic methods (used due to the exponential number of possible subsets):
 Step-wise forward selection method
 Step-wise backward elimination method
 Combining forward selection and backward
elimination method
 Decision-tree induction method

Stepwise Forward Selection method initially starts with an empty set of attributes, which is considered the minimal set. In each iteration the most relevant attribute is then added to the minimal set until the stopping rule is satisfied. One of the stopping rules is to stop when all remaining variables have a p-value above some threshold.

Stepwise Backward Elimination method initially starts with the full set of attributes, which is considered the initial set. In each iteration the most irrelevant attribute is then removed from the set until the stopping rule is satisfied. One of the stopping rules is to stop when all remaining variables have a significant p-value defined by some significance threshold.

Combination of Forward Selection and Backward Elimination


method is commonly used for attribute subset selection and works
by combining both the methods of forward selection and backward
elimination.

Decision Tree Induction method uses the concept of decision trees for attribute selection. A decision tree consists of several nodes that have branches. The nodes of a decision tree indicate a test applied on an attribute, while the branches indicate the outcomes of the test. The decision tree helps in discarding the irrelevant attributes, which are the attributes that are not part of the tree.
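A hedged sketch of two of the techniques above, univariate selection and recursive feature elimination, on synthetic data with scikit-learn; the choice of scoring function and estimator is illustrative, not prescriptive:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif, RFE
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

    # Univariate selection: score each feature independently and keep the best k.
    univariate = SelectKBest(score_func=f_classif, k=4).fit(X, y)
    print("Univariate picks:", univariate.get_support(indices=True))

    # Recursive feature elimination: greedily drop the weakest feature at each iteration.
    rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
    print("RFE picks:", rfe.get_support(indices=True))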

Type 2: Feature Extraction
Feature extraction process is used to reduce the data in a high
dimensional space to a lower dimension space.
While feature selection chooses the most relevant features from
among a set of given features, feature extraction creates a new,
smaller set of features that consists of the most useful information.
Few of the methods for dimensionality reduction include Principal
Component Analysis (PCA), Linear Discriminant Analysis (LDA) and
Generalized Discriminant Analysis (GDA).
These methods are discussed below:

(i) Principal Component Analysis (PCA): PCA is an unsupervised method. Using PCA can help identify correlations between data points. PCA forms the basis of multivariate data analysis based on projection methods. The most important use of PCA is to represent a multivariate data table as a smaller set of variables (summary indices) in order to observe trends, jumps, clusters and outliers. This overview may uncover the relationships between observations and variables, and among the variables.
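A minimal sketch of PCA with scikit-learn, projecting a standard sample table onto its first two principal components (the choice of two components is illustrative):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = load_iris().data                          # 150 samples x 4 variables
    X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to scale

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X_scaled)

    print(X_reduced.shape)                  # (150, 2): the summary indices
    print(pca.explained_variance_ratio_)    # share of variance captured by each component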

(ii) Linear Discriminant Analysis (LDA): LDA is a supervised method of feature extraction. Linear Discriminant Analysis is a dimensionality reduction technique used as a preprocessing step in machine learning and pattern classification applications. The main goal of dimensionality reduction techniques is to reduce the dimensions by removing the redundant and dependent features, transforming the features from a higher-dimensional space to a space with lower dimensions.
However, it can be used only for labeled data and can thus be used only in certain situations. The data has to be normalized before performing LDA.

(iii) Generalized Discriminant Analysis (GDA): GDA deals with nonlinear discriminant analysis using a kernel function operator. Similar to LDA, the objective of GDA is to find a projection of the features into a lower-dimensional space by maximizing the ratio of between-class scatter to within-class scatter. The main idea is to map the input space into a convenient feature space in which variables are nonlinearly related to the input space.

 Feature selection and feature extraction are extensively carried


out as data preprocessing techniques for dimensionality
reduction.
 This helps in removing redundant features, reducing
computation time, as well as in reducing storage space.
 However, dimensionality reduction results in loss of data and
should be used with proper understanding to effectively carry
out data preprocessing before performing analysis of data.

Data Cube Aggregation:


 A data cube (or datacube) is a multi-dimensional ("n-D") array
of values. A data cube is generally used to easily interpret data.
 Data cube is especially useful when representing data together
with dimensions as certain measures of business requirements.
 Data cube aggregation is a process in which information is
gathered and expressed in a summary form for purposes such
as statistical analysis.
 Data cubes store multidimensional aggregated information. For example, suppose a sales report has been prepared to analyze the number of sales of mobile phones per brand in each branch for the years 2009 to 2019. This can be represented in the form of a data cube as shown in Fig. 3.6. This figure has three dimensions: time, brand and branch. Data cubes provide fast access to pre-computed, summarized data, thereby benefiting online analytical processing as well as data mining. They are optimized for analytical purposes so that they can report on millions of records at a time.

Fig. 3.6: Data Cube


The cube created at the lowest abstraction level is referred to as the
base cuboid. The base cuboid should correspond to an individual
entity of interest such as sales or customer.
In other words, the lowest level should be usable, or useful for the
analysis. A cube at the highest level of abstraction is the apex cuboid.
For the sales data in Fig. 3.7 the apex cuboid would give one total -
the total sales for all three years, for all item types, and for all
branches.
Data cubes created for varying levels of abstraction are often
referred to as cuboids, so that a data cube may instead refer to a
lattice of cuboids. Each higher abstraction level further reduces the
resulting data size.
Consider that we have the data of ATTRONICS company sales per quarter for the year 2005 to the year 2010.
If we want to get the annual sales per year, then we just have to aggregate the sales per quarter for each year.
In this way, aggregation provides us with the required data, which is much smaller in size, and thereby we achieve data reduction without losing any data.

Fig. 3.7: Aggregated Data


Data cube aggregation is a multidimensional aggregation which eases multidimensional analysis. As in Fig. 3.8 below, the data cube represents the annual sales of each item for each branch.
The data cube presents pre-computed and summarized data, which supports data mining with fast access.

Fig. 3.8: Data Cube Aggregation

3. Numerosity Reduction:
 Numerosity reduction reduces the data volume by choosing alternative, smaller forms of data representation.
 The numerosity reduction method is used for converting the data to smaller forms so as to reduce the volume of data.
 Numerosity reduction may be either parametric or non-parametric, as explained below.

(1) Parametric methods: A model is used to represent the data, and only the parameters of the model are stored rather than the data itself (for example, we can store the mean, median, mode, or variance of the data as parameters). Examples of parametric models include regression and log-linear models; these are used to approximate the given data.

(2) Non-parametric methods are used for storing reduced representations of the data. Examples of non-parametric models include clustering (grouping data of the same type), histograms (which use binning to approximate the data), sampling, and data cube aggregation.

3.3.4 Data Discretization


Data discretization is characterized as a method of translating
attribute values of continuous data into a finite set of intervals with
minimal information loss. Data discretization simplifies the data
by substituting interval labels for the values of numeric data.
For an attribute such as age, interval labels such as (0-10, 11-20, ...)
or conceptual labels such as (kid, youth, adult, senior) may be
substituted for the raw values.
The data discretization technique is used to divide continuous-valued
attributes into intervals.
We replace the many distinct values of an attribute with labels of small
intervals, so that mining results are presented in a concise and
easily understandable way.
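As a small illustration (the cut-off ages below are assumptions, not taken from the text), numeric age values can be replaced by conceptual labels:

def age_label(age):
    # Map a numeric age to a concept label; the boundaries are illustrative.
    if age <= 10:
        return "kid"
    elif age <= 20:
        return "youth"
    elif age <= 60:
        return "adult"
    return "senior"

print([age_label(a) for a in (4, 15, 37, 68)])  # ['kid', 'youth', 'adult', 'senior']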

The two approaches for discretization are explained below:

1. Top-down Discretization: If we first consider one or a few
points (so-called breakpoints or split points) to divide the whole
range of the attribute, and repeat this method on the resulting
intervals, then the process is known as top-down discretization,
also known as splitting. Top-down methods start from the initial
interval and recursively split it into smaller intervals.

2. Bottom-up Discretization: If we first consider all the distinct
values as split-points and then discard some of them by merging
neighbouring values into intervals, the process is called
bottom-up discretization, also known as merging. Bottom-up methods
start from the set of single-value intervals and iteratively merge
neighboring intervals.

Some of the data discretization strategies are as per followings:

Discretization by Binning:
 Binning is one of the most popular methods of data discretization. Data
binning, or bucketing, is a data pre-processing method used to
minimize the effects of small observation errors. The original
data values are divided into small intervals known as bins and
are then replaced by a general value calculated for that bin.

 This has a smoothing effect on the input data and may also
reduce the chances of overfitting in case of small datasets.
 Binning is a top-down splitting technique based on a specified
number of bins. These methods are also used as discretization
methods for data reduction and concept hierarchy generation.
 Binning does not use class information and is therefore an
unsupervised discretization technique. It is sensitive to the
user-specified number of bins, as well as the presence of
outliers.
 For example, attribute values can be discretized by applying
equal-width or equal-frequency binning, and then replacing
each bin's values by the bin mean or median, respectively.
 These techniques can be applied recursively to the resulting
partitions to generate concept hierarchies.
 Distributing values into bins can be done in a number of
ways. One such way is called equal-width binning, in which the
data is divided into n intervals of equal size.
 The width w of each interval is calculated as w = (max_value -
min_value) / n.
 Another way of binning is called equal-frequency binning, in
which the data is divided into n groups and each group contains
approximately the same number of values, as
shown in the example below:

Equal Frequency Binning: Bins have equal frequency.

Equal Width Binning: Bins have equal width; the range of each bin is
defined as [min, min + w), [min + w, min + 2w), ..., [min + (n-1)w, max],
where w = (max - min) / (number of bins).
Equal Frequency:
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output:
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]

Equal Width Binning:
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output:
[5, 10, 11, 13, 15, 35, 50, 55, 72]
[92]
[204, 215]
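The following is a minimal Python sketch, not part of the original text, that reproduces the two outputs above with n = 3 bins; the function names are illustrative.

def equal_frequency_bins(values, n_bins):
    # Split sorted values into n_bins groups of (roughly) equal size;
    # any remainder goes into the last group.
    values = sorted(values)
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins - 1)] + \
           [values[(n_bins - 1) * size:]]

def equal_width_bins(values, n_bins):
    # Split values into n_bins intervals of width w = (max - min) / n_bins.
    values = sorted(values)
    lo, hi = values[0], values[-1]
    w = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # Index of the interval containing v; the maximum is clamped into the last bin.
        idx = min(int((v - lo) / w), n_bins - 1)
        bins[idx].append(v)
    return bins

data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
print(equal_frequency_bins(data, 3))  # [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]
print(equal_width_bins(data, 3))      # [[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]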

 One of the ways of finding the value of n for both equal-width
binning and equal-frequency binning is by plotting a
histogram and then trying different intervals to find an
optimum value of n.
 Both equal-width binning and equal-frequency binning are
unsupervised binning methods, as these methods transform
numerical values into categorical counterparts without using
any class information.
Discretization by Histogram Analysis:
 Like binning, histogram analysis is an unsupervised
discretization technique because it does not use class
information.
 Histogram analysis partitions the values for an attribute into
disjoint ranges called buckets.
 In histogram analysis the histogram distributes an attribute's
observed values into disjoint subsets, often called buckets or
bins.
 A histogram partitions the values of an attribute, A, into disjoint
ranges called buckets or bins. Various partitioning rules, such as
equal-width and equal-frequency partitioning, can be used to define
histograms.
 For example, consider the following data, which are a list of
ATTRONICS Company prices for commonly sold items (rounded
to the nearest dollar). The numbers have been sorted:
 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20,
20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.

 To further reduce the data, it is common to have each bucket
denote a continuous value range for the given attribute.
In Fig. 3.9, each bucket represents a different $10 range for price
(a small counting sketch of these buckets is given at the end of this
subsection).
There are several partitioning rules, including the following:

Equal-width: In an equal-width histogram, the width of each bucket
range is uniform (e.g., the width of $10 for the buckets in Fig. 3.10).

Equal-frequency (or Equal-depth): In an equal-frequency histogram,
the buckets are created so that, roughly, the frequency of each
bucket is constant (i.e., each bucket contains roughly the same
number of contiguous data samples).
With equal-width partitioning, by contrast, the values are partitioned
into equal-size ranges (e.g., as in Fig. 3.10 for price, where each
bucket has a width of $10).
 With an equal-frequency histogram, the values are partitioned
so that, ideally, each partition contains the same number of
data tuples.
 The histogram analysis algorithm can be applied recursively to
each partition in order to automatically generate a multilevel
concept hierarchy, with the procedure terminating once a pre-
specified number of concept levels has been reached.
 A minimum interval size can also be used per level to control
the recursive procedure. This specifies the minimum width of a
partition, or the minimum number of values for each partition
at each level.
 Histograms can also be partitioned based on cluster analysis of
the data distribution, as described next.
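As a rough illustration of the equal-width rule, the following sketch counts how many of the prices listed earlier fall into each $10 bucket; the bucket boundaries ($1-$10, $11-$20, $21-$30) are an assumption consistent with the $10 width mentioned for Fig. 3.9.

from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15,
          15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20,
          20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

width = 10
# Bucket index 0 covers $1-$10, index 1 covers $11-$20, and so on.
counts = Counter((p - 1) // width for p in prices)
for idx in sorted(counts):
    lo, hi = idx * width + 1, (idx + 1) * width
    print(f"${lo}-${hi}: {counts[idx]} items")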

Discretization by Cluster, Decision Tree and Correlation Analysis:


 Clustering, decision tree analysis and correlation analysis can
be used for data discretization.
 Cluster analysis is a popular data discretization method. Cluster
analysis method discretizes a numerical attribute by
partitioning its value into clusters.
 A clustering algorithm can be applied to discretize a numeric
attribute, A, by partitioning the values of A into clusters or
groups.
 Clustering takes the distribution of A into consideration, as well
as the closeness of data points, and therefore is able to
produce high-quality discretization results.
 Clustering can be used to generate a concept hierarchy for A by
following either a top-down splitting strategy or a bottom-up
merging strategy, where each cluster forms a node of the
concept hierarchy.
 In the former, each initial cluster or partition may be further
decomposed into several sub-clusters, forming a lower level of
the hierarchy.
 In the latter, clusters are formed by repeatedly grouping
neighboring clusters in order to form higher-level concepts.
 Techniques to generate decision trees for classification can be
applied to discretization. Such techniques employ a top-down
splitting approach. Unlike the other methods mentioned so far,
decision tree approaches to discretization are supervised, that
is, they make use of class label information.
 For example, we may have a data set of patient symptoms (the
attributes) where each patient has an associated diagnosis class
label.
 Class distribution information is used in the calculation and
determination of split-points (data values for partitioning an
attribute range).
 Intuitively, the main idea is to select split-points so that a given
resulting partition contains as many tuples of the same class as
possible.
 Entropy is the most commonly used measure for this purpose.
To discretize a numeric attribute, A, the method selects the
value of A that has the minimum entropy as a split-point, and
recursively partitions the resulting intervals to arrive at a
hierarchical discretization (see the sketch below).
 Such discretization forms a concept hierarchy for A. Because
decision tree-based discretization uses class information, it is
more likely that the interval boundaries (split-points) are
defined to occur in places that may help improve classification
accuracy.
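The following is a minimal sketch of the entropy-based idea on a tiny made-up labelled data set (ages with a diagnosis class label): every midpoint between adjacent attribute values is tried as a candidate split-point, and the one minimizing the weighted (expected) entropy of the two resulting partitions is selected.

from math import log2
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels.
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_split(values, labels):
    # Try each midpoint between adjacent values; keep the split with minimum
    # weighted entropy of the two partitions it produces.
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        split = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= split]
        right = [lab for v, lab in pairs if v > split]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or info < best[1]:
            best = (split, info)
    return best

ages = [23, 25, 30, 35, 40, 46, 52, 60]
diagnosis = ["neg", "neg", "neg", "pos", "pos", "pos", "pos", "pos"]
print(best_split(ages, diagnosis))  # split at 32.5 separates the two classes cleanly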
 Measures of correlation can be used for discretization.
ChiMerge is a χ²-based (chi-square-based) discretization method which
uses a bottom-up approach: it finds the most similar neighboring
intervals and merges them to form larger intervals, recursively.
 As with decision tree analysis, ChiMerge is supervised in that it
uses class information. The basic notion is that for accurate
discretization, the relative class frequencies should be fairly
consistent within an interval.
 Therefore, if two adjacent intervals have a very similar
distribution of classes, then the intervals can be merged.
Otherwise, they should remain separate.
 ChiMerge proceeds as follows (a minimal sketch is given after this list):
 Initially, each distinct value of a numeric attribute A is
considered to be one interval. χ² tests are performed for every
pair of adjacent intervals.
 Adjacent intervals with the least χ² values are merged
together, because low χ² values for a pair indicate similar class
distributions. This merging process proceeds recursively until a
predefined stopping criterion is met.
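The following is a minimal sketch of a single ChiMerge merging step, assuming a small made-up table of class counts per single-value interval; a full implementation would repeat the merge until a stopping criterion (such as a χ² threshold or a maximum number of intervals) is met.

def chi_square(a, b):
    # Chi-square statistic for two adjacent intervals, where a and b are
    # per-class counts, e.g. a = [count of class 0, count of class 1].
    total = sum(a) + sum(b)
    chi = 0.0
    for j in range(len(a)):
        col = a[j] + b[j]                      # total count of class j
        for row in (a, b):
            expected = sum(row) * col / total  # expected count under independence
            if expected > 0:
                chi += (row[j] - expected) ** 2 / expected
    return chi

# Class counts [class "low", class "high"] for four single-value intervals.
intervals = [("1-1", [3, 0]), ("2-2", [2, 0]), ("5-5", [0, 4]), ("7-7", [1, 3])]

# Compute chi-square for each adjacent pair and merge the most similar pair.
scores = [chi_square(intervals[i][1], intervals[i + 1][1])
          for i in range(len(intervals) - 1)]
i = scores.index(min(scores))
merged = [x + y for x, y in zip(intervals[i][1], intervals[i + 1][1])]
print(scores)                                         # the "1-1"/"2-2" pair scores 0.0
print(intervals[i][0], "+", intervals[i + 1][0], "->", merged)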
