Chapter 3: Data Preprocessing
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
Data cleaning:
Noisy data:…
For Bin 2:
(21 + 21 + 24 + 26) / 4 = 23
Bin 2 = 23, 23, 23, 23
For Bin 3:
(27 + 30 + 30 + 34) / 4 = 30.25 ≈ 30
Bin 3 = 30, 30, 30, 30
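For reference, a minimal Python sketch of smoothing by bin means is shown below; the Bin 1 values are assumed for illustration, since only Bins 2 and 3 appear above.

# Smoothing by bin means with equal-frequency bins of size 4.
# The Bin 1 values are assumed; Bins 2 and 3 match the worked example above.
data = sorted([8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34])

bin_size = 4
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

# Replace every value in a bin by the (rounded) bin mean.
smoothed = [[round(sum(b) / len(b))] * len(b) for b in bins]
print(smoothed)
# [[12, 12, 12, 12], [23, 23, 23, 23], [30, 30, 30, 30]]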
Data integration:
Data reduction:
Data Transformation:
Data Attributes
An attribute is a property or characteristic of an object. A data
attribute is a single-value descriptor for a data object; for example,
the eye color of a person or the name of a student.
An attribute is also known as a variable, field, characteristic, or feature.
The distribution of data involving one attribute (or variable) is called
univariate. A bivariate distribution involves two attributes, and so on.
Data Objects
A data object is described by a collection of attributes. Data objects can
also be referred to as samples, examples, instances, cases, entities, data
points, or objects. If the data objects are stored in a database, they
are data tuples; that is, the rows of a database correspond to the
data objects, and the columns correspond to the attributes
(see Table 3.1).
Consider the case study of a company named ATTRONICS that is
described by the relation tables customer, item, employee, and branch:

customer(Cust_ID, Name, Address, Age, Occupation, Annual_Income, Credit_Information, Category)
purchases(Trans_ID, Cust_ID, Emp_ID, Date, Time, Method_Paid, Amount)

Similarly, each of the relations Item, Employee, and Branch (see Table
3.1) consists of a set of attributes describing the properties of these
entities. The tuples/rows in a table are known as data objects.
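To make the rows-as-objects and columns-as-attributes idea concrete, the following small pandas sketch builds two hypothetical customer tuples; the values are invented, and only the attribute names follow the schema above.

import pandas as pd

# Each row (tuple) is a data object; each column is an attribute.
customers = pd.DataFrame([
    {"Cust_ID": "C01", "Name": "Ali", "Age": 28,
     "Annual_Income": 40000, "Category": "Gold"},
    {"Cust_ID": "C02", "Name": "Akram", "Age": 35,
     "Annual_Income": 55000, "Category": "Silver"},
])

print(customers.shape)            # (2, 5): 2 data objects, 5 attributes
print(list(customers.columns))    # the attribute names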
What is an Attribute?
Types of attributes
We need to distinguish between different types of attributes during
data preprocessing. First, we need to distinguish between qualitative
and quantitative attributes.
Quantitative data is anything that can be counted or measured; it
refers to numerical data. Qualitative data is descriptive, referring to
things that can be observed but not measured—such as colors or
emotions.
1. Qualitative Attributes, such as Nominal, Ordinal, and Binary attributes.
2. Quantitative Attributes, such as Discrete and Continuous attributes.
There are different types of attributes; some of these attributes are
mentioned below:
Example of attribute
In this example, RollNo, Name, and Result are
attributes of a student object.
RollNo   Name    Result
1        Ali     Pass
2        Akram   Fail
Types of attributes
• Binary
• Nominal
• Ordinal
• Numeric
  o Interval-scaled
  o Ratio-scaled
Nominal Attributes
Binary Attributes
1. Symmetric binary
2. Asymmetric binary
Ordinal Attributes
Numeric Attributes
A numeric attribute is quantitative; that is, it is a
measurable quantity, represented in integer or real
values. Numeric attributes can be interval-scaled or
ratio-scaled.
Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of
equal-sized units. The values of interval-scaled
attributes have order and can be positive, 0, or
negative. Thus, in addition to providing a ranking of
values, such attributes allow us to compare and
quantify the difference between values.
Example 2.4 Interval-scaled attributes.
Temperature is an interval-scaled attribute. Suppose
that we have the outdoor temperature value for a
number of different days, where each day is an object.
By ordering the values, we obtain a ranking of the
objects with respect to temperature. In addition, we
can quantify the difference between values. For
example, a temperature of 20°C is 5 degrees higher
than a temperature of 15°C. Calendar dates are
another example. For instance, the years 2002 and
2010 are 8 years apart.
Because interval-scaled attributes are numeric, we can
compute their mean value, in addition to the median
and mode measures of central tendency.
Ratio-Scaled Attributes
A ratio-scaled attribute is a numeric attribute with an
inherent zero-point. That is, if a measurement is
ratio-scaled, we can speak of a value as being a multiple (or
ratio) of another value. In addition, the values are
ordered, and we can also compute the difference
between values. The mean, median, and mode can be
computed as well.
Example 2.5 Ratio-scaled attributes. Unlike
temperatures in Celsius and Fahrenheit, the Kelvin (K)
temperature scale has what is considered a true
zero-point (0 K = −273.15°C): it is the point at
which the particles that comprise matter have zero
kinetic energy. Other examples of ratio-scaled
attributes include count attributes such as years of
experience (where the objects are employees, for
example) and number of words (where the objects are
documents). Additional examples include attributes to
measure weight, height, latitude and longitude
coordinates (e.g., when clustering houses), and
monetary quantities (e.g., you are 100 times richer
with $100 than with $1).
Discrete Attributes
Discrete data have a finite set of values; they can be in
numerical or categorical form.
Discrete attributes are quantitative attributes. A
discrete attribute has a finite or countably infinite set
of values, which may or may not be represented as
integers. The attributes Hair color, Smoker, Medical
test, and Drink size each have a finite number of
values, and so are discrete. Note that discrete
attributes may have numeric values, such as 0 and 1
for binary attributes, or, the values 0 to 110 for the
attribute Age. An attribute is countably infinite if the
set of possible values is infinite, but the values can be
put in a one-to-one correspondence with natural
numbers. For example, the attribute customer ID is
countably infinite. The number of customers can grow
to infinity, but in reality, the actual set of values is
countable (where the values can be put in one-to-one
correspondence with the set of integers). Zip codes are
another example.
Continuous Attribute
Continuous data technically have an infinite number of
steps; there can be many numbers between 1 and 2.
Continuous data are of float type. These attributes
are quantitative attributes.
Example of Continuous Attribute
Attribute   Value
Height      5.4…, 6.5…, etc.
Weight      50.09…, etc.
Noisy Data
The noisy data contains errors or outliers. For example, for
stored employee details, all values of the age attribute are
within the range 22-45 years whereas one record reflects the
age attribute value as 80.
There are times when the data is not missing, but it is
corrupted for some reason. This is, in some ways, a bigger
problem than missing data.
Data corruption may be a result of faulty data collection
instruments, data entry problems, or technology limitations.
Duplicate Entries:
Duplicate entries in a dataset are a big problem; before starting
analysis, such duplicates should be identified and handled properly.
In those cases, we usually want to compact them into one
entry, adding an additional column that indicates how many
times the entry occurred. In other cases, the duplication is
purely a result of how the data was generated.
For example, it might be derived by selecting several columns
from a larger dataset, and there are no duplicates if we count
the other columns.
Data duplication can also occur when you are trying to group
data from various sources. This is a common issue with
organizations that use webpage scraping tools to accumulate
data from various websites.
Duplicate entries cause many problems: they lead to data
redundancy, inconsistencies, and degraded data quality, which
affect data analysis outcomes.
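As one way to handle this (a sketch, not a prescribed method), duplicates can be compacted into a single entry with an occurrence count using pandas; the records below are hypothetical.

import pandas as pd

# Hypothetical scraped records containing exact duplicates.
df = pd.DataFrame({
    "name": ["Ali", "Ali", "Akram", "Ali"],
    "city": ["Lahore", "Lahore", "Karachi", "Lahore"],
})

# Compact duplicates into one row and record how many times each row occurred.
compacted = df.groupby(list(df.columns)).size().reset_index(name="count")
print(compacted)
#     name     city  count
# 0  Akram  Karachi      1
# 1    Ali   Lahore      3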
NULLS:
If the value of an attribute is not known, it is treated as NULL.
NULLs can arise because the data collection process failed
in some way.
When it comes time to do analytics, NULLs cannot be processed
by many algorithms.
In these cases, it is often necessary to replace the missing
values with some reasonable proxy.
Most often, the missing value is guessed from other data
fields, or simply replaced with the mean of all the non-null values.
For example, if the mean of the age attribute for all Gold-category
customers is 35, then the NULL age value of customer C04 can be
replaced with 35. Also, in some cases, the NULL
values arise because that data was never collected.
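A minimal pandas sketch of the replacement described above: the NULL age of customer C04 is filled with the mean age of its category. The other ages are assumed so that the Gold-category mean comes out to 35.

import pandas as pd

# Hypothetical customer data; C04 has a NULL (missing) Age.
df = pd.DataFrame({
    "Cust_ID": ["C01", "C02", "C03", "C04"],
    "Category": ["Gold", "Gold", "Gold", "Gold"],
    "Age": [30.0, 35.0, 40.0, None],
})

# Replace NULL ages with the mean age of the customer's category.
df["Age"] = df.groupby("Category")["Age"].transform(lambda s: s.fillna(s.mean()))
print(df)   # C04's Age becomes 35.0, the mean of the non-null Gold ages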
Huge Outliers:
An outlier is a data point that differs significantly from other
observations.
They are extreme values that deviate from other observations
on data; they may indicate variability in a measurement,
experimental errors or a novelty.
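The text does not fix a detection rule; as one common option, a sketch using the 1.5 × IQR rule flags the age value 80 from the earlier employee example.

import numpy as np

# Ages mostly in the 22-45 range, plus one suspicious value of 80.
ages = np.array([22, 25, 27, 30, 31, 33, 35, 38, 40, 42, 45, 80])

q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(ages[(ages < low) | (ages > high)])   # [80]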
Artificial Entries:
Many industrial datasets have artificial entries that have been
deliberately inserted into the real data.
This is usually done for purposes of testing the software
systems that process the data.
Irregular Spacings:
Many datasets include measurements taken at regular
spacings. For example, you could have the traffic to a website
every hour or the temperature of a physical object measured at
every inch.
Most of the algorithms that process data such as this assume
that the data points are equally spaced, which presents a major
problem when they are irregular. If the data is from sensors
measuring something such as temperature, then typically we
have to use interpolation techniques (interpolation is the
process of using known data values to estimate unknown
data values) to generate new values at a set of equally spaced
points.
A special case of irregular spacings happens when two entries
have identical timestamps but different numbers. This usually
happens because the timestamps are only recorded to finite
precision.
If two measurements happen within the same minute, and
time is only recorded up to the minute, then their timestamps
will be identical.
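A minimal NumPy sketch of linear interpolation onto equally spaced points; the sensor readings below are invented for illustration.

import numpy as np

# Temperature readings taken at irregular times (in minutes).
times = np.array([0.0, 1.2, 2.9, 4.1, 6.0])
temps = np.array([20.0, 20.5, 21.3, 21.1, 22.0])

# Estimate values on an equally spaced grid using linear interpolation.
regular_times = np.arange(0.0, 6.1, 1.0)            # 0, 1, 2, ..., 6
regular_temps = np.interp(regular_times, times, temps)

print(np.round(regular_temps, 2))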
Formatting Issues:
• Various formatting issues are explained below:
Extra Whitespaces:
A white space is a blank space within the text. Appropriate use
of white spaces increases readability and focuses the reader's
attention.
For example, within a text, white spaces split big chunks of text
into small paragraphs, which makes them easier to understand.
A string with and without blank spaces is not the same: "ABC" != " ABC".
These two ABCs are not equal, but the difference is so
small that you often don't notice it.
Without the quotes enclosing the strings, you would hardly notice
that ABC != ABC. But computer programs are incorruptible in their
interpretation, and if these values are a merging key, we would
receive an empty result.
Blank strings, spaces, and tabs are often treated as empty
values and represented as NaN; sometimes this causes
unexpected results.
Also, even though white spaces are almost invisible, pile
millions of them into a file and they will take up some space,
and they may overflow the size limit of your database column,
leading to an error.
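A small pandas sketch of the merging-key problem described above, with hypothetical records; stripping the whitespace restores the match.

import pandas as pd

# "ABC" and " ABC" look the same on screen but are different merge keys.
left = pd.DataFrame({"code": ["ABC", "XYZ "], "price": [10, 20]})
right = pd.DataFrame({"code": [" ABC", "XYZ"], "qty": [1, 2]})

print(len(left.merge(right, on="code")))   # 0 rows: nothing matches

# Stripping leading/trailing whitespace fixes the keys.
left["code"] = left["code"].str.strip()
right["code"] = right["code"].str.strip()
print(left.merge(right, on="code"))        # both rows now match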
Data Transformation
Data transformation is the process of converting raw data into
a format or structure that is more suitable for data analysis.
It is a data preprocessing technique that transforms the data
into alternative forms appropriate for mining, converting raw
data into a single, easy-to-read format that facilitates analysis.
Data transformation may change the format, structure, or values
of the data, and the choice of transformation technique depends
on how the data will later be used for analysis.
For example, changing date and time formats is a format
transformation. Renaming, moving, and combining columns in a
database are structural transformations of data.
Transformation of data values converts them into a range of
values that is easier to analyze; this is done because the values
for different attributes are often found on very different scales.
For example, in a company, age values for employees may lie
within the range of 20-55 years, whereas salary values may lie
within the range of Rs. 10,000 - Rs. 1,00,000.
This means one column in a dataset can carry more weight than
others due to the varying range of values. In such cases,
applying statistical measures for data analysis across this
dataset may lead to unnatural or incorrect results.
Data transformation is therefore required to solve this issue
before any analysis of the data is applied.
Various data transformation techniques are used during data
preprocessing. The choice of data transformation technique
depends on how the data will be later used for analysis.
Some of these important standard data preprocessing
techniques are Rescaling, Normalizing, Binarizing, Standardizing,
Label Encoding, and One-Hot Encoding.
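As an illustration of rescaling, the first technique listed, a minimal min-max sketch brings the age and salary columns from the earlier example onto the same 0-1 scale; the three rows are invented.

import numpy as np

# Column 0: age (20-55 range), column 1: salary (10,000-100,000 range).
X = np.array([[25.0, 20000.0],
              [40.0, 60000.0],
              [55.0, 100000.0]])

# Min-max rescaling: x' = (x - min) / (max - min), column by column.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)   # both columns now lie in [0, 1], so neither dominates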
Binarizing:
It is the process of converting data to either 0 or 1 based on a
threshold value.
All the data values above the threshold value are marked as 1,
whereas all the data values equal to or below the threshold
value are marked as 0.
Data binarizing is done prior to data analysis in many cases,
such as dealing with crisp values for the handling of
probabilities and adding new meaningful features to the dataset.
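A minimal sketch of binarizing, assuming a threshold of 30 for illustration:

import numpy as np

values = np.array([12, 30, 31, 45, 7])
threshold = 30                       # assumed threshold for illustration

# Above the threshold -> 1; equal to or below -> 0.
binarized = (values > threshold).astype(int)
print(binarized)                     # [0 0 1 1 0]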
Standardizing:
Standardization is also called mean removal. It is the process of
transforming attributes having a Gaussian (normal) distribution
with differing mean and standard deviation values
into a standard Gaussian distribution with a mean of 0 and a
standard deviation of 1.
In other words, Standardization is another scaling technique
where the values are centered around the mean with a unit
standard deviation.
This means that the mean of the attribute becomes zero and
the resultant distribution has a unit standard deviation.
Standardization of data is done prior to data analysis in many
cases, such as linear discriminant analysis and linear regression.
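A minimal sketch of standardization using the z-score formula z = (x − mean) / std; the age values are invented.

import numpy as np

ages = np.array([20.0, 30.0, 40.0, 50.0])

# Center on the mean and divide by the standard deviation.
z = (ages - ages.mean()) / ages.std()
print(np.round(z, 3))                          # [-1.342 -0.447  0.447  1.342]
print(round(z.mean(), 6), round(z.std(), 6))   # 0.0 and 1.0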
5. Label Encoding:
The label encoding process is used to convert textual labels into
numeric form in order to prepare them for use in a
machine-readable form.
The labels are assigned a value of 0 to (n-1) where n is the
number of distinct values for a particular categorical feature.
The numeric values are repeated for the same label of that
attribute. For instance, let us consider the feature 'gender'
having two values - male and female.
Using label encoding, each gender value will be marked with
unique numerical values starting with 0. Thus males will be
marked 0, females will be marked 1.
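A minimal sketch following the convention described above (male → 0, female → 1); note that library encoders such as scikit-learn's LabelEncoder may assign codes in a different (alphabetical) order.

genders = ["male", "female", "female", "male"]

# Assign 0..(n-1) in order of first appearance: male -> 0, female -> 1.
codes = {value: i for i, value in enumerate(dict.fromkeys(genders))}
encoded = [codes[g] for g in genders]

print(codes)     # {'male': 0, 'female': 1}
print(encoded)   # [0, 1, 1, 0]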
Dimensionality Reduction:
Dimensionality reduction is the transformation of data from a
high-dimensional space into a low-dimensional space, so that
the low-dimensional representation retains nearly all of the
information (ideally, all of it) while reducing the width of the data.
Working with a high-dimensional space can be undesirable
for many reasons: raw data is often sparse, and processing it
results in a high computational cost. Dimensionality reduction
is common in fields that deal with large numbers of instances
and columns.
It can be divided into two main components - feature selection
(also known as attribute subset selection) and feature
extraction.
Type 1:
Feature Selection:
Feature selection is the process of deciding which
variables (features, characteristics, categories, etc.) are
most important to your analysis. These features will be
used to train ML models. It's important to remember
that the more features you choose to use, the longer
the training process and, sometimes, the less accurate
your results, because some feature characteristics may
overlap or be less present in the data.
Feature selection is the process of extracting a subset of features
from the original set of all features of a dataset to obtain a smaller
subset that can be used to model a given problem.
A few of the standard techniques used for feature selection are:
Type 2:
Feature Extraction:
Feature extraction process is used to reduce the data in a high
dimensional space to a lower dimension space.
While feature selection chooses the most relevant features from
among a set of given features, feature extraction creates a new,
smaller set of features that consists of the most useful information.
A few of the methods for dimensionality reduction include Principal
Component Analysis (PCA), Linear Discriminant Analysis (LDA), and
Generalized Discriminant Analysis (GDA).
These methods are discussed below:
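Ahead of those discussions, a minimal scikit-learn sketch of one of the methods named above (PCA) projects made-up four-dimensional data onto two principal components.

import numpy as np
from sklearn.decomposition import PCA

# Invented data: 100 objects with 4 attributes, one of them nearly redundant.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 2): a lower-dimensional space
print(pca.explained_variance_ratio_)    # share of variance kept per component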
3. Numerosity Reduction:
Numerosity reduction reduces the data volume by choosing
alternative, smaller forms of data representation; it converts
the data to smaller forms so as to reduce the volume of data.
Numerosity reduction may be either parametric or non-parametric,
as explained below.
Discretization by Binning:
Binning is one of the most famous methods of data discretization.
Data binning, or bucketing, is a data preprocessing method used to
minimize the effects of small observation errors. The original
data values are divided into small intervals known as bins, and
they are then replaced by a general value calculated for that bin.
This has a smoothing effect on the input data and may also
reduce the chance of overfitting in the case of small datasets.
Binning is a top-down splitting technique based on a specified
number of bins. These methods are also used as discretization
methods for data reduction and concept hierarchy generation.
Binning does not use class information and is therefore an
unsupervised discretization technique. It is sensitive to the
user-specified number of bins, as well as the presence of
outliers.
For example, attribute values can be discretized by applying
equal-width or equal-frequency binning, and then replacing
each bin value by the bin mean or median respectively.
These techniques can be applied recursively to the resulting
partitions to generate concept hierarchies.
Distributing values into bins can be done in a number of
ways. One such way is called equal-width binning, in which the
data is divided into n intervals of equal size.
The width w of each interval is calculated as
w = (max_value − min_value) / n.
Another way of binning is called equal-frequency binning, in
which the data is divided into n groups and each group contains
approximately the same number of values, as shown in the
example below:
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Equal-frequency output (n = 3, four values per bin):
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]
For comparison, equal-width output on the same data
(n = 3, w = (215 − 5) / 3 = 70):
[5, 10, 11, 13, 15, 35, 50, 55, 72]
[92]
[204, 215]
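A minimal Python sketch reproducing both partitionings of the input list above: equal frequency with four values per bin, and equal width with w = 70.

data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
n = 3

# Equal-frequency binning: each bin holds len(data) / n = 4 sorted values.
k = len(data) // n
freq_bins = [data[i:i + k] for i in range(0, len(data), k)]
print(freq_bins)    # [[5, 10, 11, 13], [15, 35, 50, 55], [72, 92, 204, 215]]

# Equal-width binning: w = (max - min) / n = 70, edges at 5, 75, 145, 215.
w = (max(data) - min(data)) / n
edges = [min(data) + i * w for i in range(n + 1)]
width_bins = [[x for x in data
               if edges[i] <= x < edges[i + 1] or (i == n - 1 and x == edges[-1])]
              for i in range(n)]
print(width_bins)   # [[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]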
In Fig. 3.9, each bucket represents a different $10 range for price.
There are several partitioning rules, including the following: