Department of Computer Science: Prepared By: Ms. Zainab Imtiaz



Data Preprocessing
Data preprocessing (data preparation) is the process of detecting and correcting (or removing) corrupt or inaccurate records from a dataset. It involves identifying incorrect, incomplete, or irrelevant parts of the data and then modifying, replacing, or deleting them.
Data preprocessing is extremely important because it improves the quality of the raw experimental data.

There are five stages of preprocessing.

• Aggregation
• Sampling
• Dimensionality Reduction
• Feature Subset Selection
• Data Cleaning
Data Aggregation

Data aggregation is any process in which information is gathered and expressed in a summary form, for purposes such as statistical analysis. A common aggregation purpose is to get more information about particular groups based on specific variables such as age, profession, or income.
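
For illustration, a minimal sketch of this kind of aggregation using pandas (the table and column names below are made up for the example):

import pandas as pd

# Hypothetical employee records (illustrative values only)
df = pd.DataFrame({
    "profession": ["engineer", "engineer", "teacher", "teacher", "teacher"],
    "age": [25, 34, 29, 41, 36],
    "income": [52000, 67000, 41000, 48000, 45000],
})

# Aggregate: express the data in summary form, one row per profession
summary = df.groupby("profession").agg(
    employees=("income", "size"),
    mean_age=("age", "mean"),
    mean_income=("income", "mean"),
)
print(summary)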

Sampling
Sampling is a process in which a predetermined number of
observations are taken from a larger population. The
methodology used to sample from a larger population depends
on the type of analysis being performed, but it may include
simple random sampling or systematic sampling.
1. Simple random sampling
In a simple random sample, every member of the
population has an equal chance of being selected. Your
sampling frame should include the whole population.

Example
You want to select a simple random sample of 100
employees of Company X. You assign a number to every
employee in the company database from 1 to 1000, and
use a random number generator to select 100 numbers.
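
A minimal sketch of this example in Python (employee IDs 1 to 1000 are assumed, as above):

import random

# Population: employee IDs 1..1000
employee_ids = list(range(1, 1001))

# Draw 100 IDs without replacement; every employee has an
# equal chance of being selected
sample = random.sample(employee_ids, k=100)
print(len(sample), sorted(sample)[:5])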

2. Systematic sampling
Systematic sampling is similar to simple random sampling, but
it is usually slightly easier to conduct. Every member of the
population is listed with a number, but instead of randomly
generating numbers, individuals are chosen at regular intervals.
Example
All employees of the company are listed in alphabetical order.
From the first 10 numbers, you randomly select a starting
point: number 6. From number 6 onwards, every 10th person
on the list is selected (6, 16, 26, 36, and so on), and you end up
with a sample of 100 people.
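
A minimal sketch of this example in Python (again assuming 1,000 employees listed in order):

import random

# Population: 1,000 employees listed in alphabetical order (IDs 1..1000)
employee_ids = list(range(1, 1001))

interval = 10                               # 1000 employees / sample of 100
start = random.randint(1, interval)         # random starting point, e.g. 6
sample = employee_ids[start - 1::interval]  # every 10th person from the start
print(len(sample), sample[:4])              # 100 people, e.g. [6, 16, 26, 36]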
3. Stratified sampling
This sampling method is appropriate when the population has
mixed characteristics, and you want to ensure that every
characteristic is proportionally represented in the sample.
You divide the population into subgroups (called strata) based
on the relevant characteristic (e.g. gender, age range, income
bracket, job role).
From the overall proportions of the population, you calculate
how many people should be sampled from each subgroup.
Then you use random or systematic sampling to select a sample
from each subgroup.

Example
The company has 800 female employees and 200 male
employees. You want to ensure that the sample reflects the
gender balance of the company, so you sort the population into
two strata based on gender. Then you use random sampling on
each group, selecting 80 women and 20 men, which gives you a
representative sample of 100 people.
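
A minimal sketch of this example using pandas (the 800/200 gender split comes from the example above; column names are illustrative):

import pandas as pd

# Hypothetical workforce: 800 female and 200 male employees
df = pd.DataFrame({
    "employee_id": range(1, 1001),
    "gender": ["F"] * 800 + ["M"] * 200,
})

# Sample 10% from each stratum so the sample keeps the 80/20 gender balance
sample = df.groupby("gender").sample(frac=0.1, random_state=42)
print(sample["gender"].value_counts())  # F 80, M 20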
4. Cluster sampling
Cluster sampling also involves dividing the population into
subgroups, but each subgroup should have similar
characteristics to the whole sample. Instead of sampling
individuals from each subgroup, you randomly select entire
subgroups.
If it is practically possible, you might include every individual
from each sampled cluster. If the clusters themselves are large,
you can also sample individuals from within each cluster using
one of the techniques above.
This method is good for dealing with large and dispersed
populations, but there is more risk of error in the sample, as
there could be substantial differences between clusters. It’s
difficult to guarantee that the sampled clusters are really
representative of the whole population.
Example
The company has offices in 10 cities across the country (all with
roughly the same number of employees in similar roles). You
don’t have the capacity to travel to every office to collect your
data, so you use random sampling to select 3 offices – these
are your clusters.
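
A minimal sketch of this example in Python (the office names are placeholders):

import random

# The 10 city offices are the clusters
offices = [f"office_{i}" for i in range(1, 11)]

# Randomly select 3 entire clusters; everyone in the chosen
# offices is then included in the study
sampled_offices = random.sample(offices, k=3)
print(sampled_offices)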
Dimensionality Reduction

In machine learning classification problems, there are often too many factors on the basis of which the final classification is done. These factors are basically variables called features. The higher the number of features, the harder it gets to visualize the training set and then work on it. Sometimes, most of these features are correlated, and hence redundant. This is where dimensionality reduction algorithms come into play. Dimensionality reduction is the process of reducing the number of random variables under consideration, by obtaining a set of principal variables.
Methods of Dimensionality Reduction

The various methods used for dimensionality reduction include:

• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)
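
For illustration, a minimal sketch of PCA with scikit-learn (the synthetic data and the choice of two components are assumptions made for the example):

import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 samples with 10 correlated (hence redundant) features
rng = np.random.default_rng(0)
hidden = rng.normal(size=(100, 3))                 # 3 underlying factors
X = hidden @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(100, 10))

# Reduce the 10 original features to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # share of variance kept per component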
Feature Selection

Feature selection is critical to building a good model for several reasons. One reason is that feature selection implies some degree of cardinality reduction: it imposes a cutoff on the number of attributes that can be considered when building a model. Data almost always contains more information than is needed to build the model, or the wrong kind of information.

For example, you might have a dataset with 500 columns that describe the characteristics of customers; however, if the data in some of the columns is very sparse, you would gain very little benefit from adding them to the model, and if some of the columns duplicate each other, using both columns could degrade the model.
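
For illustration, a minimal sketch of these two checks using pandas (the customer columns and the 70% missing-value threshold are illustrative assumptions):

import numpy as np
import pandas as pd

# Hypothetical customer table: an informative column, a very sparse column
# (mostly missing values), and a column that duplicates another
df = pd.DataFrame({
    "age": [23, 45, 31, 52, 37],
    "loyalty_tier": [np.nan, np.nan, np.nan, "gold", np.nan],  # very sparse
    "age_copy": [23, 45, 31, 52, 37],                          # duplicate of "age"
})

# Drop columns that duplicate an earlier column
df = df.loc[:, ~df.T.duplicated()]

# Drop columns with more than 70% missing values
df = df[df.columns[df.isna().mean() <= 0.7]]

print(df.columns.tolist())  # ['age']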
