Department of Computer Science: Prepared By: Ms. Zainab Imtiaz
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature Subset Selection
• Data Cleaning
Data Aggregation
Sampling
Sampling is the process of taking a predetermined number of
observations from a larger population. The method used to
sample from the population depends on the type of analysis
being performed. Common methods are simple random,
systematic, stratified, and cluster sampling.
1. Simple random sampling
In a simple random sample, every member of the
population has an equal chance of being selected. Your
sampling frame should include the whole population.
Example
You want to select a simple random sample of 100
employees of Company X. You assign a number to every
employee in the company database from 1 to 1000, and
use a random number generator to select 100 numbers.
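The example above can be sketched in a few lines of Python. This is only an illustration: the function name simple_random_sample, the employee_ids list, and the seed value are assumptions made here, not part of the original example. The seed is used only so that the run can be repeated.

```python
import random


def simple_random_sample(population_ids, sample_size, seed=None):
    """Draw sample_size distinct IDs; every ID is equally likely to be picked."""
    rng = random.Random(seed)  # seed only makes the draw repeatable
    return rng.sample(population_ids, sample_size)  # sampling without replacement


# Hypothetical employee numbers 1..1000, standing in for the company database.
employee_ids = list(range(1, 1001))
sample = simple_random_sample(employee_ids, 100, seed=42)
print(len(sample))
```

Because random.sample draws without replacement, no employee can appear twice in the sample.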
2. Systematic sampling
Systematic sampling is similar to simple random sampling, but
it is usually slightly easier to conduct. Every member of the
population is listed with a number, but instead of randomly
generating numbers, individuals are chosen at regular intervals.
Example
All employees of the company are listed in alphabetical order.
From the first 10 numbers, you randomly select a starting
point: number 6. From number 6 onwards, every 10th person
on the list is selected (6, 16, 26, 36, and so on), and you end up
with a sample of 100 people.
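A minimal sketch of the same idea in Python follows. The function name systematic_sample and the employees list are hypothetical. The starting point is drawn at random, so it will not necessarily be number 6 as in the example.

```python
import random


def systematic_sample(items, interval, seed=None):
    """Pick a random start within the first interval, then every interval-th item."""
    rng = random.Random(seed)
    start = rng.randrange(interval)  # 0-based position among the first `interval` items
    return items[start::interval]


# Hypothetical list of 1000 employees, already in alphabetical order.
employees = list(range(1, 1001))
sample = systematic_sample(employees, interval=10, seed=0)
print(len(sample), sample[:4])
```

With 1000 employees and an interval of 10, the sample always contains 100 people, whichever starting point is drawn.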
3. Stratified sampling
This sampling method is appropriate when the population has
mixed characteristics, and you want to ensure that every
characteristic is proportionally represented in the sample.
You divide the population into subgroups (called strata) based
on the relevant characteristic (e.g. gender, age range, income
bracket, job role).
From the overall proportions of the population, you calculate
how many people should be sampled from each subgroup.
Then you use random or systematic sampling to select a sample
from each subgroup.
Example
The company has 800 female employees and 200 male
employees. You want to ensure that the sample reflects the
gender balance of the company, so you sort the population into
two strata based on gender. Then you use random sampling on
each group, selecting 80 women and 20 men, which gives you a
representative sample of 100 people.
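The calculation in this example can be sketched as below. The function name stratified_sample and the placeholder employee labels are assumptions for illustration. Each stratum receives a share of the sample proportional to its size, and simple random sampling is then used within each stratum.

```python
import random


def stratified_sample(strata, sample_size, seed=None):
    """Sample from each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional allocation; with awkward proportions the rounded
        # counts may not add up exactly to sample_size.
        k = round(sample_size * len(members) / total)
        sample[name] = rng.sample(members, k)
    return sample


# Hypothetical strata matching the example: 800 women and 200 men.
strata = {
    "female": [f"F{i}" for i in range(800)],
    "male": [f"M{i}" for i in range(200)],
}
sample = stratified_sample(strata, 100, seed=1)
print({name: len(members) for name, members in sample.items()})
```

For the 800/200 split the allocation works out to 80 women and 20 men, as in the example.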
4. Cluster sampling
Cluster sampling also involves dividing the population into
subgroups, but each subgroup should have similar
characteristics to the whole population. Instead of sampling
individuals from each subgroup, you randomly select entire
subgroups.
If it is practically possible, you might include every individual
from each sampled cluster. If the clusters themselves are large,
you can also sample individuals from within each cluster using
one of the techniques above.
This method is good for dealing with large and dispersed
populations, but there is more risk of error in the sample, as
there could be substantial differences between clusters. It’s
difficult to guarantee that the sampled clusters are really
representative of the whole population.
Example
The company has offices in 10 cities across the country (all with
roughly the same number of employees in similar roles). You
don’t have the capacity to travel to every office to collect your
data, so you use random sampling to select 3 offices – these
are your clusters.
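As a rough sketch, selecting whole clusters could look like this in Python. The function name cluster_sample, the office names, and the figure of 100 employees per office are made up for illustration. Here every employee of a selected office is included in the sample.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and keep every member of each one."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # choose cluster names, not individuals
    return {name: clusters[name] for name in chosen}


# Hypothetical data: 10 offices with 100 employees each.
offices = {
    f"city_{i}": [f"city_{i}_emp_{j}" for j in range(100)]
    for i in range(1, 11)
}
selected = cluster_sample(offices, 3, seed=1)
print(sorted(selected))
```

If the selected offices were very large, one of the earlier techniques could be applied to the members of each selected cluster instead of keeping all of them.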
Dimensionality Reduction
Feature selection is critical to building a good model for several reasons. One is that feature
selection reduces the number of attributes, imposing a cutoff on how many can be considered
when building a model. Data almost always contains more information than is needed to build
the model, or the wrong kind of information.
For example, you might have a dataset with 500 columns that describe the characteristics of
customers. If the data in some of the columns is very sparse, you would gain very little
benefit from adding them to the model. If some of the columns duplicate each other, using
both columns could adversely affect the model.
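The two checks mentioned in the example, sparse columns and duplicate columns, can be sketched as a simple filter. This is only one possible screening step, not a complete feature selection method. The function name select_columns, the max_missing threshold, and the small customers table are assumptions made for illustration.

```python
def select_columns(table, max_missing=0.5):
    """Keep columns that are neither too sparse nor exact duplicates.

    table maps a column name to a list of values; None marks a missing value.
    """
    kept = {}
    seen = set()
    for name, values in table.items():
        missing = sum(v is None for v in values) / len(values)
        if missing > max_missing:
            continue  # too sparse to be useful
        key = tuple(values)
        if key in seen:
            continue  # identical to a column that was already kept
        seen.add(key)
        kept[name] = values
    return kept


# Hypothetical customer table with one duplicate and one empty column.
customers = {
    "age": [34, 51, 27, 45],
    "age_copy": [34, 51, 27, 45],
    "fax_number": [None, None, None, None],
    "income": [52000, 61000, None, 48000],
}
selected = select_columns(customers, max_missing=0.5)
print(list(selected))  # ['age', 'income']
```

Only exact duplicates are caught here. Columns that are strongly related but not identical would need a different check, such as a correlation measure.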