Sampling and Simulation

Data Sampling
• In most studies it is impractical to analyze a whole population, so researchers use samples instead.
• Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative subset of data points in order to identify patterns and trends in the larger data set.
• It enables data scientists to work with a small, manageable amount of data about a statistical population, so they can build and run analytical models more quickly while still producing accurate findings.
Data Sampling
Advantages and challenges of data sampling
• Sampling is useful with data sets that are too large to analyze efficiently in full, for example in big data analytics applications or surveys.
• Identifying and analyzing a representative sample is more efficient and cost-effective than surveying the entirety of the data or population.
Data Sampling
Advantages and challenges of data sampling

• An important consideration is the size of the required data sample and the possibility of introducing a sampling error.
• In some cases, a small sample can reveal the most important information about a data set.
• In others, using a larger sample can increase the likelihood of accurately representing the data as a whole.
• Population: Based on the scope of the study, the population includes all possible outcomes.
• Sampling frame: Contains the accessible target population under study. We derive a sample from the sampling frame.
• Sample: A subset of the population, selected through various techniques.
Steps to follow to select a sample
• Establish the survey's objectives: the initial users and uses of the data, and the type of data needed
• Define the target population: the nature of the units, their geographic location, and the period covered by the survey
• Decide on the data to be collected
• Set the level of precision: there is a level of uncertainty associated with estimates from a sample, known as the sampling error
• 1. Define the target population: Based on the objective of the study, clearly scope the target population. For instance, if we are studying a regional election, the target population would be all people domiciled in the region who are eligible to vote.
• 2. Define the sampling frame: The sampling frame is the reachable subset of the overall population. In the above example, the sampling frame would consist of all the people from the population who are in the region and can participate in the study.
• 3. Select a sampling technique: the main techniques are described in the sections that follow.
• 4. Determine the sample size: To ensure that we have an unbiased sample, free from errors, that closely represents the whole population, our sample needs to be of an appropriate size. This depends on factors like the complexity of the population under study, the researcher's resources and associated constraints.
• 5. Collect the data: We should try to ensure that we don't have too many empty fields in our data, and document the reasons in cases where data is missing. This helps later in analysis, as it gives us perspective on how to treat the missing data.
• 6. Assess the response rate: It is important to closely monitor the response rate so that you can make timely changes to your collection approach and reach your target sample size.
The sample design

• 1. Determine what the survey population will be (e.g. students, men aged 20 to 35, newborn babies, etc.).
• 2. Choose the most appropriate survey time frame.
• 3. Define the survey units.
• 4. Establish the sample size (e.g. a sample of 100 from a population of 1,000).
• 5. Select a sampling method.
Types of sampling methods
• Probability sampling: every unit has a probability of being selected that can be quantified.
• Each element of the population has a known and non-zero probability of being in the sample. This method is usually preferable, since its properties, such as bias and sampling error, are usually known.
• Sampling based on probability uses random numbers.
Data Sampling
• Non-probability sampling: the approach depends on the data set and situation.
• Some elements of the population may not be selected, and there is a greater risk of the sample being non-representative of the population as a whole.
• However, probability sampling is sometimes not possible, or it can simply be cheaper to sample non-randomly.
Data Sampling
• Sampling can also be based on non-probability, in which a data sample is determined and extracted based on the judgment of the analyst.
• Because inclusion is determined by the analyst, it can be more difficult to know whether the sample accurately represents the larger population than when probability sampling is used.
• Probability sampling is more complex, more time-consuming and more costly than non-probability sampling.
• However, because each unit's selection probability can be calculated, reliable estimates can be produced.
Data Sampling: Probability sampling
• Simple random sampling (SRS)
• Stratified sampling
• Cluster sampling
• Multistage sampling
• Systematic sampling
Data Sampling: Probability sampling

• 1. Simple random sampling: Software is used to randomly select samples from the whole population.
• SRS can be done with or without replacement.
• With SRS with replacement, the same unit (e.g. a sampled telephone entry) may be selected twice or more.
• SRS without replacement is more convenient and gives more precise results.
Simple random sampling without replacement (SRSWOR)
This is probably the most intuitive sampling method:
if you have a population of 1,000 individuals and you can only analyze 100, you randomly select one individual at a time, without putting anyone back, until you have your sample of 100.
This gives each individual the same probability of being in the sample.
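As a minimal sketch (the population and sample sizes here are hypothetical), Python's standard library can draw both kinds of simple random sample:

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

population = list(range(1000))   # 1,000 hypothetical individuals

# SRS without replacement (SRSWOR): each unit appears at most once
srswor = random.sample(population, k=100)

# SRS with replacement (SRSWR): the same unit may be drawn twice or more
srswr = random.choices(population, k=100)

print(len(set(srswor)))  # always 100 distinct units
print(len(set(srswr)))   # usually fewer than 100 distinct units
```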
• Advantages:
• It does not require any information on the
survey frame other than the complete list of
units of the survey population along with
contact information.
• SRS is a simple method and the theory behind
it is well established, standard formulas exist
to determine the sample size, the estimates
and so on, and these formulas are easy to use.
• Disadvantages:
• It requires a complete list of all units of the population.
• If collection has to be made in person, SRS could give a sample that is too spread out across multiple regions, which could increase the cost and duration of the survey.
Data Sampling
• SRSWOR is an unbiased sampling design, meaning that we expect the parameters calculated from the sample to be unbiased.
• However, you can still get a really bad sample purely out of bad luck, with results that are not at all representative of your population.
• In this case, stratifying your sample might help.
• In practice, it is not that simple to get an actual simple random sample.
• For election polls, for instance, how do you do it? You can't actually have a list of every person in the country to randomly select from. You can, for instance, have a list of all the personal phone numbers available and select from there.
• The point is that you probably need a list of your whole population to do this: if you are randomly interviewing people in the streets, it is not completely random, because depending on which location you choose to go to, your sample might yield different results.
2. Data Sampling: Stratified sampling
• Stratified sampling: Subsets of the data sets or population
are created based on a common factor, and samples are
randomly collected from each subgroup.
• The population is divided into homogeneous, mutually exclusive groups called strata, and then independent samples are selected from each stratum.
• Any of the sampling methods can be used to sample
within each stratum.
• The sampling method can vary from one stratum to
another.
• A population can be stratified by any variable for which a
value is available for all units on the sampling frame prior
to sampling (e.g. age, sex, province of residence, income).
Data Sampling
• Stratified sampling
• Under certain conditions, it might actually be useful to stratify
your population, according to some features.
• Let’s say you want to do a survey with your company’s 1000
employees to see how happy they are at their jobs, but you only
have the time to interview 100 of them, so you take a sample.
• With SRSWOR, you could risk getting 50 people from accounting and no data scientists.
• This would make you think your company's employees are much unhappier than they actually are, since data scientists are the happiest people at their jobs.
• In this case, what you can do, is split your population into
departments, and then sample randomly from each department,
taking samples that are proportional to the department size.
• If you create strata within which units share similar characteristics and are considerably different from units in other strata, then you only need a small sample from each stratum to get a precise estimate (of total income, say) for that stratum.
• You can then combine these estimates to get a precise estimate for the whole population.
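As a rough sketch of proportional stratified sampling (the employee table and department sizes below are invented for illustration), using pandas:

```python
import pandas as pd

# Hypothetical company of 1,000 employees split across three departments
df = pd.DataFrame({
    "employee_id": range(1000),
    "department": ["accounting"] * 500 + ["data_science"] * 300 + ["hr"] * 200,
})

# Draw 10% at random *within* each stratum, so every department is
# represented in proportion to its size
sample = df.groupby("department").sample(frac=0.10, random_state=42)

print(sample["department"].value_counts())  # 50 / 30 / 20
```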
Data Sampling
The stratified sampling method can be really useful when both of these conditions hold:
• Variability within strata is small (you know,
from previous studies, that people within the
same department tend to feel more or less
the same in terms of happiness at work)
• Variability between strata is big (your level of
happiness at work depends a lot on your
department)
Data Sampling
• In practice, stratified sampling can be expensive and complicated to implement, since it needs prior information about your population.
Data Sampling: Probability sampling
• Simple random sampling (SRS)
• Stratified sampling
• Cluster sampling: The larger data set is
divided into subsets (clusters) based on a
defined factor, then a random sampling of
clusters is analyzed.
Data Sampling
• Multistage sampling: A more complicated form of cluster
sampling, this method also involves dividing the larger
population into a number of clusters. Second-stage clusters
are then broken out based on a secondary factor, and those
clusters are then sampled and analyzed. This staging could
continue as multiple subsets are identified, clustered and
analyzed.
• Systematic sampling: A sample is created by setting an interval at which to extract data from the larger population.
• We essentially select every k-th element, where k is known as the sampling interval. For example, selecting every 10th row in a spreadsheet of 200 items creates a sample of 20 rows to analyze.
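A minimal sketch of systematic sampling in Python (the 200-row spreadsheet is a stand-in, matching the example above):

```python
import random

def systematic_sample(rows, sample_size):
    """Select every k-th row after a random start within the first interval."""
    k = len(rows) // sample_size        # sampling interval (here: 200 // 20 = 10)
    start = random.randrange(k)         # random start avoids a fixed-origin bias
    return rows[start::k][:sample_size]

rows = list(range(200))                 # stand-in for 200 spreadsheet rows
print(systematic_sample(rows, 20))      # every 10th row -> 20 rows
```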
Difference
• In stratified random sampling, we first use common characteristics to divide the whole population into strata and then select elements from each stratum. In cluster sampling, we divide the whole population into clusters and then randomly pick entire clusters to form the sample, rather than picking elements within every cluster.
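The contrast can be made concrete with a small sketch (the districts and unit names are invented for illustration):

```python
import random

random.seed(0)

# Hypothetical population grouped into 10 clusters, e.g. city districts
clusters = {f"district_{i}": [f"unit_{i}_{j}" for j in range(100)]
            for i in range(10)}

# Cluster sampling: randomly pick whole clusters, then keep *every*
# unit inside them (stratified sampling would instead draw a few
# units from every district)
chosen = random.sample(list(clusters), k=3)
sample = [unit for name in chosen for unit in clusters[name]]

print(chosen)        # 3 whole districts
print(len(sample))   # 300 units, all from those districts
```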
Data Sampling
• Poisson sampling
• In a Poisson sampling design, every element of your population goes through a Bernoulli trial to decide whether it will be in the sample or not.
• If the probability is the same for every element in the population, this is a special case called Bernoulli sampling.
• It also depends on having a list of every element in your population.
• Let's say you have a list of all the companies in your country, and you want to survey them. You could assign a probability p for each one of them to be in your sample, or even a different probability for each, depending on their size for instance (you might want to give a greater weight to bigger companies).
• Note: in this case, you can't know the exact size of your sample beforehand; this is what we call a random-size sampling design.
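A small sketch of Poisson sampling under these assumptions (the company register and the size-based inclusion rule are invented):

```python
import random

random.seed(1)

# Hypothetical register of companies with their headcounts
companies = [("Acme", 40), ("BigCo", 2500), ("Cog Ltd", 120), ("DataWorks", 900)]

# Poisson sampling: each company gets its own Bernoulli trial, with an
# inclusion probability that grows with company size (capped at 1.0)
sample = [
    (name, size)
    for name, size in companies
    if random.random() < min(1.0, size / 1000)
]

# Note: the sample size itself is random; rerunning with another seed
# can return a different number of companies
print(sample)
```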
Data Sampling
Non-probability sampling
• Volunteer sampling
• It is a widely used method: it's what you get when you post a survey form in a Facebook group and ask people to fill it in for you. It's easy and cheap, but it can lead to a lot of bias, since you are actually sampling people who are on Facebook, saw your post and, most importantly, are willing to fill in that form for you. This might oversample people who like you, or people who have enough free time to fill in the form.
• It can be used as a first validation step to see if there might
be an interest in pursuing more expensive methods later
on.
Data Sampling
• Non-probability sampling
• Judgement sampling
• In this sampling design, you will choose your
sample based on your existing domain knowledge.
If you want to survey potential customers for a
new coding online course, you might already have
an idea of the type of people who would like it,
and start looking for them on LinkedIn.
• However, this method is prone to your own biases.
Quota sampling

• In quota sampling, we divide the population into quotas that represent the population.
• This might look similar to random sampling, but the important difference is that we first divide the population into fixed quotas. From these fixed quotas, we select the sample.
• A quota could be something like all males above 20, or children between 12 and 18 years of age.
• Using quota sampling saves time and resources
and is a quick way to get the study started
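A toy sketch of quota sampling (the respondent stream and quota sizes are hypothetical): respondents are accepted in arrival order until each quota fills, which is exactly what makes the selection non-random:

```python
# Hypothetical respondents arriving in order, each tagged with a group
respondents = [
    ("r1", "males_over_20"), ("r2", "children_12_18"), ("r3", "males_over_20"),
    ("r4", "males_over_20"), ("r5", "children_12_18"), ("r6", "males_over_20"),
]

quotas = {"males_over_20": 2, "children_12_18": 1}  # fixed quotas per group
counts = {group: 0 for group in quotas}
sample = []

for respondent, group in respondents:
    if counts.get(group, 0) < quotas.get(group, 0):
        sample.append(respondent)   # accept until this group's quota fills
        counts[group] += 1

print(sample)  # ['r1', 'r2', 'r3']: filled in arrival order, not at random
```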
Snowball sampling
• You first select, at random, members for the
sample. Suppose you selected 3 members. Now,
these three will suggest more names for the
study, and this creates a chain effect.
Snowball sampling is useful in cases where it is
difficult to locate people, or they do not wish to
be identified.
For instance, in medical research where you are
studying a rare disease, you might find that
snowball sampling is the only way you can get to
the desired sample size.
• Linear snowball sampling
• The chain grows linearly: each member in the sample refers one more member.
• Exponential non-discriminative snowball sampling
• One-to-many relationships: each member in the study refers multiple members.
• Exponential discriminative snowball sampling
• While we request each member to provide multiple referrals, we select only one of them and nullify the remaining referrals.
• Convenience sampling: based on convenience; used in the initial phases of a survey, where the researcher intends to gain quick feedback on the design of the survey.
Data Sampling
• Once you have sampled your data, you will need to apply some feature engineering to make sense of it.
https://www.datamation.com/big-data/data-simulation/

Simulation
• The basic definition of data simulation is taking a large amount of data and using it to simulate or mirror real-world conditions in order to:
• predict a future instance,
• determine the best course of action, or
• validate a model.
Simulation
There are many different forms of simulation of data.
• Some seek to approximate known conditions to determine,
for example, the likelihood that oil, gas or mineral resources
might be present within geological strata.
• Others take large troves of data and run a variety of
scenarios to see how different approaches might work.
• You see this kind of simulation in climate projections.
• Modelers run different scenarios based on existing
emissions, increasing emissions and lowered emissions to
estimate temperature levels decades into the future.
• The purists think of data simulation in a far narrower way.
They use it as a methodology to prove out a given model.
The model has to perform as expected under the data
simulation.
Simulation
• Data Simulation Features
• There are a number of different tools that perform data simulation, and their features vary depending on the desired end result.
• In general, the features may include:
• 1. Graphical user interface: the tool must be accessible to and usable by many people, so the interface has to make it easy to formulate and run various simulations.
Simulation
• 2. Model building: Data simulation is all about modeling. Model building should be easy to accomplish and rapid, supported by adequate compute power and memory.
• 3. Scalability: Better and faster models have created demand to execute ever larger data simulations. Thus, tools have come onto the market that can massively scale to accommodate large data sets and huge research experiments or simulations.
• 4. Analytics integration: Data simulation goes hand in glove with analytics, and most tools offer this capability.
• 5. Data import and export: Models require that data sets be imported into the model and exported from it.
Simulation
• Data Simulation Benefits
• The ability to model behaviour across complex
systems.
• Using simulated data to produce a model that is
relatively realistic.
• Visualization of trends and model results
• Comparison of different scenarios to determine the
ideal course of action or the consequences of an
intended course.
• Business insight to illuminate top management
strategy, and direct promotional, sales, or marketing
efforts.
Simulation
• Data Simulation Uses
• Some of the main uses of simulation are to verify analytical solutions, to experiment with policies before creating any physical implementation, and to understand the connections and relative importance of the different variables composing a system
• (e.g. by modifying input parameters and examining the results).
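As a toy example of the first use, verifying an analytical solution (the dice setup is invented for illustration):

```python
import random

random.seed(7)

# Analytically, the expected sum of two fair dice is exactly 7.
# A simulation should converge towards that value as runs increase.
def roll_two_dice():
    return random.randint(1, 6) + random.randint(1, 6)

for runs in (100, 10_000, 1_000_000):    # modify the input parameter...
    mean = sum(roll_two_dice() for _ in range(runs)) / runs
    print(runs, round(mean, 3))          # ...and examine the results
```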

Simulation
Data Simulation Uses
• Application development: Another variation is where data
for subsequent analysis does not exist yet, so for
downstream app tool development, data is created and
simulated to feed into the apps as they are being
developed.
• Oil and gas: The oil and gas industry performs large-scale
simulations of data sets.
• Over the decades, the industry has amassed databases of rock formations built using older methods.
• New simulation and modelling tools can go through this data and simulate it against modern 3D scans, so that exploration can proceed without sending out another drilling team. In fact, simulation is being used to avoid what used to be the norm of dozens of drilling failures in trying to find oil or gas reserves.
• A downstream system is a system that
receives data from the Collaboration Server
system. You can load data into the
Collaboration Server system at regular
intervals (weekly, daily, or hourly) from an
upstream system
Simulation
Data Simulation Uses
• Digital twins: These are data simulations of actual physical
equipment, such as a gas turbine, a power plant or other
industrial facility.
• The idea is to gather data from the physical system and
create a digital copy or twin of the real thing.
• Engineers can then run simulations on the twin based on
running it hotter, faster, changing certain configurations,
figuring out how to reduce maintenance costs, increase
output or other scenarios.
• And all without actually doing anything to the physical
equipment.
• The twin helps those involved to better know how to
proceed in the real world.
Simulation
Data Simulation Software:
There are various simulation tools for many verticals, including:
• SAS
• Siemens EDA
• Rockwell Automation
• Hexagon Manufacturing Intelligence
• 4X Diagnostics
• Aquaveo
• Ansys
• MathWorks
• Dassault Systemes
• AutoDesk
Simulation
Modelling and Simulations in Data Science
One of the main limitations of the current state of Machine Learning and Deep Learning is the constant need for new data.
Example: let's consider a thought experiment. We are working as a Data Scientist for a local authority, and we are asked to find a way to optimise how the evacuation plan should work in case of a natural disaster (e.g. volcanic eruption, earthquake, etc.).
In this situation, because natural disasters don't tend to happen too frequently (fortunately!), we don't have any data currently available.
Simulation
Modelling and Simulations in Data Science
There are two main types of programmable simulation
models:

• Mathematical Models: make use of mathematical symbols and relationships in order to summarise processes. Compartmental models in epidemiology are a typical example of mathematical models (e.g. SIR, SEIR, etc.); a minimal SIR sketch follows after this list.
• Process Models: are based on a list of steps handcrafted by the designer in order to represent an environment (e.g. Agent-Based Modelling).
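As promised above, here is a minimal discrete-time SIR sketch; the parameter values (beta, gamma, population size) are illustrative, not calibrated to any real disease:

```python
def sir(population=1000, infected=1, beta=0.3, gamma=0.1, days=160):
    """Discrete-time SIR: S -> I at rate beta, I -> R at rate gamma."""
    s, i, r = population - infected, infected, 0
    history = []
    for _ in range(days):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history

# Rerunning with different beta/gamma values simulates different
# intervention scenarios; the generated curves are data we never
# had to observe in the real world.
peak = max(i for _, i, _ in sir())
print(round(peak))   # height of the epidemic peak
```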
Simulation
• Modelling and Simulations are used in many different fields such as finance (e.g. Monte Carlo simulations for portfolio optimization), medical/military training, epidemiology and threat modelling.
• (Epidemiology is the study of how often diseases occur in different groups of people and why.)
• Once we have run many different simulations and tested all the different possible scenarios, we can then make use of the generated data in order to train our Machine Learning model of choice to make predictions in the real world.
Simulation and Predictive Analytics
https://www.anylogic.com/blog/predictive-analytics-using-simulation-models/

• Simulation and predictive analytics are related because both require models.
• Simulations model the behaviour of a system, while predictive analytics uses models to gain insights into the future.

• Comparison of the decision-tree vs machine-learning approach:
• 1. In predictive analytics, it is possible to model straightforward systems with decision trees. For large data sets and complex systems, regression or neural-network-based machine learning may be better options.
• 2. Decision trees will indicate whether something will or will not happen depending on the inputs; systems based on machine learning can specify a value, such as when to schedule maintenance.
• 3. Another difference is their data requirements: modelers can construct decision trees from limited historical data, while machine learning requires large amounts of training data. For machine learning, this data usually comes from historical data sets or continuous feedback, but it may also be synthesized using a simulation model, as sketched below.
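A rough sketch of that last idea, with every name and constant invented: a toy wear simulation generates (hours, failed) pairs that can then train a standard classifier in place of scarce historical logs:

```python
import random
from sklearn.linear_model import LogisticRegression

random.seed(3)

# Toy simulation: wear grows with operating hours plus noise, and the
# machine "fails" past a threshold (all constants are hypothetical)
def simulate_run(hours):
    wear = 0.01 * hours + random.gauss(0, 0.5)
    return hours, int(wear > 5.0)

runs = [simulate_run(random.uniform(0, 1000)) for _ in range(5000)]
X = [[hours] for hours, _ in runs]        # feature: operating hours
y = [failed for _, failed in runs]        # label: failed or not

# The synthetic pairs stand in for historical maintenance data
model = LogisticRegression().fit(X, y)
print(model.predict([[200.0], [800.0]]))  # low-wear vs high-wear regime
```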
