3Sampling_and_simulation
Data Sampling
• In most studies it is impractical to analyse a
whole population, so researchers use samples instead
• Data sampling is a statistical analysis technique used to
select, manipulate and analyze a representative subset
of data points to identify patterns and trends in the
larger data set
• It enables data scientists to work with a small,
manageable amount of data about a
statistical population to build and run analytical models
more quickly, while still producing accurate findings
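As a minimal sketch of the idea above, the following Python snippet draws a simple random sample and compares its mean with the mean of the full data set. The population here is synthetic and purely hypothetical (generated from a normal distribution with an arbitrary mean and spread); the sample size of 1,000 is likewise an assumption chosen for illustration.

```python
import random
import statistics

# Hypothetical population: 100,000 synthetic measurements (not real data).
rng = random.Random(42)  # fixed seed so the sketch is reproducible
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

# Simple random sample without replacement: every record is equally likely.
sample = rng.sample(population, k=1_000)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The sample uses 1% of the records, yet its mean should land close to the population mean, which is the trade-off the bullets describe.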
Advantages and challenges of data sampling
• Sampling is useful with data sets that are too large to
analyze efficiently in full, for example in big data
analytics applications or surveys.
• Identifying and analyzing a representative
sample is more efficient and cost-effective
than surveying the entire data set or
population.
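When a data set is too large to load in full, one possible approach (not named in these slides) is reservoir sampling, which keeps a uniform random sample while reading the records once. The sketch below is an illustration under assumptions: `record_stream` is a hypothetical stand-in for a large data source, and the sample size `k=100` is arbitrary.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream."""
    reservoir: List[T] = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                reservoir[slot] = item
    return reservoir


def record_stream(n: int) -> Iterator[int]:
    """Hypothetical stand-in for a data source too large to hold in memory."""
    for record_id in range(n):
        yield record_id


sample = reservoir_sample(record_stream(200_000), k=100, rng=random.Random(0))
print(len(sample), "records sampled")
```

Only `k` records are held in memory at any time, so the cost of sampling does not grow with the size of the source.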
Simulation
• Data simulation means taking a large amount of
data and using it to simulate or mirror real-world
conditions in order to:
• predict a future instance,
• determine the best course of action, or
• validate a model.
There are many different forms of simulation of data.
• Some seek to approximate known conditions to determine,
for example, the likelihood that oil, gas or mineral resources
might be present within geological strata.
• Others take large troves of data and run a variety of
scenarios to see how different approaches might work.
• You see this kind of simulation in climate projections.
• Modelers run different scenarios based on existing
emissions, increasing emissions and lowered emissions to
estimate temperature levels decades into the future.
• Purists think of data simulation in a far narrower way.
They use it as a methodology to validate a given model:
the model has to perform as expected under the data
simulation.
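The narrower, "purist" use can be sketched as follows: simulate data from a process whose parameters are known, fit the model to it, and check that the model recovers those parameters. Everything in this example is an assumption made for illustration: the linear process, the true slope and intercept, the noise level, and the 0.1 tolerance.

```python
import random


def simulate_linear_data(slope, intercept, noise_sd, n, rng):
    """Generate (x, y) pairs from y = slope * x + intercept + Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [slope * x + intercept + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares fit with a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Known "ground truth" used to generate the simulated data (hypothetical values).
TRUE_SLOPE, TRUE_INTERCEPT = 2.0, 1.0

xs, ys = simulate_linear_data(
    TRUE_SLOPE, TRUE_INTERCEPT, noise_sd=0.5, n=5_000, rng=random.Random(1)
)
est_slope, est_intercept = fit_line(xs, ys)

# The model "performs as expected" if it recovers the known parameters.
slope_ok = abs(est_slope - TRUE_SLOPE) < 0.1
intercept_ok = abs(est_intercept - TRUE_INTERCEPT) < 0.1
print(f"slope={est_slope:.3f} ok={slope_ok}, intercept={est_intercept:.3f} ok={intercept_ok}")
```

If the fitted values drifted far from the known ones, that would point to a problem in the model or the fitting code rather than in the data.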
• Data Simulation Features
• There are a number of different tools that
perform data simulation and their features vary
depending on the desired end result.
• In general, the features may include:
• 1. Graphical user interface: the tool must be
accessible to and usable by many people.
• The interface has to make it easy to formulate and
run various simulations.
• 2. Model building: Data simulation is all about modeling.
Model building should be easy to accomplish and should be
done rapidly, supported by adequate compute power and
memory.
• 3. Scalability: better and faster models have created
demand for ever larger data simulations. Thus, tools
have come onto the market that can scale massively to
accommodate large data sets and huge
research experiments or simulations.
• 4. Analytics integration: data simulation goes hand in glove
with analytics, so most tools offer analytics capabilities.
• 5. Data import and export: models require that data sets be
imported into the model and exported from it.
• Data Simulation Benefits
• The ability to model behaviour across complex
systems.
• Using simulated data to produce a model that is
relatively realistic.
• Visualization of trends and model results
• Comparison of different scenarios to determine the
ideal course of action or the consequences of an
intended course.
• Business insight to illuminate top management
strategy, and direct promotional, sales, or marketing
efforts.
• Data Simulation Uses
• Some of the main uses of simulations are to verify
analytical solutions, experiment with policies before
creating any physical implementation, and
understand the connection and relative
importance of the different variables composing a
system (e.g. by modifying input parameters and
examining the results).
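Modifying an input parameter and examining the results can be sketched with a toy system. The example below assumes a hypothetical single-server queue (customers arrive at random and are served one at a time) and varies only the service rate; the rates, the customer count, and the function name `mean_queue_wait` are all illustrative choices, not taken from any particular tool.

```python
import random


def mean_queue_wait(arrival_rate, service_rate, customers, seed):
    """Average time customers wait before service in a single-server queue."""
    rng = random.Random(seed)
    wait = 0.0        # waiting time of the current customer
    total_wait = 0.0
    for _ in range(customers):
        total_wait += wait
        service_time = rng.expovariate(service_rate)
        gap_to_next_arrival = rng.expovariate(arrival_rate)
        # The next customer waits for whatever work is left when they arrive.
        wait = max(0.0, wait + service_time - gap_to_next_arrival)
    return total_wait / customers


# Vary one input parameter (service rate) and hold everything else fixed,
# including the random seed, so differences come from the parameter alone.
results = {
    rate: mean_queue_wait(arrival_rate=1.0, service_rate=rate, customers=50_000, seed=7)
    for rate in (1.25, 1.5, 2.0)
}
for rate, wait in results.items():
    print(f"service rate {rate}: mean wait {wait:.2f}")
```

Reusing the same seed for every run is a deliberate design choice: it removes run-to-run noise, so the relative importance of the parameter is easier to read off.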
• Application development: in another variation, the data
for subsequent analysis does not exist yet, so data is
created and simulated to feed downstream apps and tools
while they are being developed.
• Oil and gas: The oil and gas industry performs large-scale
simulations of data sets.
• Over the decades, the industry has amassed databases of
rock formations using older methods.
• New simulation and modelling tools can go through this
data and simulate it against modern 3D scans, so that the
analysis can be done without sending out another drilling
team. In fact, this approach is being used to avoid the
dozens of drilling failures that are the norm when trying
to find oil or gas reserves.
• Note on terminology: a downstream system is a system that
receives data from the Collaboration Server
system. Data can be loaded into the
Collaboration Server system at regular
intervals (weekly, daily, or hourly) from an
upstream system.
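For the application development use described above, a minimal sketch of creating simulated records for a downstream app might look like this. The `Order` record, its fields, the value ranges, and the `total_by_status` function (standing in for downstream code still under development) are all hypothetical.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Order:
    """Hypothetical record shape for an app whose real data does not exist yet."""
    order_id: int
    customer_id: int
    amount: float
    status: str


def generate_orders(n: int, seed: int = 0) -> List[Order]:
    """Create n simulated orders; the seed makes the data set reproducible."""
    rng = random.Random(seed)
    statuses = ["pending", "shipped", "delivered", "cancelled"]
    return [
        Order(
            order_id=i,
            customer_id=rng.randint(1, 500),
            amount=round(rng.uniform(5.0, 500.0), 2),
            status=rng.choice(statuses),
        )
        for i in range(1, n + 1)
    ]


def total_by_status(orders: List[Order]) -> Dict[str, float]:
    """Example downstream code that can be developed against simulated data."""
    totals: Dict[str, float] = defaultdict(float)
    for order in orders:
        totals[order.status] += order.amount
    return dict(totals)


orders = generate_orders(1_000, seed=3)
print(total_by_status(orders))
```

Because the generator is seeded, developers and tests see the same simulated data on every run until real data becomes available.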
• Digital twins: These are data simulations of actual physical
equipment, such as a gas turbine, a power plant or other
industrial facility.
• The idea is to gather data from the physical system and
create a digital copy or twin of the real thing.
• Engineers can then run simulations on the twin, such as
running it hotter or faster or changing certain configurations,
to figure out how to reduce maintenance costs, increase
output, or explore other scenarios.
• All of this is done without actually doing anything to the
physical equipment.
• The twin helps those involved to better know how to
proceed in the real world.
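A very small sketch of the digital twin idea: copy the latest measured state of the equipment into a model, then run "what if" scenarios on the model only. The `TurbineTwin` class, its first-order thermal behaviour, and every coefficient and reading below are invented for illustration and do not describe any real turbine.

```python
from dataclasses import dataclass


@dataclass
class TurbineTwin:
    """Toy thermal model of a turbine; all coefficients are hypothetical."""
    temperature_c: float            # state copied from the latest sensor reading
    ambient_c: float = 25.0
    heating_per_load: float = 4.0   # degrees C gained per second per unit of load
    cooling_rate: float = 0.05      # share of excess heat lost per second

    def step(self, load: float, dt: float = 1.0) -> None:
        """Advance the twin by dt seconds at the given load."""
        heating = self.heating_per_load * load
        cooling = self.cooling_rate * (self.temperature_c - self.ambient_c)
        self.temperature_c += dt * (heating - cooling)


def run_scenario(start_temperature_c: float, load: float, seconds: int) -> float:
    """Run one scenario on a fresh twin and return the final temperature."""
    twin = TurbineTwin(temperature_c=start_temperature_c)
    for _ in range(seconds):
        twin.step(load)
    return twin.temperature_c


measured_temperature_c = 300.0  # hypothetical reading from the physical system

# Compare the current operating point with a "run it hotter" scenario.
normal = run_scenario(measured_temperature_c, load=4.0, seconds=600)
hotter = run_scenario(measured_temperature_c, load=5.0, seconds=600)
print(f"normal load: {normal:.1f} C, higher load: {hotter:.1f} C")
```

Both scenarios start from the same measured state, and neither touches the physical equipment; only the twin is driven harder.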
Data Simulation Software:
Simulation tools are available for many verticals, from vendors including:
• SAS
• Siemens EDA
• Rockwell Automation
• Hexagon Manufacturing Intelligence
• 4X Diagnostics
• Aquaveo
• Ansys
• MathWorks
• Dassault Systèmes
• Autodesk
Modelling and Simulations in Data Science
One of the main limitations of the current state of
Machine Learning and Deep Learning is the
constant need for new data.
Example: let’s consider a thought experiment. We
are working as a Data Scientist for a local
authority and we are asked to find a way
to optimise how the evacuation plan should work
in case of a natural disaster (e.g. volcanic
eruption, earthquake, etc.).
In this situation, because natural disasters don’t
tend to happen too frequently (fortunately!), we
don’t have any data currently available.
There are two main types of programmable simulation
models: