Input Modeling For Simulation
Input Models
Input models represent the uncertainty in a stochastic simulation.
Random variates are generated based on the input models.
Simulation outputs are determined by the input models.
The fundamental requirements for an input model are:
It must be capable of representing the physical realities of the process.
It must be easily tuned to the situation at hand.
It must be amenable to random-variate generation.
There is no true model for any stochastic input. The best that we can hope is to obtain an approximation that yields useful results.
A key distinction in input modeling problems is the presence or absence of data:
When we have data, we fit a model to the data. Good software is available for this.
When no data are available, we have to creatively use what we can get to construct an input model.
Outline
Input modeling with data
Data collection
Select candidate distributions
Fitting and checking
Arena Input Analyzer
If few data points are available: combine adjacent cells to eliminate the ragged appearance of the histogram
binomial: Models the number of successes in n trials, when the trials are independent with common success probability p. Example: the number of defective components found in a lot of n components.
negative binomial: Models the number of trials required to achieve k successes. Example: the number of components that we must inspect to find 4 defective components.
Poisson: Models the number of independent events that occur in a fixed amount of time or space. Example: the number of customers that arrive at a store during 1 hour, or the number of defects found in 30 square meters of sheet metal.
normal: Models the distribution of a process that can be thought of as the sum of a number of component processes. Example: the time to assemble a product, which is the sum of the times required for each assembly operation.
lognormal: Models the distribution of a process that can be thought of as the product of a number of component processes. Example: the rate of return on an investment, when interest is compounded, is the product of the returns for a number of periods. Also widely used to model stock prices.
exponential: Models the time between independent events, or a process time which is memoryless. Example: the time to failure for a system that has a constant failure rate over time. Note: if the time between events is exponential, then the number of events in a fixed interval is Poisson.
Erlang: The sum of k independent, identically distributed exponential random variables. A special case of the gamma...
gamma: An extremely flexible distribution used to model nonnegative random variables.
beta: An extremely flexible distribution used to model bounded (fixed upper and lower limits) random variables.
Weibull: Models the time to failure (minimum of a number of possible causes); can model an increasing or decreasing failure rate (hazard). Example: the time to failure for a disk drive.
discrete or continuous uniform: Models complete uncertainty, since all outcomes are equally likely.
triangular: Models a process when only the minimum, most likely and maximum values of the distribution are known. Example: the minimum, most likely and maximum inflation rate we will have this year.
empirical: Reuses the data themselves by making each observed value equally likely. Can be interpolated to obtain a continuous distribution.
Fitting
Determine the unknown parameters of the distribution. For example, if you believe the data come from a normal distribution, then you need to estimate its mean and variance. Common methods for fitting distributions are maximum likelihood, method of moments, and least squares.
While the method matters, the variability in the data often overwhelms the differences between the estimators (see Section 9.3). Remember: there is no true distribution just waiting to be found!
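A minimal sketch of parameter fitting, assuming Python with only the standard library. The sample values and the function names are hypothetical, chosen for illustration only. For a normal model the maximum likelihood estimates are the sample mean and the variance with divisor n; for an exponential model the maximum likelihood estimate of the rate is 1 / (sample mean), which coincides with the method-of-moments estimate.

```python
import statistics

def fit_normal_mle(data):
    """Maximum likelihood estimates (mean, variance) for a normal model."""
    mu = statistics.fmean(data)
    # pvariance divides by n (the MLE); statistics.variance would divide by n - 1.
    var = statistics.pvariance(data, mu)
    return mu, var

def fit_exponential_mle(data):
    """Maximum likelihood estimate of the exponential rate: 1 / sample mean."""
    return 1.0 / statistics.fmean(data)

# Hypothetical service times in minutes (illustrative only, not from the slides).
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_hat, var_hat = fit_normal_mle(sample)
rate_hat = fit_exponential_mle(sample)
print(mu_hat, var_hat, rate_hat)
```

With so few points the estimates are very noisy, which is the practical meaning of the remark that data variability often matters more than the choice of estimator.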
Graphic analysis
Histogram with the fitted line
q-q plot
Goodness-of-fit Tests
In the test:
H0: the chosen distribution fits the data
H1: the chosen distribution does not fit the data
The p-value of a test is the Type I error level (significance level) at which we would just reject H0 for the given data.
If the significance level is less than the p-value, we do not reject H0; otherwise, we reject H0.
Thus, a large p-value (> 0.10) gives no evidence against H0 that the distribution fits.
If there are few data, the goodness-of-fit test is likely to accept any distribution. Why?
If there are lots of data, the goodness-of-fit test is likely to reject any distribution. Why?
Chi-squared Test
A histogram-based test.
[Figure: histogram of observed frequencies with the fitted distribution overlaid; x from 0 to 40.]
Test statistic:
$$\chi_0^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
where O_i is the observed frequency in the ith interval and the expected frequency is E_i = n p_i, with p_i the theoretical probability of the ith interval.
Chi-square Test
Arrange the n observations into k cells. The test statistic is:
$$\chi_0^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
which approximately follows the chi-square distribution with k-s-1 degrees of freedom, where s is the number of parameters of the hypothesized distribution estimated from the sample statistics.
Valid only for large sample sizes.
Each cell should have at least 5 observations, for both Oi and Ei.
The result of the test depends on the grouping of the data.
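The cell-size rule and the test statistic can be sketched as follows, assuming Python. The function names, the greedy left-to-right merging strategy, and the counts are all hypothetical choices made for illustration. The sketch only enforces the minimum on the expected frequencies Ei; other groupings are possible and will give different results.

```python
def merge_small_cells(observed, expected, min_expected=5.0):
    """Combine adjacent cells, left to right, until each merged cell has
    expected frequency >= min_expected. Any leftover cells at the right end
    are folded into the last merged cell."""
    merged_o, merged_e = [], []
    acc_o = acc_e = 0.0
    for o, e in zip(observed, expected):
        acc_o += o
        acc_e += e
        if acc_e >= min_expected:
            merged_o.append(acc_o)
            merged_e.append(acc_e)
            acc_o = acc_e = 0.0
    if acc_e > 0.0:
        if merged_e:
            merged_o[-1] += acc_o
            merged_e[-1] += acc_e
        else:
            merged_o.append(acc_o)
            merged_e.append(acc_e)
    return merged_o, merged_e

def chi_square_statistic(observed, expected):
    """Sum over cells of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts (illustrative only).
obs = [1, 6, 12, 9, 3, 1]
exp = [2.0, 7.0, 11.0, 8.0, 3.0, 1.0]
obs_m, exp_m = merge_small_cells(obs, exp)
stat = chi_square_statistic(obs_m, exp_m)
k = len(obs_m)  # degrees of freedom are k - s - 1 for s estimated parameters
print(obs_m, exp_m, stat, k)
```

Changing `min_expected` or the merging order changes k and the statistic, which illustrates why the result of the test depends on the grouping.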
Vehicle arrival example (page 9), sample mean 3.64
H0: Data are Poisson distributed with mean 3.64
H1: Data are not Poisson distributed with mean 3.64

Expected frequencies come from the Poisson probability mass function with \alpha = 3.64 and n = 100:
$$E_i = n\,p(x_i) = n\,\frac{e^{-\alpha}\alpha^{x_i}}{x_i!}$$

xi       Oi      Ei       (Oi - Ei)^2 / Ei
0        12      2.6      7.87 (cells 0 and 1 combined)
1        10      9.6
2        19      17.4     0.15
3        17      21.1     0.80
4        10      19.2     4.41
5         8      14.0     2.57
6         7       8.5     0.26
7         5       4.4     11.62 (cells 7 and above combined)
8         5       2.0
9         3       0.8
10        3       0.3
>= 11     1       0.1
Total   100     100.0     27.68

The degrees of freedom are k-s-1 = 7-1-1 = 5, so the p-value is 0.00004. What is your conclusion?
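The calculation can be checked with a short sketch, assuming Python with only the standard library. The helper names are hypothetical. The grouped observed counts are taken so that they reproduce the listed contributions and sum to n = 100 (10 observations at x = 4 and 8 at x = 5). Because the sketch uses unrounded Poisson probabilities and an open-ended last cell, its statistic is close to, but not exactly, 27.68. The tail probability uses the finite-sum form of the chi-square distribution that holds for odd degrees of freedom.

```python
import math

def poisson_pmf(x, mean):
    return math.exp(-mean) * mean ** x / math.factorial(x)

def chi_square_sf_odd_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with odd df."""
    chi = math.sqrt(x)
    phi = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)  # standard normal pdf at chi
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term              # adds chi^(2r-1) / (1*3*...*(2r-1))
        term *= x / (2 * r + 1)
    return math.erfc(chi / math.sqrt(2.0)) + 2.0 * phi * total

n, mean = 100, 3.64
# Cells after combining: {0,1}, 2, 3, 4, 5, 6, {7 and above}; None marks the open end.
cells = [(0, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, None)]
observed = [22, 19, 17, 10, 8, 7, 17]

expected = []
for lo, hi in cells:
    if hi is None:
        p = 1.0 - sum(poisson_pmf(x, mean) for x in range(lo))
    else:
        p = sum(poisson_pmf(x, mean) for x in range(lo, hi + 1))
    expected.append(n * p)

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(cells) - 1 - 1            # k - s - 1, with s = 1 estimated parameter
p_value = chi_square_sf_odd_df(chi2, df)
print(round(chi2, 2), df, p_value)
```

A p-value this small means the Poisson hypothesis would be rejected at any commonly used significance level.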
Kolmogorov-Smirnov Test
[Figure: cumulative distribution plot; cumulative probability from 0 to 1 on the vertical axis, x from 0 to 40 on the horizontal axis.]
Empirical Distribution
If we have n observations x1, x2, ..., xn, then Sn(x) = (number of x1, x2, ..., xn that are <= x) / n
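[Figure: step plot of the empirical distribution Sn(x), rising to 1 through the sorted observations x1, x2, x3, x4.]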
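The empirical distribution can be sketched in a few lines, assuming Python. The comparison below uses the usual K-S statistic D = max over x of |Sn(x) - F(x)|, evaluated at the jump points of Sn; this definition is the standard one and is stated here as an assumption. The data and the hypothesized exponential distribution with mean 2 are hypothetical.

```python
import math

def empirical_cdf(data, x):
    """S_n(x): fraction of observations that are <= x."""
    return sum(1 for v in data if v <= x) / len(data)

def ks_statistic(data, cdf):
    """Largest vertical gap between S_n and the hypothesized cdf.
    S_n jumps from (i-1)/n to i/n at the ith sorted observation, so both
    sides of each jump are checked."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

# Hypothetical observations and hypothesized distribution (illustrative only).
data = [0.5, 1.0, 2.0, 4.0]
expo_cdf = lambda x: 1.0 - math.exp(-x / 2.0)
d_stat = ks_statistic(data, expo_cdf)
print(empirical_cdf(data, 1.0), d_stat)
```

Unlike the chi-square test, this comparison does not require grouping the data into cells.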
Graphic Analysis
Graphic analysis includes: histogram with fitted distribution and q-q plot. Goodness-of-fit tests represent lack of fit by a summary statistic, while plots show where the lack of fit occurs and whether it is important. Goodness-of-fit tests may accept the fit, but the plots may suggest the opposite, especially when the number of observations is small.
A data set of 30 observations is believed to be from a normal distribution. The following are the p-values from the chi-square test and the K-S test:
Chi-square test: 0.166
K-S test: > 0.15
What is your conclusion?
Histogram Plot
Text Report
Distribution Summary
Distribution:             Lognormal
Expression:               3 + LOGN(9.22, 7.13)
Square Error:             0.004997

Chi Square Test
Number of intervals       = 8
Degrees of freedom        = 5
Test Statistic            = 9.13
Corresponding p-value     = 0.106

Data Summary
Number of Data Points     = 200
Min Data Value            = 3.38
Max Data Value            = 36.1
Sample Mean               = 12
Sample Std Dev            = 5.91

Histogram Summary
Histogram Range           = 3 to 37
Number of Intervals       = 14
Usage Notes
The Fit All option tries all relevant distributions and picks the one with the smallest squared error. It can easily be fooled!
Be sure to try different numbers of histogram cells; this affects the p-value of the chi-square test, and your perception of the fit.
Since EXPO is a special case of ERLA, which is a special case of GAMMA, Fit All rarely selects EXPO or ERLA. Similarly, EXPO is a special case of WEIB.
Raw data can be read in from text files (looks for .dst), one value per line.
Distribution Summary
Distribution:             Weibull
Expression:               -0.5 + WEIB(4.59, 1.51)
Square Error:             0.006067

Chi Square Test
Number of intervals       = 6
Degrees of freedom        = 3
Test Statistic            = 2.97
Corresponding p-value     = 0.414

Data Summary
Number of Data Points     = 100
Min Data Value            = 0
Max Data Value            = 11
Sample Mean               = 3.64
Sample Std Dev            = 2.76

Distribution Summary
Distribution:             Poisson
Expression:               POIS(3.64)
Square Error:             0.025236

Chi Square Test
Number of intervals       = 6
Degrees of freedom        = 4
Test Statistic            = 19.8
Corresponding p-value     < 0.005

Data Summary
Number of Data Points     = 100
Min Data Value            = 0
Max Data Value            = 11
Sample Mean               = 3.64
Sample Std Dev            = 2.76
q-q Plot
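Recall that one way to generate data from cdf F is via
$$Y = F^{-1}(R)$$
where R is uniform on (0, 1). The q-q plot displays the sorted data
$$Y_1 \le Y_2 \le \cdots \le Y_n$$
vs.
$$F^{-1}\left(\frac{1/2}{n}\right),\; F^{-1}\left(\frac{3/2}{n}\right),\; \ldots,\; F^{-1}\left(\frac{n - 1/2}{n}\right)$$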
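The plotted pairs can be computed with a short sketch, assuming Python 3.8 or later and a fitted normal distribution. The function name and the data are hypothetical; any inverse cdf could be passed in place of the normal one.

```python
from statistics import NormalDist, fmean, stdev

def qq_points(data, inv_cdf):
    """Pair the ith smallest observation with the fitted quantile
    F^{-1}((i - 1/2) / n), for i = 1, ..., n."""
    ys = sorted(data)
    n = len(ys)
    return [(inv_cdf((i - 0.5) / n), y) for i, y in enumerate(ys, start=1)]

# Hypothetical observations (illustrative only).
sample = [9.1, 10.4, 11.2, 12.0, 12.3, 13.5, 14.8, 15.1]
fitted = NormalDist(fmean(sample), stdev(sample))
points = qq_points(sample, fitted.inv_cdf)
for theoretical, observed in points:
    print(round(theoretical, 2), observed)
```

If the fitted distribution is adequate, the pairs fall close to a straight line through the origin with slope 1; systematic departures in one tail show where the fit fails.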
[Figures: q-q plot examples. Annotation on one plot: Poor fit! Misses badly in the left tail.]
Empirical distribution
Each data point is equally likely to be resampled.
In Arena: DISCRETE(1/n, X1, 2/n, X2, ..., 1, Xn)
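A minimal sketch of this resampling, assuming Python. The function name and the four data values are hypothetical. It mirrors the cumulative-probability form of the DISCRETE expression: draw a uniform number u and return the first value whose cumulative probability is at least u.

```python
import bisect
import random

def discrete_sample(cum_probs, values, u):
    """Return the first value whose cumulative probability is >= u."""
    return values[bisect.bisect_left(cum_probs, u)]

# Hypothetical observations (illustrative only).
data = [2.1, 3.4, 5.7, 8.1]
n = len(data)
cum_probs = [(i + 1) / n for i in range(n)]   # 1/n, 2/n, ..., 1

rng = random.Random(42)                       # fixed seed for repeatability
draws = [discrete_sample(cum_probs, data, rng.random()) for _ in range(1000)]
print(sorted(set(draws)))
```

When every observation has probability 1/n, `rng.choice(data)` does the same job; the cumulative form is shown because it extends to unequal probabilities.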
Interpolated Empirical
To fill in gaps, we interpolate between the sorted data points. In Arena:
CONT(0, X1, 1/(n-1), X2, 2/(n-1), X3, ..., 1, Xn)
For example, with n = 4 sorted data points:
CONT(0, 2.1, .33, 3.4, .67, 5.7, 1, 8.1)
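The interpolation can be sketched as an inverse-transform sampler, assuming Python and assuming that the cdf is piecewise linear between the listed (cumulative probability, value) pairs. The function name is hypothetical, and this is an illustration of the idea rather than Arena's own implementation.

```python
import bisect
import random

def cont_inverse(pairs, u):
    """Invert a piecewise-linear cdf given as sorted (cum_prob, value) pairs."""
    probs = [p for p, _ in pairs]
    vals = [v for _, v in pairs]
    j = bisect.bisect_right(probs, u)   # index of the first breakpoint above u
    if j >= len(probs):
        return vals[-1]                 # u at (or beyond) the top of the cdf
    if j == 0:
        return vals[0]                  # u below the first breakpoint
    p0, p1 = probs[j - 1], probs[j]
    v0, v1 = vals[j - 1], vals[j]
    return v0 + (u - p0) / (p1 - p0) * (v1 - v0)

# Same points as CONT(0, 2.1, .33, 3.4, .67, 5.7, 1, 8.1)
pairs = [(0.0, 2.1), (0.33, 3.4), (0.67, 5.7), (1.0, 8.1)]
rng = random.Random(7)
draws = [cont_inverse(pairs, rng.random()) for _ in range(1000)]
print(cont_inverse(pairs, 0.5), min(draws), max(draws))
```

Unlike plain resampling, the draws can take any value between the smallest and largest observation, but never a value outside that range.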
Interpolated Empirical cdf
[Figure: cumulative probability (0 to 1, with breakpoints at 0.33 and 0.67) versus X (0 to 10).]
Breakpoints Method
Useful for modeling quantities with a large number of possible outcomes, like quarterly sales volume or aggregate number of overtime hours. Minimum data needed: smallest and largest possible values.
Ex: sales of XYZ-123 will be no less than 1000 units, but no more than 5000 units. UNIF(1000,5000)
Comments: This is typically a poor model since the probability is spread out evenly from low to high, so the extremes are just as likely as the middle values. However, if you want to be conservative (maximum uncertainty) or you cannot justify any additional information, then this model may be reasonable.
Better data: smallest and largest possible values, and most likely value.
Ex: sales of XYZ-123 will be no less than 1000 units, no more than 5000 units, and is most likely to be 3500 units. Triangular distribution, TRIA(1000,3500,5000)
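The two no-data models for XYZ-123 sales can be compared with a quick sketch, assuming Python's standard `random` module. The seed and sample size are arbitrary choices. Note that `random.triangular` takes its arguments in the order (low, high, mode), which differs from TRIA(min, mode, max).

```python
import random
import statistics

rng = random.Random(2024)          # fixed seed for repeatability
low, mode, high = 1000, 3500, 5000

uniform_draws = [rng.uniform(low, high) for _ in range(10000)]
# Argument order is (low, high, mode), unlike TRIA(min, mode, max).
tri_draws = [rng.triangular(low, high, mode) for _ in range(10000)]

uniform_mean_theory = (low + high) / 2          # mean of the uniform model
tri_mean_theory = (low + mode + high) / 3       # mean of the triangular model

print(statistics.fmean(uniform_draws), uniform_mean_theory)
print(statistics.fmean(tri_draws), tri_mean_theory)
```

Adding the most likely value shifts the mean toward 3500 and concentrates probability away from the extremes.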
[Figure: density plots for the two breakpoint models, including Uniform(1000, 5000). Horizontal axes: values in thousands; vertical axes: values x 10^-4.]
Best data: smallest and largest possible values plus 1-3 breakpoints (values and a percentage chance of being less than that value).
Ex: sales of XYZ-123 will be between 1000 and 5000, and

sales     chance of being <= sales
2000      25%
3500      75%
4500      99%

In Arena: CONT(0, 1000, .25, 2000, .75, 3500, .99, 4500, 1, 5000)
Comments:
Use only as many breakpoints as you can confidently get. Three is usually the maximum if no data are available.
Try to get breakpoints near the extremes if possible, since the extremes are often not realistic.
Sometimes it is easier to get people to give the chance of exceeding a value, rather than being less than a value.
Discrete Outcomes
Used to model discrete events, like making or not making a sale, or whether a new product is ready to ship in the first, second or third quarter.
Data: A percentage chance of each possible outcome, so that the total is 100%.
Example: We have an 80% chance of landing the contract with Big Corp. If we land the contract, then there is only a 10% chance that the initial orders will arrive first quarter; a 50% chance they will start the second quarter; and a 40% chance they will start in quarter 3. Sales will be $250K in each quarter we produce. How to model the sales for the year related to Big Corp?
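One possible way to model this, sketched in Python. It rests on an assumption the example does not state: once orders start, we produce in every remaining quarter of the year, so a first-quarter start yields 4 producing quarters, a second-quarter start 3, and a third-quarter start 2. All names are hypothetical.

```python
import random

P_CONTRACT = 0.80
START_QUARTER_PROBS = {1: 0.10, 2: 0.50, 3: 0.40}   # given that we land the contract
SALES_PER_QUARTER = 250                             # in $K

def annual_sales_distribution():
    """Map each possible annual sales value ($K) to its probability."""
    dist = {0: 1.0 - P_CONTRACT}
    for quarter, p in START_QUARTER_PROBS.items():
        # Assumption: production runs from the start quarter through quarter 4.
        sales = SALES_PER_QUARTER * (5 - quarter)
        dist[sales] = dist.get(sales, 0.0) + P_CONTRACT * p
    return dist

def sample_annual_sales(rng):
    """Draw one year's sales ($K) by simulating the two-stage outcome."""
    if rng.random() >= P_CONTRACT:
        return 0
    quarters = list(START_QUARTER_PROBS)
    weights = list(START_QUARTER_PROBS.values())
    quarter = rng.choices(quarters, weights=weights)[0]
    return SALES_PER_QUARTER * (5 - quarter)

dist = annual_sales_distribution()
expected_sales = sum(s * p for s, p in dist.items())
rng = random.Random(1)
draws = [sample_annual_sales(rng) for _ in range(1000)]
print(dist, expected_sales)
```

Under the same assumption, the resulting outcomes (0, 500, 750 and 1000, in $K) could be entered as a single discrete distribution using their cumulative probabilities.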