Excel Normal Distribution Functions
(I created the following figures in Excel. If you would like to know how I created them, see How to
Create Normal Curves with Shaded Areas in New Excel.)
(See Excel Random Numbers … A Reader’s Comment for information related to this article.)
When a reader asked me how to generate a random number from a normal distribution, she set me to
thinking about doing statistics with Excel.
Many of us were introduced to statistics in school and then forgot what little we learned...often within
seconds of the final exam. Also, when we took statistics, many of us weren't taught how to use it with
Excel. This is unfortunate, because in business it's often useful to have some grasp of that topic.
For all these reasons, I thought it would be worthwhile to briefly explore normal -- or "bell-shaped" --
curves in Excel. This is a commonly used area of statistics, and one for which Excel provides several
useful functions.
One interesting thing about the normal curve is that it occurs frequently in many different settings:
As a final example, here's a surprising occurrence of the normal curve: Take any population, whether
it's normally distributed or not. Randomly select at least 30 members from that population, measure
them for some characteristic, and then find the average of those measures. That average is one data
point. Return the samples, select another random sample of the same number, and find the average
of their measures. Do the same again and again. The Central Limit Theorem says that those averages
tend to have a normal distribution.
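The sampling procedure described above can be simulated outside Excel. The following is a minimal Python sketch, not part of the original workbook. It assumes an exponential population with a mean of 1 (chosen only because it is clearly not bell-shaped), and the names SAMPLE_SIZE, NUM_SAMPLES, and sample_mean are made up for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30     # members drawn per sample, as in the text
NUM_SAMPLES = 5000   # number of sample averages to collect (arbitrary choice)


def sample_mean(n):
    """Average of n draws from an exponential population whose mean is 1."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))


# Each average is one data point, exactly as described above.
averages = [sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

mean_of_averages = statistics.fmean(averages)
sd_of_averages = statistics.stdev(averages)

# Share of the averages that fall within one standard deviation of their mean.
within_one_sd = sum(
    abs(a - mean_of_averages) <= sd_of_averages for a in averages
) / NUM_SAMPLES

print(f"mean of averages: {mean_of_averages:.3f}")  # population mean is 1
print(f"sd of averages:   {sd_of_averages:.3f}")    # theory: 1 / sqrt(30)
print(f"within one sd:    {within_one_sd:.1%}")     # a normal curve gives about 68%
```

Even though the population itself is badly skewed, the averages should cluster symmetrically around 1, with roughly 68% of them within one standard deviation.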
Normal distributions are all around us. Therefore, as painlessly as possible, let's take a closer look at
how we work with them using Excel.
Brief Definitions
We need to get some brief definitions out of the way so that we can
start to describe data using Excel functions.
From cholesterol to zebra stripes, the normal probability distribution describes the proportion of a
population having a specific range of values for an attribute. Most members have amounts that are
near the average; some have amounts that are farther away from the average; and some have
amounts extremely distant from the average.
For example, a population could be all the stripes on all the zebras in the world. The normal curve
would show the proportion of stripes that have various widths.
The standard deviation of a sample is a measure of the spread of the sample from its mean. (We're
talking about many items in a "sample," of course, not just a single item.) In a normal distribution,
about 68% of a sample is within one standard deviation of the mean. About 95% is within two
standard deviations. And about 99.7% is within three standard deviations. The numbers in the figure
above mark standard deviations from the mean.
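These three percentages can be checked outside Excel. The sketch below uses NormalDist from Python's standard statistics module as a stand-in for Excel's functions; the helper name share_within is made up for illustration.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Proportion of a normal population within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.2%}")
# within 1 standard deviation(s): 68.27%
# within 2 standard deviation(s): 95.45%
# within 3 standard deviation(s): 99.73%
```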
The z value is the distance between a value and the mean in terms of standard deviations. In the
figure above, each number is a z value.
Beginning with Excel 2010, Microsoft updated many of its statistics functions. To provide backward
compatibility, it gave the updated functions new names that contain periods and kept the old names
working. I show both versions in this article, but Microsoft recommends that you use the new version
if your copy of Excel supports it.
Several of the following functions require a value for the standard deviation. There are at least two
ways to find that value.
First, if you have a sample of the data, you can estimate the standard deviation from the sample using
one of these formulas:
=STDEV.S(range_of_values)
=STDEV(range_of_values)
On the other hand, if you're working with the entire population, you calculate the standard deviation
using:
=STDEV.P(range_of_values)
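The difference between the sample and population calculations is easy to see with a small data set. This Python sketch uses statistics.stdev (which divides by n - 1, like STDEV.S) and statistics.pstdev (which divides by n, like STDEV.P). The eight values are made up for illustration.

```python
import statistics

# Hypothetical data, standing in for range_of_values.
values = [2, 4, 4, 4, 5, 5, 7, 9]

sample_sd = statistics.stdev(values)       # divides by n - 1, like STDEV.S
population_sd = statistics.pstdev(values)  # divides by n, like STDEV.P

print(f"sample standard deviation:     {sample_sd:.3f}")      # 2.138
print(f"population standard deviation: {population_sd:.3f}")  # 2.000
```

The sample version is always a little larger, because dividing by n - 1 compensates for estimating the mean from the same data.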
However, if you're working with rough estimates, you must take a different approach, because you
don't have actual data to support your estimates.
In this case, first calculate the range. This is the smallest likely value subtracted from the largest likely
value. By "likely," let's assume that the actual value will fall within that range about 95% of the time.
Remember that about 95% of a sample is within two standard deviations on each side of the mean.
(This is a total of four standard deviations, of course.) Therefore, if we divide the range by four we
should have the approximate standard deviation.
Merely dividing the range by four might seem to be a slipshod approach. But consider the way this
calculation often is used.
Suppose you're forecasting sales for next year. You think sales will be about 1,000, but the number
could be as high as 1,200 and as low as 800. With that information, you can put a normal curve
around your estimated sales and begin to generate a variety of forecasts for profits and cash flow.
To emphasize, these numbers are only your best estimates. Therefore, using an estimated standard
deviation doesn't seem quite as sloppy as it otherwise might.
Based on these estimates, your mean sales will be about 1,000 and your standard deviation will be
about (1200 - 800) / 4 = 100. With this information, you can use the following functions to perform
many of the calculations you will need in your analysis.
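As a rough check on the range-divided-by-four shortcut, here is a short Python sketch that uses the sales figures above. NormalDist comes from Python's standard statistics module, and the 1,150 sales figure used for the z value is hypothetical.

```python
from statistics import NormalDist

low, high, expected = 800, 1200, 1000   # sales estimates from the text

# Range divided by four approximates the standard deviation.
estimated_sd = (high - low) / 4

sales = NormalDist(mu=expected, sigma=estimated_sd)

# Share of the curve that falls between the low and high estimates.
coverage = sales.cdf(high) - sales.cdf(low)

# z value of a hypothetical sales figure: distance from the mean in standard deviations.
z_value = (1150 - expected) / estimated_sd

print(f"estimated standard deviation: {estimated_sd:.0f}")  # 100
print(f"coverage of 800 to 1,200:     {coverage:.2%}")      # 95.45%
print(f"z value of 1,150:             {z_value:.1f}")       # 1.5
```

The coverage comes out close to the 95% assumed when the range was chosen, which is all the shortcut needs.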
Therefore, the percentage of women taller than 68 inches is 1 - 84.13%, or approximately 15.87%.
This value is represented by the shaded area in the chart above.
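The subtraction above is the general pattern for a right-hand tail: one minus the cumulative probability. The 84.13% figure is the cumulative probability at a z value of 1, so this Python sketch works in z values rather than assuming a mean and standard deviation for heights. The variable names are made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

share_shorter = standard_normal.cdf(1.0)  # cumulative share below z = 1
share_taller = 1.0 - share_shorter        # everything to the right of z = 1

print(f"shorter: {share_shorter:.2%}")  # shorter: 84.13%
print(f"taller:  {share_taller:.2%}")   # taller:  15.87%
```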
NORM.S.DIST(z, cumulative)
NORMSDIST(z)
NORM.S.DIST translates the number of standard deviations (z) into cumulative probabilities.
(The probability mass function, PMF, gives the probability that a discrete -- that is, non-continuous --
random variable is exactly equal to some value.)
To illustrate:
Therefore, the probability of a value being within one standard deviation of the mean is the difference
between these values, or 68.27%. This range is represented by the shaded area of the chart.
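The same arithmetic can be reproduced outside Excel. In this Python sketch, NormalDist().cdf plays the role of NORM.S.DIST with cumulative set to TRUE, and NormalDist().pdf plays the role of NORM.S.DIST with cumulative set to FALSE. That pairing is an analogy for illustration, not a statement about how Excel computes its results.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

below_plus_one = standard_normal.cdf(1.0)    # cumulative probability at z = +1
below_minus_one = standard_normal.cdf(-1.0)  # cumulative probability at z = -1

# Probability of landing within one standard deviation of the mean.
within_one = below_plus_one - below_minus_one

# Height of the curve at the mean, which is a density and not a probability.
height_at_mean = standard_normal.pdf(0.0)

print(f"below z = +1:   {below_plus_one:.2%}")   # 84.13%
print(f"below z = -1:   {below_minus_one:.2%}")  # 15.87%
print(f"within one:     {within_one:.2%}")       # 68.27%
print(f"height at mean: {height_at_mean:.4f}")   # 0.3989
```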
The figure shows the area represented by the 25% of the American women who are shorter than this
height.
NORM.S.INV(probability)
NORMSINV(probability)
To illustrate, suppose you care about the half of the sample that's closest to the mean. That is, you
want the z values that bound the 25% of the sample just below the mean and the 25% just above the
mean.
The following formulas provide those boundaries of -0.674 and +0.674, as illustrated by the figure.
=NORM.S.INV(0.25)
=NORM.S.INV(0.75)
=NORMSINV(0.25)
=NORMSINV(0.75)
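For a cross-check outside Excel, the inv_cdf method of NormalDist in Python's standard statistics module does the same job as NORM.S.INV: it turns a cumulative probability back into a z value. The variable names in this sketch are made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

lower_z = standard_normal.inv_cdf(0.25)  # z value with 25% of the curve below it
upper_z = standard_normal.inv_cdf(0.75)  # z value with 75% of the curve below it

# The two boundaries should enclose the middle half of the distribution.
middle_share = standard_normal.cdf(upper_z) - standard_normal.cdf(lower_z)

print(f"lower boundary: {lower_z:.3f}")       # -0.674
print(f"upper boundary: {upper_z:.3f}")       # 0.674
print(f"middle share:   {middle_share:.0%}")  # 50%
```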
Also, remember that the RAND() function returns a random number between 0 and 1. That is, RAND()
generates random probabilities. Therefore, you can use the NORM.INV function to calculate a random
number from a normal distribution, using this formula:
=NORM.INV(RAND(), mean, standard_dev)
=NORMINV(RAND(), mean, standard_dev)
However, if you use Classic Excel with a large number of standard deviations, you might want to use a
different approach. Nearly ten years ago, Jerry W. Lewis -- a former Excel MVP and a professional
statistician -- offered a stern warning. Prior to Excel XP (2002), he wrote, NORMINV "produced a very
un-normal fraction of values around 6 million standard deviations from the mean."
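One such approach is the Box-Muller method. This formula uses it to return a random number from a
standard normal distribution (mean 0, standard deviation 1):
=SQRT(-2*LN(RAND()))*SIN(2*PI()*RAND())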
The Box-Muller method is mathematically exact, Jerry writes, if implemented with a perfect uniform
random number generator and infinite precision.
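To compare the two ways of generating normal random numbers discussed in this section, here is a minimal Python sketch. It is not Jerry's code or Excel's implementation. The function names are made up, and the mean of 1,000 and standard deviation of 100 are borrowed from the sales example earlier in the article.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

random.seed(7)  # fixed seed so the run is repeatable


def uniform_open():
    """Uniform random number strictly between 0 and 1 (never exactly 0)."""
    u = random.random()
    while u == 0.0:
        u = random.random()
    return u


def normal_by_inverse(mean, sd):
    """Inverse approach: feed a random probability to the inverse normal function."""
    return NormalDist(mean, sd).inv_cdf(uniform_open())


def normal_by_box_muller(mean, sd):
    """Box-Muller approach: same shape as the SQRT/LN/SIN worksheet formula."""
    z = math.sqrt(-2.0 * math.log(uniform_open())) * math.sin(
        2.0 * math.pi * random.random()
    )
    return mean + sd * z  # rescale the standard normal value


draws_inverse = [normal_by_inverse(1000, 100) for _ in range(20000)]
draws_box_muller = [normal_by_box_muller(1000, 100) for _ in range(20000)]

for label, draws in (("inverse", draws_inverse), ("Box-Muller", draws_box_muller)):
    print(f"{label}: mean {fmean(draws):.1f}, standard deviation {stdev(draws):.1f}")
```

Both lists should show a mean near 1,000 and a standard deviation near 100. The Box-Muller formula by itself returns a standard normal value, so it has to be multiplied by the standard deviation and added to the mean, as the last line of normal_by_box_muller does.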
As I mentioned earlier, I created all of the figures for this article in Excel. If you would like to know
how I created them, see How to Create Normal Curves In Excel, With Shaded Areas.
References
I must have at least 15 statistics books gathering dust on bookshelves in the basement. Even so,
these two books offered clear explanations that you might find useful: