Exploring and Producing Data For Business Decision Making Module 2
Table of Contents
1. Preface
2. Module 2 Exploring and Producing Data for Business Decision Making
1. Lesson 2-1 Measures of Central Tendencies
1. Lesson 2-1.1 Measures of Central Tendencies
2. Lesson 2-1.2 Mean and Median in Excel
2. Lesson 2-2 Measures of Dispersion
1. Lesson 2-2.1 Measures of Dispersion
2. Lesson 2-2.2 Standard Deviation in Excel
3. Lesson 2-3 Percentiles and Z-Score
1. Lesson 2-3.1 Percentiles and Z-Score
2. Lesson 2-3.2 Z-Score in Excel
4. Lesson 2-4 Discrete and Continuous Random Variables
1. Lesson 2-4.1 Discrete and Continuous Random Variables
2. Lesson 2-4.2 Expected Value in Excel
5. Lesson 2-5 Normal Distribution
1. Lesson 2-5.1 Normal Distribution
2. Lesson 2-5.2 Normal Distribution in Excel
3. Lesson 2-5.3 Standard Normal Distribution in Excel
4. Lesson 2-5.4 "Less Than" in Excel
5. Lesson 2-5.5 "Greater Than" in Excel
6. Lesson 2-5.6 "Between Values" in Excel
7. Lesson 2-5.7 Finding Z for a Given Probability (I N VINV) in Excel
6. Lesson 2-6: Standard Normal Distribution Table
1. Lesson 2-6.1 Finding Z for a Given Probability (I N VINV) in Excel
Preface
Thank you for choosing a Gies eBook.
This Gies eBook is based on an extended video lecture transcript made from Module 2 of
Professor Fataneh Taghaboni-Dutta’s Exploring and Producing Data for Business Decision
Making on Coursera. The Gies eBook provides a reading experience that covers all of the
information in the MOOC videos in a fully accessible format. The Gies eBook can be used with
any standards-based e-reading software supporting the ePUB 3.0 format.
Each Gies eBook is broken down by lessons that are navigable using your e-reader’s table of
contents feature. Within each lesson the following sequence of content will always occur:
Lesson title
A link to the web-based videos for each lesson (You must be online to view.)
Within the lesson, every time there is a slide change or a switch to the next informative video
scene, you will be presented with:
All Gies eBooks are designed with accessibility and usability as a priority. This design is intended
to serve all readers in a flexible manner regardless of their choice of digital reading tools.
If you have any questions or suggestions for improvement for this Gies eBook, please contact
Giesbooks@illinois.edu
Copyright © 2019 by Fataneh Taghaboni-Dutta
Published by the Gies College of Business at the University of Illinois at Urbana-Champaign, and
the Board of Trustees of the University of Illinois
Module 2 Exploring and Producing Data for
Business Decision Making
Lesson 2-1 Measures of Central Tendencies
Summarizing the data allows us to communicate about the data more effectively
1. Graphical methods
2. Numerical methods
Transcript
When you have a large data set, just looking at the data is not going to give you much insight. We can
summarize the data so that we can quickly communicate some of its major characteristics. There are two
major ways to summarize data. One is graphical methods, some of which we discussed in the previous module.
The other is numerical methods. An example of this would be the average value. In this lecture,
we will focus on numerical summarization.
Central Tendency - Slide 2
Transcript
The goal is not only to generate these summaries but also to learn how to use them effectively. One very useful
characteristic is the central tendency of the data. Measures of central tendency tell us where the center of the
data tends to be. Sometimes we think of the central tendency as a typical value. However, as we will see,
not all measures of central tendency are necessarily typical values. We will focus on two measures, mean and
median.
Representation of population mean and sample mean. Here, population is every single person residing in Illinois
and sample is the subset of the population we are studying.
Transcript
The mean is simply the average; it is also known as the expected value. The population mean is represented by the
Greek letter μ, while the sample mean is represented by the symbol x̄. The sample mean is a statistic, which we
use to estimate μ, a population parameter. One thing that I wish to bring to your attention is the notation
used in statistics. When we use the entire population to calculate a parameter, we use Greek letters such as μ,
and a capital N for the population size. When we use sample data to generate a statistic, the
notation is in lower-case English letters, such as x̄ and n.
Representation of population and sample. Here, population is every single person residing in Illinois and sample
is the subset of the population we are studying.
Population: Population mean (μ) - Parameter
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$
Sample: Sample mean (x̄) - Statistic
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Transcript
While in this class I don't expect you to calculate anything manually, it is important to know how
these values are calculated when you are using any software. For that purpose, I will share with you the
equations for a basic understanding of how the numbers are generated.
The first equation you see is how we calculate the mean for a population, represented by the letter μ. The mean is
calculated by simply adding the value of the variable for each element of the population and then dividing by
the size of the population, which is the capital N. If you have data from a sample study, the principle of the
calculation stays the same; the notation changes to represent the fact that the data is collected
from a sample. The second equation is for data sets collected from a sample. In this case the variable for
each element of the sample is represented by lower-case x, and lower-case n represents the sample size. The
sample mean is calculated by adding all of the x's and then dividing that sum by the sample size, which is the
lower-case n. The sample mean is then used as a point estimator for the population mean.
Example: Average Money Spent (1 of 2) - Slide 5
We want to measure the average money spent by customers who come to our website. Data is collected from 7
customers. What is the mean spending?
Example average money spent - Slide 5
Customer $ spent (x)
1 $85.68
2 $67.21
3 $98.08
4 $34.78
5 $56.98
6 $27.93
7 $40.72
Transcript
We want to measure the average money spent by customers who come to our website. Data is collected from
seven customers. What is the mean spending? Here, the variable x represents the dollars spent by a customer. For
example, x1 represents the money spent by customer number 1, x2 represents the money spent by
customer number 2, and so on.
Example average money spent - Slide 6
Customer $ spent (x)
1 $85.68
2 $67.21
3 $98.08
4 $34.78
5 $56.98
6 $27.93
7 $40.72
$$\bar{x} = \frac{\sum_{i=1}^{7} x_i}{n} = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7}{7}$$
Transcript
The mean based on this sample is just the sum of the money spent by all seven customers divided by the
number of customers in our data, seven in this case. The sample statistic shows an average of $58.77 spent per
customer.
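The calculation above can be sketched in a few lines of Python. This is only an illustration of the sample mean formula using the seven values from the table; the variable names (spent, x_bar) are my own and are not part of the lecture, which uses Excel.

```python
from statistics import mean

# Dollar amounts spent by the seven customers in the example table
spent = [85.68, 67.21, 98.08, 34.78, 56.98, 27.93, 40.72]

# Sample mean: add all of the x's, then divide by the sample size n
n = len(spent)
x_bar = sum(spent) / n

print(round(x_bar, 2))        # mean computed by hand from the formula
print(round(mean(spent), 2))  # same value from the standard library
```

Both lines should print the same figure as the slide, $58.77 when rounded to cents.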
The median Md is a value such that 50% of all measurements, after having been arranged in numerical order, lie
above (or below) it
If the number of measurements is odd, the median is the middlemost measurement in the ordering
If the number of measurements is even, the median is the average of the two middlemost measurements
in the ordering
Transcript
The median is the middle value when the data is sorted in either ascending or descending order. It is sometimes
represented by the capital M with a d subscript. There is no difference between the population and sample symbols in
this case. If the number of measurements is odd, the median is the middlemost measurement in your ordering. If
the number of measurements is even, the median is the average of the two middlemost measurements in the
ordering.
Example: Median for Money Spent - Slide 8
Median for money spent
Customer $ spent (x)
1 $85.68
2 $67.21
3 $98.08
4 $34.78
5 $56.98
6 $27.93
7 $40.72
Ordered list
$ spent
$27.93
$34.78
$40.72
$56.98
$67.21
$85.68
$98.08
There is an arrow connecting both tables from the "Median for money spent" to the "Ordered list." In this last
table, the $ spent number $56.98 is highlighted. This means this value is the median.
Transcript
To find the median, first we sort the data in ascending order. The data point in the middle, in this case $56.98, is
the median. So, half of the remaining customers spent more than $56.98 and half spent less.
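The odd and even rules for the median can be sketched as follows. This is a minimal illustration, assuming the same seven values as the table; the names ordered and md are hypothetical and chosen only for this example.

```python
from statistics import median

spent = [85.68, 67.21, 98.08, 34.78, 56.98, 27.93, 40.72]

# Step 1: arrange the measurements in numerical order
ordered = sorted(spent)
n = len(ordered)

# Step 2: odd n -> middlemost value; even n -> average of the two middlemost values
if n % 2 == 1:
    md = ordered[n // 2]
else:
    md = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(md)             # median from the rule above
print(median(spent))  # same value from the standard library
```

With seven values the odd rule applies, so the fourth value in the ordered list is returned.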
Measure of tendency example 1 - Slide 9
Example 1
57000
58000
59000
62000
64000
65000
66000
68000
71000
80000
Mean = $65,000
Median = $64,500
Measure of tendency example 2 - Slide 9
Example 2
57000
58000
59000
62000
64000
65000
66000
68000
71000
80000
8,000,000
Mean = $786,363.64
Median = $65,000
Transcript
Now that we know how to calculate the mean and median, you may be asking yourself, so what? Calculating
these values is not just an intellectual exercise. There are times when one of these measures will be more useful than the other.
Remember, we often think of this value as the typical value. Let me give you an example. You're sitting next to
nine of your friends, so there are 10 of you in total. You all graduated from the same college with the same degree.
It is three years later. You all have slightly different experiences, but you're making more or less the same salary,
shown here in this table. The average salary here is $65,000. The median income, since we have an
even number of data points, is the average of the two values in the middle. Those are 64,000 and 65,000, so your
median is 64,500. As you all are sitting there and talking, an old classmate walks in. This classmate
graduated with you and had the same degree. Upon graduation, however, he was drafted by a professional basketball
team, and now his salary is $8 million. He joins your group, and now there are 11 of you. What happens to your
group's mean salary? Now the mean salary for your group is $786,000 and some.
How typical is that value for the group? What happens to the median value for the group? It stays at $65,000.
Which value represents the group's expected income better? In this case the answer is the median. Why? As you can
see, the mean is sensitive to the new player's salary, but the median is not. So, if your data set has one or two
extreme values, which we call outliers in statistics, then the mean is less representative and the median is more robust.
Have you ever looked at home prices in a given market? If not, just pick a city or a neighborhood and see what
you get. What gets reported is the median price of the homes. That way a customer will know for sure that if the
median is at the top of their price range, there are still 50% of the homes that are below that value. But if you rely
on the mean, the value might have been pulled to the low end because a few homes are dilapidated and
cheap, or it may look too expensive because of a few mansions that are very expensive.
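The effect of the one extreme salary can be checked with a short sketch. This is only an illustration using the salaries from the two example tables; the variable names are my own.

```python
from statistics import mean, median

# Example 1: the ten friends
salaries = [57000, 58000, 59000, 62000, 64000,
            65000, 66000, 68000, 71000, 80000]

# Example 2: the basketball player joins the group
with_outlier = salaries + [8_000_000]

for label, group in [("Example 1", salaries), ("Example 2", with_outlier)]:
    print(label, "mean:", round(mean(group), 2), "median:", median(group))
```

The mean jumps from 65,000 to about 786,364 when the outlier is added, while the median only moves from 64,500 to 65,000.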
We have looked at 5 possible sites for our new business. The annual rentals are:
Annual rental for new business
Location Annual Rental
A 84,000
B 78,000
C 114,000
D 103,200
E 93,600
What is the mean and median for this data set?
Transcript
So, now let's practice. We have looked at five possible sites for our new business, and the annual rentals are as
follows. Location A has an annual rent of 84,000, B an annual rent of 78,000, C an annual rent of 114,000, D an annual
rent of 103,200, and E an annual rent of 93,600. What are the mean and the median for this data set?
We have looked at 5 possible sites for our new business. The annual rentals are:
Annual rental for new business - Solved
Location Annual Rental
B 78,000
A 84,000
E 93,600
D 103,200
C 114,000
x̄ = (84000 + 78000 + 114000 + 103200 + 93600) ÷ 5 = 472800 ÷ 5 = $94560
Mean is $94,560
Median is $93,600
Transcript
The mean is the average of the five numbers, which is 94,560. The median, after arranging the numbers in
ascending order, is at location E with the annual rent of 93,600. And that is your median.
Symmetrical Curve
There is a histogram plot called "Distribution of number of ups", where the x-axis represents the number of ups
and the y-axis the frequency of the number of ups. The x-axis ranges from 15 to 65 and the y-axis ranges from 0
to 9000. In the histogram, there is a bell-shaped curve that goes from 25 to 65 on the x-axis. This is the normal
curve expected for the distribution of number of ups. The center of the curve, and also the tallest bar, is at 40 on the
x-axis with 8200 on the y-axis. The distribution is as follows:
Symmetrical curve example
Number of ups Frequency
30 1500
31 2000
35 6000
40 8200
45 4000
50 1000
Transcript
This graphical distribution shows the number of stocks that increased in value at the end of a trading day. This graph
is fairly symmetrical around its peak, which means it has roughly the same mean and median. When we have
outliers, values that fall far to the right or left side, the shape of the curve starts to become skewed. Let me
show you what happens to the mean and the median as we start having skewness in our data.
The slide shows a histogram, where the x-axis ranges from 0 to 10 and the y-axis ranges from 0 to 1500. In the
histogram there is a bell-shaped curve that spreads from 3.5 to 7 on the x-axis. All the bars are blue except for the
center bar located at 5 on the x-axis, which has the top half labeled in green as the median and the bottom half
labeled in red as the mean. The values of the distribution are as follows:
Median/mean histogram example
x-axis y-axis
4 100
4.5 500
5 900
5.5 500
6 200
6.5 100
Transcript
So, here I have a graph that I have generated based on a data set. And at first what you see is a fairly
symmetrical histogram. So, in this case, the red bar represents the mean and the green bar represents the
median. And all the other observations will fall on either side of the mean and the median. Those are the blue
bars. As you can see, when we are fairly symmetrical, the mean and the median are on top of one another which
means they're about the same value. Now, let me show you what happens as I change the spinner button.
The slide shows the same bell-shaped histogram as Slide 13 Example Graph (1 of 4). Here, the cursor is
selecting the spinner button on the left top corner of the graph.
The slide shows a right skewed histogram with a long right tail. The mean is in red and is located to the right of
the peak. The median is in green and is located to the right of the peak and three bars to the left of the mean.
This means that in this case the mean is larger than the median.
Transcript
So, let me show you how this will change as I use the spinner button to change the skewness of my data. As I'm going
down, you can see that the bulk of the data is pulling more and more to the left, with a long right tail.
The right tail is the skewness. So, what happened to the mean? As the right tail starts elongating, as the data becomes skewed to the right,
the mean starts to pull toward the right. This is similar to what happened when your friend who is a basketball player
joined your group: your mean went up, but the median didn't move as much.
The slide shows a left skewed histogram with a long left tail. The mean is in red and is located to the left of the
peak. The median is in green and is located to the left of the peak and two bars to the right of the mean. This
means that in this case the mean is smaller than the median.
Transcript
And the opposite will happen if I spin it the other way and start having a left tail. You would see that again; mean
and median start separating and then mean starts going toward the tail and the median stays a little further back.
So, the median is less likely to change as quickly as mean does.
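The way the mean follows the tail can be sketched with two small data sets. These numbers are made up for illustration and are not the data behind the slides; the second list is simply a mirror image of the first.

```python
from statistics import mean, median

# Made-up sample with a long right tail (one large value)
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]

# Mirror image of the same sample, which gives a long left tail
left_skewed = [13 - x for x in right_skewed]

for label, data in [("right tail", right_skewed), ("left tail", left_skewed)]:
    # The mean is pulled toward the tail; the median stays closer to the peak
    print(label, "mean:", mean(data), "median:", median(data))
```

In the right-tailed sample the mean is above the median, and in the left-tailed sample it is below, which matches what the spinner button shows.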
Daily Temperature Champaign Excel Sheet (1 of 7) - Slide 17
The slide shows the "Average of daily low and high temperatures (in degrees Fahrenheit) for N.Y." data set. The
data contains a table with three columns. The first column is the day number, which is from 1 to 26770. The
second column represent the date, and the third column is the temperatures of New York in degrees Fahrenheit.
Download the Daily Temperature excel file (Refer to Data - First worksheet)
Transcript
In this video I will show you how to calculate the mean and median using Excel. I'm using weather data for the city of
New York. One of the questions people often ask about a city is its climate. We can represent that by
summarizing the data and just talking about its mean or median. What you see here is data that's been
collected for the city of New York for many, many years. Actually, if you look at it, I have over 26,000 data records for
the city of New York. So, if I want to take this mean manually, it's going to take me a long time, which doesn't make
any sense.
The slide shows two new cells called "mean" and "median" vertically placed to the right of the data set. This slide
intends to show how to use the excel average function to calculate the mean of the New York temperature data.
Once written =AVERAGE the program automatically unfolds four options which say average, averagea,
averageif, and averageifs. The first option (average) is selected in order to find out the mean of the data.
Transcript
So, instead I will use Excel to calculate its mean and also its median. The mean is the average, and as I type in
“av”, you can see that Excel shows all the functions that start with those letters, so I don't really need
to type it all. Also, if you click on any of the functions it is showing, it will tell you what that function is going
to return.
So, for example, the first one is average; it will return the average, the arithmetic mean of its arguments. And if
you look at the other ones, it will tell you what the other functions can do. In our case, we're just going to use
average. We just want the arithmetic mean, which means that if we were doing this ourselves, we would take every single
day we had, add those temperatures, and then divide by the total number of days that we have. In this
case, I'm just going to press tab, and then I have to give it its arguments. One option is to say C1, which is
the first place that I have a temperature, then put a comma and then say C2, comma, and so on. Again, that
takes way too long. Another option is for me to grab this and drag down. Again, if I have 26,000 records, that's
going to take me a long time. So, instead, what I'm going to do is click on the first cell where I have my data, hold
control and shift, and press the down arrow button.
After selecting Average as the function to be used, this slide shows how to pick the entire column of temperature
to compute the average. First select the first value cell of the New York column. When doing this, the average
value cell mentioned before changes to this first value. The next step is to hold Ctrl + Shift and press the down
arrow button. This will pick the entire data set.
Transcript
And it will automatically pick the entire data set that I have, the 26 – over 26,000 records. Close your parenthesis,
press enter and scroll back up.
Daily Temperature Champaign Excel Sheet (4 of 7) - Slide 20
The slide intends to review if the selected function matches the pertinent data. After selecting the entire column
of New York temperatures, the mean value is computed as 55.2 and is shown in the cell. By selecting it, an
equation of that cell shows in the command input field which is highlighted in yellow. The equation
(=AVERAGE(C7:C26776)) shows that the computed average value from cell C7 to cell C26776 are the correct
data to select.
Transcript
And what you would see if I click on this is that it has picked the cells C7 to C26776. And the average temperature
turns out to be 55.2 degrees for the city of New York.
The slide intends to show how to use the Excel median function by writing =MEDIAN in the cell next to the
"median" cell. Once =MEDIAN is written, the program automatically shows the syntax to get the median value of
the whole data, which is “=MEDIAN(number1, number2, ...).” This function is used to find the median value
of the data.
Transcript
After writing the median function previously described, select the first cell of the temperature column and then
hold Ctrl + Shift and press the down arrow button. This selects the whole column to compute the median.
Transcript
The slide shows the outcome of the median function when selecting the New York column, which is 55.9.
Download the Daily Temperature excel file (Refer to Mean_Median - Second worksheet)
Transcript
Now, I may want to know what the median is. Remember, the median is that center point where 50% of the time the
temperature is above that value and 50% of the time the temperature is below that value. By comparing the
mean and the median, we should be able to tell whether we have a symmetrical curve or a skewed curve.
Remember, if the mean and median are about the same, then we have a symmetrical curve. So, let's see what
happens for the city of New York. Median will show up, I'll pick my first cell, control, shift, down, close the
parenthesis, return, and you can just scroll back up. And you can see that it's 55.9. So, the mean and median
seem to be very close. We will look at this data in more detail in a
later video.
Lesson 2-2 Measures of Dispersion
Range
Variance
Standard deviation
Transcript
In the previous lesson, we learned how to calculate the mean and median, which represent the middle point of our
data. We also mentioned that we often think of these summary statistics as a typical value. For example, when you
take your car for an oil change and are told that the average time to get this done is 30 minutes, do you expect to
be done in 20 minutes, 30 minutes, 40 minutes? How would you assess your chances of taking your car in and
out within 30 minutes? This is where some measure of how dispersed your data is can be helpful. Measures of
dispersion, also known as variation, tell us how spread out or compact the data tends to be. There are three main
measures of variation: the range, the variance, and the standard deviation. In this lecture, we will cover these
three basic measures of dispersion.
Measure of Dispersion (Variation) (2 of 2) - Slide 25
Transcript
The range is simply the largest observation minus the smallest observation. This can explain quickly how
widespread your data is.
Which ER does a better job?
Emergency Room (ER) Waiting time
A 5 minutes
B 5 minutes
The slide shows two histograms: "Time to process in ER A" and "Time to process in ER B." For both histograms
the x-axis is the waiting time in minutes and the y-axis is the frequency. In the "Time to process in ER A"
histogram the x-axis ranges from 1 to 7 and the y-axis ranges from 0 to 15. Here, there are three bars: the first
bar is for waiting time 4 min with a frequency of 8, the second bar is for waiting time 5 min with a frequency of 14,
and the third bar is for waiting time 6 min with a frequency of 6.
In the "Time to process in ER B" histogram the x-axis ranges from 1 to 9 and the y-axis ranges from 0 to 12.
Here, there are seven bars centered at 5 in the x-axis. The first bar is for waiting time 2 min with a frequency of 2,
the second bar is for waiting time 3 min with a frequency of 3, the third bar is for waiting time 4 min with a
frequency of 5, the fourth bar is for waiting time 5 min with a frequency of 11, the fifth bar is for waiting time 6 min
with a frequency of 5, the sixth bar is for waiting time 7 min with a frequency of 3, and the seventh bar is for
waiting time 8 min with a frequency of 2.
Transcript
Consider two emergency rooms. One measure of ER operation effectiveness is the time it takes to process the
patients. For these two ERs, we have taken some observations and found that the average time patients wait before
admission is about the same, 5 minutes. But in which of these two ERs would the patient experience
a waiting time closer to this average? To answer this question, we need to study the variability in our data. A
look at the data from ER A shows that the patients waited as little as 4 minutes and as much as 6 minutes. So,
the range of the data was 2 minutes. For ER B, the range is 6 minutes, that is, 8 minutes minus 2 minutes.
So, when comparing these two ERs, we can say that while the average is the same for both, the patients using ER B
experience more variability. For them, the average is less typical compared to the experiences of patients who
go to ER A. While range is one measure of dispersion, it's not very accurate, especially when the data set gets
large. The more commonly used and more meaningful measures of how variable a data set is are the standard
deviation and the variance.
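The range comparison can be reproduced with a short sketch. The frequencies below are read from the descriptions of the two histograms; the helper name expand and the dictionary layout are my own choices for this illustration.

```python
from statistics import mean

# Waiting time in minutes -> number of patients, as described for each histogram
er_a = {4: 8, 5: 14, 6: 6}
er_b = {2: 2, 3: 3, 4: 5, 5: 11, 6: 5, 7: 3, 8: 2}

def expand(freq):
    """Turn a frequency table into a flat list of observations."""
    return [minutes for minutes, count in freq.items() for _ in range(count)]

waits_a = expand(er_a)
waits_b = expand(er_b)

# Range = largest observation minus smallest observation
range_a = max(waits_a) - min(waits_a)
range_b = max(waits_b) - min(waits_b)

print("ER A: mean", round(mean(waits_a), 1), "range", range_a)
print("ER B: mean", round(mean(waits_b), 1), "range", range_b)
```

Both means come out close to 5 minutes, while the ranges are 2 and 6 minutes, so the average alone hides the difference between the two ERs.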
Standard deviation is a measure of dispersion of data. Small values of standard deviation will mean that the
data points are all close to the mean.
Standard deviation is calculated by taking the square root of the variance.
Transcript
Standard deviation is a measure of dispersion of data. Small values of standard deviation will mean that the data
points are close to the mean. Standard deviation is calculated by taking the square root of the variance. In
addition to expressing variability of the data, the standard deviation is used in more advanced topics of statistics
which we will cover later on in this course.
Variance - Slide 28
Population of size N:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \dots + (x_N - \mu)^2}{N}$$
Sample of size n:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n-1}$$
Transcript
The variance is computed differently for population data than it is for sample data. For population data, the
population mean is subtracted from each observation and the resulting value is squared. Once all values are
computed, the numbers are summed and then divided by the population size, upper case N. The resulting value
is represented by the Greek letter σ with the squared symbol after it. This is called σ squared. For sample data,
the sample mean is subtracted from each observation and the resulting value is squared. Once all these values
are computed, the numbers are summed and then divided not by the sample size but rather by n − 1. The
resulting value is represented by the English letter s squared. The sample variance s squared is a point estimate
for the population variance σ squared.
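The only difference between the two formulas is the divisor, which the following sketch makes visible. The data values are made up for illustration, and the function names population_variance and sample_variance are hypothetical; the standard library functions pvariance and variance are shown as a cross-check.

```python
from statistics import pvariance, variance

def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations

print(population_variance(data), pvariance(data))  # divide by N
print(sample_variance(data), variance(data))       # divide by n - 1
```

Because the sample formula divides by a smaller number, the sample variance is always a little larger than the population variance computed from the same values.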
Standard Deviation - Slide 29
Transcript
Regardless of how the variance is computed, the standard deviation is always simply the positive square root of
the variance. The population standard deviation is the square root of the population variance and it's represented
by the Greek letter σ. The sample standard deviation is the square root of the sample variance and it's
represented by the English letter s. Recall what I said about notation earlier: population parameters are
written with Greek letters such as μ and σ, and the notation for sample statistics is in lower-case English letters.
Example - Slide 30
Measure of tendency example 1 - Slide 30
Example 1
57000
58000
59000
62000
64000
65000
66000
68000
71000
80000
Mean = $65,000
Median = $64,500
Measure of tendency example 2 - Slide 30
Example 2
57000
58000
59000
62000
64000
65000
66000
68000
71000
80000
8000000
Mean = $786,363.64
Median = $65,000
Transcript
As I have mentioned, we will not be calculating these values manually and instead we will use software like
Excel, because our focus will be on large data sets and not small and rather trivial problems. However, I would
like to walk you through an example of how one would calculate this. The goal is to understand what the functions in the
software you will use are doing. Then we can focus on their meanings. Recall the example we used for calculating
mean and median. Now, let's use it for calculating variance and standard deviation. In this example, we will treat
this group of friends as a sample of the student population who graduated with them from their college. In the first
case, there are 10 friends, that's your small n, and the mean salary is 65,000, and that is x bar.
Measure of tendency example 1 - Slide 31
Example 1
57000
58000
59000
62000
64000
65000
66000
68000
71000
80000
Mean = $65,000
Median = $64,500
n = 10, x̄ = 65,000
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{10} (x_i - 65000)^2}{10-1}$$
$$s^2 = \frac{(57000 - 65000)^2 + (58000 - 65000)^2 + (59000 - 65000)^2 + \dots + (80000 - 65000)^2}{10-1} = 47{,}777{,}778$$
$$s = \sqrt{47{,}777{,}778} = 6{,}912.147$$
Transcript
So, now, we can put each individual data point in x of i. The first data point is 57,000, the second is 58,000, and
so on, with the last one being 80,000. We square each difference so that when we sum the differences, the
positives and negatives won't cancel each other out. Then we divide the sum total by n−1, which is 9 in this case. By
taking the square root of the variance, we find the standard deviation, and in this case, this is approximately
$6,912.
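The steps of this worked example can be followed in a short sketch. It assumes the ten salaries from the table and uses variable names of my own choosing; the lecture itself does this in Excel.

```python
from math import sqrt
from statistics import stdev

salaries = [57000, 58000, 59000, 62000, 64000,
            65000, 66000, 68000, 71000, 80000]

n = len(salaries)
x_bar = sum(salaries) / n

# Square each difference so positives and negatives do not cancel out
squared_diffs = [(x - x_bar) ** 2 for x in salaries]

# Sample variance: divide the sum by n - 1 (9 in this case)
s2 = sum(squared_diffs) / (n - 1)

# Sample standard deviation: positive square root of the variance
s = sqrt(s2)

print(round(s2), round(s, 3))
print(round(stdev(salaries), 3))  # cross-check with the standard library
```

The printed variance and standard deviation should agree with the slide values of 47,777,778 and 6,912.147.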
Measure of tendency example 2 - Slide 32
Example 2
57000
58000
59000
62000
64000
65000
66000
68000
71000
80000
8000000
Mean = $786,363.64
Median = $65,000
n = 11, x̄ = 786,363.64
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{11} (x_i - 786{,}363.64)^2}{11-1}$$
$$s^2 = \frac{(57000 - 786{,}363.64)^2 + (58000 - 786{,}363.64)^2 + \dots + (8{,}000{,}000 - 786{,}363.64)^2}{11-1} = 5{,}724{,}063{,}454{,}545.46$$
$$s = \sqrt{5{,}724{,}063{,}454{,}545.46} = 2{,}392{,}501.51$$
Transcript
What happens to the standard deviation when the next person with the $8 million salary joins the group? Now,
there are 11 friends, that's your n, and the mean salary now is $786,363, and that is your x̄. So, now, we can put
each individual data point in x of i, and just like before, calculate the variance. By taking the square root of
variance, we will find the standard deviation. And in this case, this is more than $2 million which is much larger
than the standard deviation of 6,912. So, if you only reported the average salary for this case, people might think
that the value is a typical value for this set of friends. But if you look at the variability of the salaries, you would
know that the reported mean salary is far from a typical value for the group members. Remember to watch the
video illustrations where I show you how to use Excel to calculate these statistics.
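To see how much one outlier inflates the standard deviation, the two groups can be compared side by side. This is a sketch using the salaries from the example tables and the standard library function stdev, which computes the sample standard deviation (dividing by n − 1); the variable names are my own.

```python
from statistics import stdev

salaries = [57000, 58000, 59000, 62000, 64000,
            65000, 66000, 68000, 71000, 80000]
with_outlier = salaries + [8_000_000]

s_before = stdev(salaries)      # ten friends
s_after = stdev(with_outlier)   # eleven friends, including the $8 million salary

print(round(s_before, 2))
print(round(s_after, 2))
print("ratio:", round(s_after / s_before))
```

The standard deviation grows from roughly $6,912 to more than $2 million, which signals that the new mean is far from a typical value for the group.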
Given the two histograms – which sample would have a mean that is a better representation of what one might
observe if a random observation from the sample was selected and why?
The slide shows two bell-shaped histograms named A and B. For both histograms, the x-axis is the value that
ranges from 0 to 10 and the y-axis is the frequency that ranges from 0 to 12. In histogram A the highest bar is at
5 in the x-axis and 11 in the y-axis. In histogram B the highest bar is at 5 in the x-axis and 7 in the y-axis.
Histogram A looks slimmer than histogram B. These are the values:
Histogram A
Value Frequency
0 1
1 3
2 5
3 7
4 9
5 11
6 10
7 8
8 6
9 4
10 2
Histogram B
Value Frequency
0 2
1 3
2 4
3 5
4 6
5 7
6 6
7 5
8 4
9 3
10 2
Transcript
So, now, let's practice. Given the two histograms, which would have a mean that is a better representation of
what one might observe if a random observation from the sample was selected, and why?
Given the two histograms – which sample would have a mean that is a better representation of what one might
observe if a random observation from the sample was selected and why?
The slide shows the same two bell-shaped histograms as Slide 33 Let's Practice. In addition, there is a red
vertical line in each histogram representing the mean, which is about 5.2.
Transcript
The answer is A. Both histograms show central tendencies near a value of 5 and both have observations with
values between 0 and 10. However, histogram A has more observations clustered around its center as compared to
histogram B, which has observations that are more spread out. This means that the distribution of observations for
sample A has a smaller standard deviation than that of sample B. Using averages to pass along information about a
data set without expressing the variability within it, which is most often expressed by the standard deviation, is
very incomplete. Knowing the average as a single summary point for an entire data set has little value if the average
is not a very good representative of our data. So, always pay attention to both the central tendency of the data
and the standard deviation when looking at summarized data.
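The comparison can be checked numerically. The sketch below is only an illustration: it assumes the frequency tables listed for histograms A and B are the complete data sets, and the function name mean_and_sd is made up for this example.

```python
from math import sqrt

# Frequency tables transcribed from the two histograms (value -> frequency).
hist_a = {0: 1, 1: 3, 2: 5, 3: 7, 4: 9, 5: 11, 6: 10, 7: 8, 8: 6, 9: 4, 10: 2}
hist_b = {0: 2, 1: 3, 2: 4, 3: 5, 4: 6, 5: 7, 6: 6, 7: 5, 8: 4, 9: 3, 10: 2}


def mean_and_sd(freq):
    """Return (mean, standard deviation) for a value -> frequency table.

    Assumption: the table is treated as the whole data set, so the
    variance divides by n rather than n - 1.
    """
    n = sum(freq.values())
    mean = sum(value * count for value, count in freq.items()) / n
    variance = sum(count * (value - mean) ** 2 for value, count in freq.items()) / n
    return mean, sqrt(variance)


mean_a, sd_a = mean_and_sd(hist_a)
mean_b, sd_b = mean_and_sd(hist_b)
print(f"A: mean = {mean_a:.2f}, standard deviation = {sd_a:.2f}")
print(f"B: mean = {mean_b:.2f}, standard deviation = {sd_b:.2f}")
```

Under these assumptions the two means are close to each other, while the standard deviation of A comes out smaller than that of B, which is the reason the mean of A is the better representative.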
Daily Temperature Champaign Excel Sheet (1 of 4) - Slide 35
The slide shows the same information as Slide 23 Daily Temperature Champaign Excel Sheet (7 of 7). To the left
is the "Average of daily low and high temperatures (in degrees Fahrenheit) for N.Y." data set. To the right, are the
mean and median cells with their respective values (55.2 for the mean and 55.9 for the median).
Download the Daily Temperature excel file (Refer to Mean_Median - Second worksheet)
Transcript
As we discussed in our lessons, just talking about the mean or median or the central tendency of a data set as a way
of summarizing data is very incomplete. If we have a lot of variability in our data, then the mean and the median
are not as good a representative as they could be otherwise. So, the way to know whether or not our data is very
dispersed is to find its standard deviation. It is the most complete way of measuring variability in the data.
The slide shows a new cell called "Std dev" placed below the median cell. The slide shows how to use
the Excel standard deviation function by typing "=ST" in the cell next to the "Std dev" cell. Once =ST is typed, the
program automatically lists eight options, including standardize, stdev.p, stdev.s, and stdevpa, among others.
The third option (stdev.s) is selected in order to find the standard deviation of the data.
Transcript
So, in Excel, this is relatively simple to do. So here, if I now look at standard deviation, I can put an equal sign here and
then start typing “st”. And again, as before, you would see that many different functions that start with s and t
show up as a way of getting to it quickly. If you look at the second one, you would see that it says it calculates
the standard deviation based on the entire population. Now I have a large data set here for many, many years for
New York City, but it is not the population; I have a sample. So, what I need to use is actually standard
deviation dot S, which is an estimate based on the sample of data that we have. And there are some mathematical
differences in the denominator of how standard deviation is calculated when you use population versus sample.
And these two values actually get very close to one another as your data set gets larger. So anyway, here I'm
going to pick the sample standard deviation, which is STDEV.S. I'm going to press tab and it will pick it for me.
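To make the denominator difference concrete, here is a small sketch outside of Excel. It assumes Python's statistics module as a stand-in: pstdev divides by n (as STDEV.P does) and stdev divides by n - 1 (as STDEV.S does). The temperature values are made up for illustration and are not the course data.

```python
import statistics

# Hypothetical temperatures, made up purely for illustration (not the course data).
temps = [31.0, 42.5, 55.0, 61.5, 70.0, 78.5, 66.0, 48.0]

sd_population = statistics.pstdev(temps)  # divides by n, like STDEV.P
sd_sample = statistics.stdev(temps)       # divides by n - 1, like STDEV.S
print(f"population: {sd_population:.3f}, sample: {sd_sample:.3f}")

# The two results differ only by a factor of sqrt((n - 1) / n),
# which approaches 1 as the number of observations n grows.
for n in (10, 100, 10_000):
    print(n, ((n - 1) / n) ** 0.5)
```

The sample version is always slightly larger, and the gap shrinks as the data set grows, which matches the remark that the two values get very close for large data sets.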
After selecting the standard deviation function, the slide intends to show how to select the entire data of the New
York column. First select the first value cell of the New York column, and then hold Ctrl + Shift and press the
down arrow button. This will pick the entire dataset.
Transcript
Daily Temperature Champaign Excel Sheet (4 of 4) - Slide 38
The slide shows the outcome of the standard deviation function, which is 17.37.
Download the Daily Temperature excel file (Refer to Mean_Median_SD - Third worksheet)
Transcript
And again, it's asking for the arguments. And the best way to do that once again is control shift down, close the
parenthesis, press return. And if I scroll up, you would see that it says it's about 17 degrees. So, what does this say? If
the temperature in New York follows a normal distribution, then the temperature that you'd expect to see in New York,
68% of the time, is going to be 55 degrees plus or minus 17 degrees. 95% of the time
it's going to be 55 plus or minus two standard deviations. So that is basically telling you how varied a
temperature you're going to have. So, we use the standard deviation to know the variability of temperatures, in
this case those that you would notice in the city of New York.
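The plus-or-minus ranges described above can be written out as a short calculation. This is a sketch that takes the slide values (mean 55.2, standard deviation 17.37) as given and assumes the temperatures are roughly normally distributed; the helper name band is made up for this example.

```python
mean_temp = 55.2   # mean shown on the slide
sd_temp = 17.37    # STDEV.S result shown on the slide


def band(k):
    """Return the range covering the mean plus or minus k standard deviations."""
    return mean_temp - k * sd_temp, mean_temp + k * sd_temp


low_68, high_68 = band(1)
low_95, high_95 = band(2)
print(f"about 68% of days: {low_68:.1f} to {high_68:.1f} degrees F")
print(f"about 95% of days: {low_95:.1f} to {high_95:.1f} degrees F")
```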
Lesson 2-3 Percentiles and Z-Score
Percentiles - Slide 39
Percentiles represent the approximate percentage of values in a data set that are below a certain value.
Transcript
Sometimes we are interested in the relative position of a value in relationship to other values in a data set, and we
can use percentiles for this. Percentiles represent the approximate percentage of values in a data set that are
below a certain value. For example, you may be interested in taking a new job, and you're offered a salary of
$100,000. How do you know if the firm is giving you a good offer? I'm pretty sure that you will be happier with this
number if you know that your offer puts you at the 95th percentile, which would mean that you're getting paid
better than 95% of others with similar job titles. You may not be as excited if you knew your salary is at the 25th percentile,
which means 75% of people get higher pay than you. So, percentiles very quickly will inform you about the
position of the value you are studying.
How to Calculate Percentiles (1 of 2) - Slide 40
Transcript
How do we calculate percentiles? We already know something about this. If you recall, we talked about the median,
which is the value where 50% of the data points are higher than that and 50% of the data points are less
than that. So, the median is the 50th percentile. To find the median, we put the data set in order, and the point in the
middle is the median or the 50th percentile.
Transcript
So, to find any percentile, we do the same. We order the data set. For a given percentile, let's say 10%, the data
point that has 10% of the observations below it is the 10th percentile.
Percentiles Example - Slide 42
Percentile example: salary data
$115,472
$105,845
$105,582
$102,551
$98,188
$94,220
$91,380
$89,828
$89,697
$89,519
The table shows a list of salary data in descending order. The 60th percentile is between $102,551 and $98,188,
the 50th percentile is between $98,188 and $94,220, and the 10th percentile is between $89,697 and $89,519.
Transcript
Consider this example where we have data for ten individual salaries for people with job titles similar to the one you are
getting hired for. I have already ordered the ten data points in descending order here. The 10th percentile is
between these two values, where one value is less than it, which is 10% of the values, and
nine values are more than it, which is 90% of the values. The 50th percentile is here. And your salary of
$100,000 will fall here, which is close to the 60th percentile. There are different ways of calculating percentiles, but
all methods will give you results that are close. For now, I would like you to understand what percentiles are
telling us, which is the position of a given value in relationship to other values in the data set.
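As a rough check of where the $100,000 offer falls, the sketch below counts how many of the ten listed salaries are below it. It uses the simplest "percent of values below" definition, which is only one of the several methods mentioned above; the function name percent_below is made up for this example.

```python
# The ten salaries from the slide.
salaries = [115472, 105845, 105582, 102551, 98188,
            94220, 91380, 89828, 89697, 89519]


def percent_below(data, value):
    """Percentage of observations in data that are strictly below value."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)


offer_percentile = percent_below(salaries, 100_000)
print(offer_percentile)  # six of the ten salaries are below $100,000
```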
Z-score - Slide 43
Z-score represents how many standard deviations an observation is from the mean.
Z = (x − μ) ÷ σ
x: Variable of interest
μ: Mean
σ: Standard deviation
Transcript
When we have large sets of data, then we have another way of finding the position of a specific observation. This
is known as the z-score. By using the z-score, we can find the proportion of data points that are less than a
specific value. The z-score is calculated using this formula. Here, x is the variable of interest, μ is the mean for
the population, and σ is the standard deviation. For example, if you were graduating from a college with a degree
in business, you get a job offer and would like to know how your offer stacks up against those of others, meaning all your peers
from all colleges with job offers similar to yours. If all you knew is the median, then you can tell if you are in the
top 50 percent or the lower 50 percent. But are you at 51 percent or 95 percent? We can calculate the position of
your offer by calculating its z-score.
Z-score Example (1 of 2) - Slide 44
Transcript
Now let's expand on the example we were using before. You interview for a job and get an offer and would like to
know how competitive your offered salary is. All you have is some data based on published reports on websites.
Let's look at such a website. I'm using payscale.com. Searching for a job title such as business analyst gives the
following information. The median is $54,030, and the standard deviation is about $8,600. You get an offer of
$65,000. This is above the median, but how does the salary stack up compared to others?
You get an offer of $65,000 – this is above the median, but how does this salary stack up compared to others?
Transcript
To know where this salary falls, you can determine its percentile. To do this, we need to first find out how many
standard deviations the offered salary is from the mean or the median. This is known as the z-score, and it's
calculated by taking x minus the mean and dividing it by the standard deviation. For you, the x is the salary
offered to you of $65,000. The average salary is $54,030, and the standard deviation is $8,600. This means you received an
offer which is 1.27 standard deviations above the mean. How can we tell where we stand in comparison to the
rest of the people getting such offers? Is the salary in the top 70%? Top 99%? What exactly could it be?
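The z-score arithmetic for this offer can be written out as a short sketch. The function name z_score is made up for this example, and, as in the lecture, the published median of $54,030 is used in place of the mean.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Offer of $65,000, center of $54,030, standard deviation of $8,600.
z = z_score(65_000, 54_030, 8_600)
print(f"{z:.4f}")  # the lecture reports this value as 1.27
```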
The slide shows a symmetric bell-shaped curve with a mean of 2000. The y-axis is the density, which ranges
from 0 to 0.004, and the x-axis is the value, which ranges from 1600 to 2400. The peak of the curve is at 0.004 in
the y-axis. The area under the curve has been split into three parts. The central part, which is one standard
deviation from the mean is labeled in red with a 68% of the data. The second part is two standard deviations from
the mean, both at the left and right of the red part, and are labeled in green with a 95% of the data. The third part
is three standard deviations from the mean, both at the left and right of the green part, and are labeled in blue
with a 99.7% of the data.
Transcript
Now let me share with you some very useful properties of a bell curve. For large data sets, we often observe that
many values cluster around the mean or the median, so that if we create a histogram of the data, we get a
distribution that resembles a bell shape, a symmetrical curve. When this happens, according to the empirical
rule, 68% of all observations will fall within one standard deviation of the mean, 95% will fall within two standard
deviations, and 99.7% of all observations will fall within three standard deviations. Knowing this rule of one
standard deviation, two standard deviations, and three standard deviations, known as the empirical rule, will always be
a handy way of figuring out where an observation of interest falls in comparison to the mean or the median.
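The three percentages of the empirical rule can be reproduced from the normal curve itself. The sketch below assumes Python's statistics.NormalDist as the calculation tool; it is an illustration of the rule, not part of the lecture's Excel workflow.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Share of observations within k standard deviations of the mean.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in within.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

The exact shares are slightly different from the rounded 68%, 95%, and 99.7%, which is why the rule is best treated as a quick approximation.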
Z-score Example (1 of 2) - Slide 47
You get an offer of $65,000 – this is above the median, but how does this salary stack up compared to others?
This slide shows the same bell-shaped curve as Slide 46 Empirical Rule. In addition, there is a vertical line
1.27 standard deviations to the right of the mean with the label +1.27.
Transcript
So, given a specific observation, the offer we received, and knowing something about this type of job starting
salaries, we were able to find our relative place in the group. The z-score, when positive, tells you that the value
is above the mean, and when it's negative, it's below the mean. Furthermore, it tells you how many standard
deviations you are above or below the mean. So, the salary offer is between one and two standard deviations
above the mean, roughly around here. So certainly, this salary falls somewhere between the 70th and 95th
percentile, not bad. In a later lesson, we will learn how to find the exact value, but for now, we can use the empirical
rule as a rough estimate. And this can be very important because many times you are in a meeting where you
are being shown a lot of statistics, and if you want to use your judgment to evaluate the validity of what is being
recommended or ask follow-up questions and don't have access to a computer or calculator, using this
understanding about being one, two, three standard deviations away from the mean and the probabilities can
provide you with a quick insight.
Z-score Example (2 of 2) - Slide 48
You get an offer of $65,000 – this is above the median, but how does this salary stack up compared to others?
In this slide there is an image of a bell-shaped curve for the annual salary data. In the curve, the 10th percentile of the
salary distribution is at $43,406, the 25th percentile is at $48,469, the median is $54,030, the 75th percentile
is at $61,346, and the 90th percentile is at $68,008. This last value is
highlighted.
Transcript
The graph you see here is from payscale.com for the salary of a business analyst. It is showing the values for the
10th percentile, 25th percentile, median, 75th percentile, and the 90th percentile. It looks as though the offered
salary for you is closer to the 90th percentile. Calculating the z-value of an observation and knowing about the
empirical rule helps us determine what percent of the data is below the value we are examining.
The slide shows a picture of a car dealership with a bell-shaped curve inside it. The left side of the curve is
labeled as "Great price", the peak of the curve is labeled as "Good price", and the right side of the curve is
labeled as "Above market." The price estimate for the TrueCar is $21,125.
Transcript
The TrueCar commercial addresses the question of how you know if you are paying a fair value for a car you're
about to purchase. In their commercial, they do this by showing a visual of the range of prices people in your
area have paid for the same car. The mean is at the good market value, and the tails of the curve show prices below
and above the market value. Of course, you would be happy about your purchase if your purchase price falls
on the left of the curve, below the mean and toward the good buy.
The slide shows a car value histogram overlapped with a bell-shaped curve. The true car value is highlighted in
the picture as $27,513. The area to the left of this value ($27,513) is highlighted in yellow as "Exceptional price
less than $27,545." The average paid is $29,510. The area between $27,545 and $29,272 is labeled as "Great
price less than $29,272." The factory invoice is $30,494. The area after this value ($30,494) is labeled as "Above
market $30,505 or more." The rightmost point of the curve is the MSRP which is $32,904 and has a price
certainty of 97.69%.
Transcript
As an example, I have asked for a price of a new Camry XLE V6 near my office area, where I'm likely to go
shopping. This is what I get. They have data on 201 sales in my area for the type of the car I'm interested in. The
bar graph is the actual histogram for price ranges. Then, a normal curve is super-imposed on this. The mean of
the curve appears to be at the factory invoice of $30,494. Looking at this, most people are paying less than this
value. Maybe this is to make sure that everyone leaves the lot happy about the bargain they received. But the
actual sold price data shows the average selling price to be around $29,000. Furthermore, you see the use of the
empirical rule, without it being mentioned, of course, to create the categories of good price, great price, above
market, and exceptional value. I'm showing you this graph to illustrate the fact that oftentimes in business, the
absolute values, such as the price you paid for your car, have a lot less meaning than how the price you paid
for a car stacks up against others. This is about the relative position of the value of interest. Again, here that would
be the position of the price you paid for your car within the entire data set available. When your purchase price is about two
standard deviations below the mean, then you have gotten an exceptional bargain compared to the rest of the people
who purchased the same type of car.
Let's Practice - Slide 51
The slide shows an image of the fuel economy and environment website with information about the mile per
gallon for a vehicle. In fuel economy, the MPG for the vehicle is 26 with 3.8 gallons per 100 miles. The range for
small SUVs is from 16 to 32 MPG. This information is highlighted in the slide. In addition, the annual fuel
cost is $2,150 and the savings in fuel costs over 5 years are $1,850 compared to the average new vehicle.
Transcript
Staying with the car example, the car you buy has this sticker on its window. This is a fuel economy label. Every
new car sold in the United States is required to have a fuel economy label, which shows the miles per gallon
estimates. This is meant to help consumers compare and shop for vehicles. So, if the best in this category is 32
miles per gallon, that is the 99th percentile, and the worst mileage is 16 miles per gallon, then how does the car you're
considering buying, with 26 miles per gallon, stack up compared to the rest of the vehicles in this class? This car
has an average of 26 miles per gallon, where the range is between 16 and 32.
The slide shows the same information presented in Slide 51 Let's Practice. Here, the information is given to
calculate the mean, standard deviation and z-score.
std dev.: (32 − 16) ÷ 6 = 2.67
Transcript
So, the midpoint is a rough point estimate of the average for the cars in this category, and that's about 24 miles
per gallon. We can also get a rough estimate for the standard deviation based on the empirical rule. That is, 99.7%
of all observations will be within plus or minus three standard deviations. Thus, a rough estimate for the standard
deviation is the distance between 32 and 16 divided by 6. Remember, we have three standard deviations on both
sides of the mean for a total of 6. Using this, we will get a standard deviation of 2.67 miles per gallon. So, this
particular car that you're considering is above average and is roughly 0.75 standard deviations above the mean.
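The three rough estimates above (midpoint, standard deviation, and z-score) can be written out step by step. This sketch follows the lecture's shortcut and assumes the 16 to 32 MPG range spans about six standard deviations; the variable names are made up for this example.

```python
lowest_mpg, highest_mpg = 16, 32   # range shown on the fuel economy label
car_mpg = 26                       # the car being considered

# Rough estimate of the mean: the midpoint of the range.
mean_estimate = (lowest_mpg + highest_mpg) / 2

# Rough estimate of the standard deviation: the range covers about
# three standard deviations on each side of the mean, six in total.
sd_estimate = (highest_mpg - lowest_mpg) / 6

z = (car_mpg - mean_estimate) / sd_estimate
print(f"mean = {mean_estimate}, std dev = {sd_estimate:.2f}, z = {z:.2f}")
```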
The slide shows the density plot for the standard normal distribution. The area within one standard deviation
of the mean is highlighted in red and represents 68% of the total area. The rest of the area is highlighted in
blue and represents 16% of the total area on each side of the curve. In addition, there is a vertical line at 0.75
with the label z = 0.75.
Transcript
Now, remember that 68% of observations fall within one standard deviation, shown here in red. The remaining
32% of the observations are distributed symmetrically outside of plus and minus one standard deviation,
shown in blue. So, a very rough estimate would be that the car we are considering buying is close to the 80th
percentile. If it were one standard deviation above the mean, 68% plus 16% would have given you the 84th
percentile, so I went a little less than that since we are only 0.75 standard deviations from the mean. So, as you can
see, knowing the z-score can help us determine how to place a particular observation within a population. To
most of us, when making a decision, knowing this placement, which in everyday language is sometimes called the
percentile, is much more meaningful than the actual value, such as miles per gallon in this case.
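For comparison with the rough estimate, the share of a normal curve below a given z-score can be computed directly. The sketch below assumes that fuel economy in this class is approximately normally distributed and uses Python's statistics.NormalDist; neither assumption comes from the lecture.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

share_below_car = std_normal.cdf(0.75)     # z-score of the 26 MPG car
share_below_one_sd = std_normal.cdf(1.0)   # reference point: one standard deviation

print(f"z = 0.75: about {share_below_car:.1%} of vehicles below")
print(f"z = 1.00: about {share_below_one_sd:.1%} of vehicles below")
```

Under these assumptions the car lands a little below the 80th percentile, in line with the reasoning that 0.75 standard deviations should give somewhat less than the 84th percentile.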
Transcript
I talked about this example in my video lecture where you are getting an offer for a job which is $65,000 and you
want to know how well your salary compares to others who are receiving similar offers from other employers. So,
you go to salary.com and you find that the median for this job is listed at $54,030 with a standard deviation of
$8,600.
You get an offer of $65,000 – this is above the median, but how does this salary stack up compared to others?
Z = (65,000 − 54,030) ÷ 8,600 = 1.27
Transcript
So, in order for you to know where you fall, I talked about the fact that you can come up with your z-score. And we
calculated the z-score for your data to be 1.27. So, you're 1.27 standard deviations to the right of the mean, and then finally I said that, looking
at the empirical rule, we can basically find out if you're at 68%, if you're one standard deviation away, but you're
more than one standard deviation away; you're 1.27 standard deviations away.
Graph 1 - Slide 56
The slide shows a bell-shaped curve, where the x-axis is the value that ranges from 1600 to 2400, and the y-axis
is the density that ranges from 0 to 0.004. The mean of the curve is 2000. In addition, the area of one standard
deviation from the mean is highlighted in red and represents 68% of the total area.
Transcript
Graph 2 - Slide 57
The slide shows a bell-shaped curve, where the x-axis is the value that ranges from 1600 to 2400, and the y-axis
is the density that ranges from 0 to 0.004. The mean of the curve is 2000. In addition, the area of two standard
deviations from the mean is highlighted in green and represents the 95% of the total area.
Transcript
So, you're somewhere between 1 and 2 standard deviations away. So, you're trying to estimate where you're
falling. If you were 2 standard deviations away, you would be at the top 95%, but you're at 1.27.
The slide shows a new Excel sheet that demonstrates how to return the normal distribution for the
specified mean and standard deviation. Once =NOR is typed in cell D1, the program automatically lists eight
options, including NORM.DIST, NORM.S.DIST, and NORM.S.INV, among others. The first option (NORM.DIST)
is selected in order to find the normal distribution of the data.
Transcript
So, in Excel, we can use the normal distribution function to find out exactly what percentage of the people will be
below you. So, what percentile you are at represents what percentage of the population would have a salary offer
which is less than what you are receiving. So, to use the normal distribution, I can just start writing norm and you
can see Excel again starts to guess what I might want. And it has basically four functions that it returns. The first
one is NORM.DIST. This one returns the normal distribution for a specified mean and standard deviation. If you
use the inverse, you can give the probability and it will tell you where the value is. And then there are two
functions that have the dot S in the middle. When we have dot S, as in NORM.S.DIST or NORM.S.INV, it
means that it is going to use the standard normal distribution. And remember the difference here is that in a
standard normal distribution, the mean is set at zero and the standard deviation is set at one. And I will
demonstrate this over and over again so you should get more comfortable with it. For now, we're going to use
NORM.DIST.
Excel Sheet (2 of 8) - Slide 59
After choosing =NORM.DIST as the function to apply, the cursor changes and becomes cross-shaped, which
indicates that data should be selected for the function. The syntax of the function denotes four parameters: x,
distribution mean, standard deviation, and whether the distribution is cumulative or not.
Transcript
The slide shows the =NORM.DIST function with some input fields filled out. Here, the x value is 65000, the mean
is 54030, and the standard deviation is 8600. The last needed value, cumulative, is waiting to be specified. In the
syntax described in the previous slide, the cumulative portion of the function unfold two options: "True -
cumulative distribution function" and "False - probability mass function." The cursor is selecting the "True" option.
Transcript
Excel Sheet (4 of 8) - Slide 61
The slide shows the complete =NORM.DIST function, where the x value is 65000, the mean is 54030, the
standard deviation is 8600, and the cumulative is 1.
Transcript
If I tab on it, you will see that it shows me the arguments. The first argument it asks for is, what is your
x? X is your random variable, the salary offer. And in your case specifically, it is 65,000. Then, it's the
mean of the distribution. So, according to salary.com, the mean for this job offer was 54,030 and it had a standard
deviation of 8,600. And the last argument asks if you want cumulative or not. So, in our case we are always
looking for the cumulative distribution. You're not ever going to use the probability mass function in this class. So, you
can either type in true or, what I like to do, type in one. One is always translated to true, zero is translated to
false. So, our last argument is just one. Close the parenthesis and press return. And what it says is that you are at
0.89. So, you're almost at the 90th percentile. Not bad, right?
The slide shows the outcome of the =NORM.DIST function, which is 0.898948. The screenshot also shows a
bell-shaped curve with a mean of 54030 and the first 90% of the area shaded. The value returned by the function corresponds to
the exact proportion of the shaded area.
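The same cumulative probability can be reproduced outside of Excel. The sketch below assumes Python's statistics.NormalDist, whose cdf method plays the role of NORM.DIST with the cumulative argument set to TRUE; it is a cross-check, not the lecture's method.

```python
from statistics import NormalDist

# Same inputs as =NORM.DIST(65000, 54030, 8600, 1) on the slide.
salary_distribution = NormalDist(mu=54_030, sigma=8_600)

# cdf(x) is the cumulative probability P(X <= x), i.e. the shaded area.
percentile = salary_distribution.cdf(65_000)
print(f"{percentile:.6f}")  # the slide reports 0.898948
```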
Transcript
So, specifically what it says is this. If I draw a normal distribution for you, and just assume that this is
symmetrical, I'm not very good at drawing, the average here is at 54,030. This is based on what we got out of
salary.com. The shape of this distribution is being controlled by that 8,600 standard deviation. And if you
roughly think about it, this is one standard deviation away on either side, two standard deviations on either side,
and three standard deviations on either side, and you're somewhere at 1.27, so I would say roughly here, so your
salary puts you above all these other people. So, this is basically about 90%. Okay, so you're at the 90th
percentile.
The slide shows a new cell that uses the function =NORM.S.INV. Once written =NORM.S.INV the program
automatically shows the syntax to get this value, which is "=NORM.S.INV(probability)."
Transcript
The slide intends to show how to select the input of the =NORM.S.INV function. Here, the normal distribution cell
(0.898948) is selected.
Transcript
The slide shows the outcome of the =NORM.S.INV function, which is 1.275581.
Transcript
If I wanted to know what the z was, the best way to do it is by again typing norm. And this time I'm going
to use dot S because the z of 1.27 is about how many standard deviations away you are based on a standard
normal table. So, if I take this, it only has one argument. It asks, what is the probability you're looking for? Well,
the probability I'm looking for is sitting right here. So, I click on it and I press return. And it returns 1.27, which is
about what we had calculated using the equation. So, in this case, really, at this point we can use Excel directly to
come up with the answer that is of most interest to us. And in this case, it happens to be 0.89, or about 90%.
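Going from a probability back to a z-score can be sketched the same way. Here Python's statistics.NormalDist().inv_cdf is assumed as the counterpart of NORM.S.INV, and the probability is the value shown on the earlier slide.

```python
from statistics import NormalDist

probability = 0.898948  # output of NORM.DIST shown on the earlier slide

# inv_cdf on the standard normal returns the z-score that has
# the given cumulative probability to its left.
z = NormalDist().inv_cdf(probability)
print(f"{z:.4f}")  # the slide reports 1.275581
```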
Lesson 2-4 Discrete and Continuous Random Variables
Transcript
In statistics, you often hear the term random variable. A random variable really just represents the answer to the
question you're asking. For example, how much will my stock value change in a year? This value, which is the
answer, is a random variable. In statistics we use x to represent the random variable. Now, why is it random?
Because the process is random. We don't fully understand why stock prices go up or down by the amounts that
they do. Thus, the answer to the question posed is random.
Types of Random Variables (1 of 2) - Slide 67
A discrete random variable is one which may take on only a finite number of distinct values.
Transcript
There are two types of random variables. A discrete random variable is one which may take only a finite number
of distinct values. For example, number of children in a family, number of people taking this course, number of
customers who rated the service as satisfactory.
A continuous random variable is one which takes on an infinite number of possible values in a defined interval.
Transcript
The second type of random variable is a continuous random variable, which is a variable that takes on an infinite
number of possible values. A continuous random variable is not defined at a specific value. Instead, it's defined
over an interval of values. For example, the weight of a soda can. In reality, if you weighed each soda can and
recorded its true value, you would always get varying numbers. However, you may expect the actual weight of
a can to be between 11.5 ounces and 12.5 ounces. The life of a light bulb is another example. If we record the exact timing we
get, many possible values will come out. But one might say that the life of a properly functioning light bulb is
between 100 and 1,000 hours. What separates continuous random variables from discrete ones is that they're
uncountably infinite. They have too many possibilities to list or to count, and they can be measured with a high
level of precision.
Transcript
So, now let's practice. Identify the following random variables as discrete or continuous. Daily return on a stock,
number of customers waiting in line, time spent waiting to talk to a customer service agent, number of calories in
a chocolate bar.
Let's Practice - Solved - Slide 70
Transcript
Daily return on a stock is a continuous variable. Number of customers waiting in line is a discrete variable. Time
spent waiting to talk to a customer service agent is a continuous variable. Number of calories in a chocolate bar
is a continuous variable, which may get recorded as discrete. And that's something that often occurs because we
may not be interested in absolute accuracy and thus round the numbers.
For a discrete random variable, the probability distribution will be a list of probabilities associated with its
possible value.
Probability distributions - Slide 71
Number of siblings No. of respondents
0 3
1 6
2 5
3 4
4 2
Transcript
For a discrete random variable, the probability distribution will be a list of probabilities associated with its possible
values. We actually did this when we learned about histograms and relative frequencies. What we were
displaying was the probability distribution. For example, let's assume we ask 20 people how many siblings they
have. So, here the random variable is number of siblings. This is what we recorded. The 20 people we talked to
had zero to four siblings, thus these are the possible outcomes for our random variable. The second column
shows how many responded with each possible outcome. For example, three people had zero siblings and six
people had one and so on. Just as we did in constructing a histogram, we can develop the relative frequency of
each possible answer or outcome. And this is what we get.
X: Number of Siblings
Probability distributions - Slide 72
Outcomes Probability
0 3÷20 = 0.15
1 6÷20 = 0.30
2 5÷20 = 0.25
3 4÷20 = 0.20
4 2÷20 = 0.10
The slide shows a bar graph with the corresponding values of the table. The x-axis is the number of siblings that
ranges from 0 to 4, and the y-axis is the probability that ranges from 0 to 0.35. In the graph, there are five bars
that show the table outcomes.
Transcript
As shown here, the probability distribution of a discrete random variable is a table, graph or formula that gives
the probability associated with each possible value that the variable can assume.
Probability distributions - Slide 73
Outcomes Probability
0 3÷20 = 0.15
1 6÷20 = 0.30
2 5÷20 = 0.25
3 4÷20 = 0.20
4 2÷20 = 0.10
X: Number of Siblings
P(X = 0) = 0.15
∑ p(x) = 1, summed over all x
Transcript
A random variable is represented by x and its associated probability by p sub x. For example, in our example of 20
people, if you select one person at random, the probability of that person having no siblings is 15%, and the probability
of finding a person who has three siblings is 20%. The properties of a discrete probability distribution are that the probability
of each outcome is greater than or equal to zero, and the probabilities of all possible outcomes sum to one.
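The step from survey counts to a probability distribution, along with the two properties just stated, can be sketched as follows. The variable names are made up for this example; the counts are the ones recorded for the 20 respondents.

```python
# Survey counts from the slide: number of siblings -> number of respondents.
respondents = {0: 3, 1: 6, 2: 5, 3: 4, 4: 2}
total = sum(respondents.values())  # 20 people were asked

# Relative frequency of each outcome serves as its probability p(x).
probabilities = {x: count / total for x, count in respondents.items()}

for x, p in probabilities.items():
    print(f"P(X = {x}) = {p:.2f}")

# The two properties of a discrete probability distribution.
all_non_negative = all(p >= 0 for p in probabilities.values())
total_probability = sum(probabilities.values())
print(all_non_negative, round(total_probability, 10))
```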
The cumulative distribution function gives the probability that the random variable X is less than or equal
to x.
P(X ≤ x) = ?
P(X ≤ 2) = ?
Probability distributions - Slide 74
Outcomes Probability
0 3÷20 = 0.15
1 6÷20 = 0.30
2 5÷20 = 0.25
3 4÷20 = 0.20
4 2÷20 = 0.10
Outcomes 0, 1 and 2 with their respective probabilities are highlighted.
Transcript
All random variables, discrete and continuous, have a cumulative distribution function, which shows the probability
that the random variable X is less than or equal to some value. We denote that value by the small x.
For instance, the probability of picking someone who has two or fewer siblings is written as the
probability of X less than or equal to two. We calculate the cumulative distribution function for discrete random
variables by summing up the probabilities: adding 0.15, 0.30, and 0.25 gives 0.70.
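The same table can be built programmatically. The following is a minimal Python sketch, not part of the original lesson (which uses Excel); the variable names are illustrative assumptions. It converts the survey counts into probabilities and then accumulates them to get P(X ≤ x).

```python
from itertools import accumulate

# Survey counts from the example: number of siblings -> number of respondents
counts = {0: 3, 1: 6, 2: 5, 3: 4, 4: 2}
total = sum(counts.values())  # 20 respondents

# Relative frequency of each outcome: P(X = x)
pmf = {x: n / total for x, n in counts.items()}

# Running sum of the probabilities: P(X <= x)
cdf = dict(zip(pmf, accumulate(pmf.values())))

for x in pmf:
    print(f"x={x}  P(X=x)={pmf[x]:.2f}  P(X<=x)={cdf[x]:.2f}")
```

The row for x = 2 shows the cumulative probability of 0.70 discussed in the transcript.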
Cumulative distribution
Number of Siblings Probability: P(X = x) Cumulative Probability: P(X ≤ x)
0 0.15 0.15
1 0.30 0.45
2 0.25 0.70
3 0.20 0.90
4 0.10 1.00
The slide shows a cumulative probability plot in the shape of stairs. The x-axis is the number of siblings that
ranges from 0 to 4, and the y-axis is the cumulative probability that ranges from 0 to 1. As the number of siblings
increases, the cumulative probability also increases. For example, the probability of having 1 or fewer siblings is
0.45 and the probability of having 2 or fewer siblings is 0.70.
Transcript
In this table, the third column represents the cumulative probability of the random variable, number of siblings,
being less than or equal to some particular value, again denoted by small x. In our sample of 20, the probability
that the person we pick will have four or fewer siblings is one. The probability histogram for the cumulative
distribution of this random variable looks like this. As you can see, the probability that a person in our sample
has four or fewer siblings is 100 percent.
$E(X) = \mu = x_1 p(x_1) + x_2 p(x_2) + \dots + x_k p(x_k)$, or: $E(X) = \mu = \sum_{\text{all } x} x\,p(x)$
Expected value
Outcomes Probability
0 3÷20 = 0.15
1 6÷20 = 0.30
2 5÷20 = 0.25
3 4÷20 = 0.20
4 2÷20 = 0.10
Outcomes 1 and 2 with their respective probabilities are highlighted.
Transcript
Shown here is how we calculate the mean, also known as the expected value, of the discrete random variable X.
In our example, that would be 1.8. Please take a moment and see where the numbers are coming from. 1.8
siblings is the value expected to occur on average in the long run. For instance, if we thought that the 20
people whom we talked to were a good representation of a population, then we would expect that, on average, the
members of this population have 1.8 siblings, which makes sense for this kind of data. By looking
at the probabilities, we see that 55% of respondents have either one or two siblings.
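To see where the 1.8 comes from, the sum of each outcome times its probability can be written out directly. This is a short Python sketch added for illustration (the lesson itself works in Excel); the list names are assumptions.

```python
# Outcomes and probabilities from the siblings table
outcomes = [0, 1, 2, 3, 4]
probabilities = [0.15, 0.30, 0.25, 0.20, 0.10]

# E(X) = sum of x * p(x) over all outcomes
expected_value = sum(x * p for x, p in zip(outcomes, probabilities))

print(round(expected_value, 2))  # 1.8
```

Term by term, the sum is 0 + 0.30 + 0.50 + 0.60 + 0.40 = 1.8.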
A bank manager wants to get a sense of how many people are waiting to see a teller and conducts a study and
finds the following pattern:
Waiting time of customers
Number of Customers Waiting in Line No. of Times Observed
0 3
1 10
2 8
3 5
4 3
5 2
6 1
What is the expected number of customers waiting, and what is the probability of finding 4 or more customers in
line?
Transcript
So, now let's practice. A bank manager wants to get a sense of how many people are waiting to see a teller,
conducts a study, and finds the pattern shown here in this table. For example, in the study, where they
took random observations, three times there was no one in line (zero customers), 10 times there was one
customer waiting, and so on. Based on this data, what is the expected number of customers waiting? And what
is the probability of finding four or more customers in line?
Waiting time of customers - Solved
Number of Customers Waiting in Line No. of Times Observed Probability
0 3 3÷32 = 0.094
1 10 10÷32 = 0.313
2 8 0.25
3 5 0.156
4 3 0.094
5 2 0.063
6 1 0.031
P (x ≥ 4) = P (x = 4) + P (x = 5) + P (x = 6)
Transcript
To answer this question, we first need to calculate the probability of finding x customers in line. Based
on our total of 32 observations, three times the line was empty. Thus, the probability of finding an empty line is
about 0.094. The probability of finding one customer in line is 0.313, and so on. Once we have the
probabilities, the mean, or the expected number of customers in line, is simply the sum of each value of x
(the number of customers in line) times its probability, which
gives an answer of 2.16 customers in line on average. The probability of finding four or more customers
in line is the probability of four plus the probability of five plus the probability of six, which is 0.094
plus 0.063 plus 0.031, or 0.188. There is a 0.188 chance of finding four or more customers in line.
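The bank example can be checked with a few lines of Python. This is a sketch for illustration only, assuming the observation counts from the table; the names are not from the lesson.

```python
# Number of customers in line -> number of times observed (from the table)
observations = {0: 3, 1: 10, 2: 8, 3: 5, 4: 3, 5: 2, 6: 1}
total = sum(observations.values())  # 32 observations

# Probability of each queue length
pmf = {x: n / total for x, n in observations.items()}

# Expected number of customers: sum of x * p(x)
expected_waiting = sum(x * p for x, p in pmf.items())

# P(X >= 4) = P(X = 4) + P(X = 5) + P(X = 6)
p_four_or_more = sum(p for x, p in pmf.items() if x >= 4)

print(f"{expected_waiting:.2f}")  # 2.16
print(f"{p_four_or_more:.4f}")    # 0.1875
```

The unrounded probability is 6 ÷ 32 = 0.1875, which matches the 0.188 obtained by adding the rounded table values.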
$\sigma = \sqrt{(x_1 - \mu)^2 p(x_1) + (x_2 - \mu)^2 p(x_2) + \dots + (x_k - \mu)^2 p(x_k)}$
The slide shows the bar graph provided in Slide 72 Probability Distributions (2 of 3). The x-axis is the number of
siblings that ranges from 0 to 4, and the y-axis is the probability that ranges from 0 to 0.35. In the graph, for 0
siblings the probability is 0.15, for 1 sibling the probability is 0.3, for 2 siblings the probability is 0.25, for 3 siblings
the probability is 0.2, and for 4 siblings the probability is 0.1.
Transcript
Standard Deviation (2 of 2) - Slide 80
$\sigma = \sqrt{(x_1 - \mu)^2 p(x_1) + (x_2 - \mu)^2 p(x_2) + \dots + (x_k - \mu)^2 p(x_k)}$
$\sigma = \sqrt{(0 - 1.8)^2 \times 0.15 + (1 - 1.8)^2 \times 0.30 + (2 - 1.8)^2 \times 0.25 + (3 - 1.8)^2 \times 0.20 + (4 - 1.8)^2 \times 0.10}$
$\sigma = \sqrt{1.46} = 1.21$
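The arithmetic behind this result can be reproduced with a short Python sketch. It is an illustration added here, not the lesson's own method (the lesson uses Excel), and the variable names are assumptions.

```python
import math

# Outcomes and probabilities from the siblings table
outcomes = [0, 1, 2, 3, 4]
probabilities = [0.15, 0.30, 0.25, 0.20, 0.10]

# Mean: sum of x * p(x)
mu = sum(x * p for x, p in zip(outcomes, probabilities))

# Variance: sum of (x - mu)^2 * p(x); the standard deviation is its square root
variance = sum((x - mu) ** 2 * p for x, p in zip(outcomes, probabilities))
sigma = math.sqrt(variance)

print(f"mu={mu:.2f}  variance={variance:.2f}  sigma={sigma:.2f}")
```

The five weighted squared differences are 0.486, 0.192, 0.010, 0.288, and 0.484, which add up to 1.46.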
Standard deviation
Outcomes Probability
0 3÷20 = 0.15
1 6÷20 = 0.30
2 5÷20 = 0.25
3 4÷20 = 0.20
4 2÷20 = 0.10
Transcript
As before, the mean or the expected value is the central tendency of the distribution. This may or may not be a
very typical value. One way of knowing how typical the mean is, is to examine the variability in the distribution.
For a discrete variable, such as in this example, the standard deviation is calculated by taking the difference
between each possible outcome and the mean, squaring that difference, and then multiplying it by the probability
that that particular outcome occurs. We do this for all possible outcomes and take the square root of the overall
sum to get the standard deviation. So, in our example, where the mean was 1.8, the standard deviation is 1.21
siblings. So, now that we have gone through this example, you may be asking yourself, why do I need to know
this?
Transcript
First, let me reiterate that the probability distribution for a discrete random variable is just the table that links
each outcome with its probability of occurrence. In most cases, when there are numerous possible
outcomes, we may not want to develop this table. Luckily, there are many well-known probability distributions, and
once we realize that our particular study or experiment belongs to one of these well-known distributions, we
can use the proper equations to calculate the probability of a particular outcome. In this course, we will not cover
these distributions. But since our focus in this course is on large data sets and taking samples from them, we
will learn about the normal distribution for calculating probabilities, and how this distribution can be used for both
discrete and continuous variables.
Excel Sheet (1 of 14) - Slide 82
The slide shows the Expected values data set. The data set contains a table with two columns: the daily demand,
which ranges from 1 to 20, and the occurrence of demand. As an example, the daily demand of 20 has an
occurrence of 12, which means that there are in total 12 days when the product has a daily demand of 20.
On the excel sheet, adjacent to the table the following text is displayed:
The manager of a convenience store wants to understand the demand for one of the products she sells, so that
she can make a better decision on how many to stock every day. She has the following data gathered over
several weeks. Based on this data, what is the value of expected demand and its standard deviation?
Download the Expected Value excel file (Refer to Problem statement - First worksheet)
Transcript
In this video, I'm going to show you how to use Excel to calculate the expected value and standard deviation
for discrete random variables, and this is the example I'm using. The manager of a store wants to understand the
demand for one of the products she sells so that she can make a better decision on how many to stock every
day. She has the following data gathered over several weeks. Based on this data, what is the value of expected
demand and its standard deviation? So, what we have is a table showing that the demand has ranged from 1
to 20, and we can see, over the time that she studied, how many times demand for this item was just one unit.
So, 3 times it was only 1, 10 times it was 2, and so on. So, let me go to the next tab, where I only have the
data, and show you how you will do it.
Excel Sheet (2 of 14) - Slide 83
The slide shows a new column called "Probability" placed next to the Occurrences column. To the right of the
table, a cell called "Sum" is created. The slide shows how to use the Excel sum function by writing =SUM in the
cell next to the "Sum" cell. Once =SUM is written, the program automatically shows the syntax to get the sum of the
Occurrences column values, which is “=SUM(number1, number2, ...).” The cursor is selecting the first value cell of
the Occurrences column. The slide intends to show how to select the whole column by holding Ctrl + Shift + the
down arrow.
Transcript
The first thing I need to do is find the probability of each. So, I'm going to create a column here for that. And in
order for me to answer this question, I have to know how many times a daily demand of one occurred over this total
period. So, what I need to find out first of all is the sum of all the occurrences. The sum of all occurrences is just
the sum of the column right here. So, I'm going to hold Control and Shift to select the column, close the
parentheses, and we have a total of 140 occurrences, right?
The slide shows the outcome of the sum function, which is 140. Here, the slide intends to start filling out the
probability column. To calculate the probability, use the first value cell of the Occurrences column and divide it by
the sum value. Press F4 to lock the reference to the sum cell. The formula for the first probability value is
=B2/$I$1, where B2 is 3 (first value of the Occurrences column) and I1 is 140 (sum value).
Transcript
So, then this is simply 3 out of 140, since 3 of the 140 times the daily demand was 1. So, I can take B2 divided by
140. Now this 140 is going to be a value that I will use for every daily demand. So, I'm going to use the same thing
in the denominator when the daily demand is 2, when the daily demand is 3, and so on. To make it easy for me to
copy and paste, this 140 should not change. So, what I can do is press the function key F4, and you will see that
dollar signs appear before the I and before the 1. That means that that cell is going to be locked.
The slide shows the outcome of the first probability value, which is 0.0214. This means that a daily demand of
1 occurred 3 times, with a probability of 0.0214. Additionally, the slide intends to show an easier way to fill
out the probability column instead of filling it manually. The way to do it is to select the first created cell and drag
down the fill handle (shown as a ‘+’ sign) until reaching the desired value. You will find that Excel fills out the cells
automatically.
Transcript
So, when I press return and copy this down, you will see that in the next cell, B3 is being
divided by the same number, 140, so 140 stays the same as I copy this.
Excel Sheet (5 of 14) - Slide 86
The slide shows the second probability value with the formula =B3/$I$1, where B3 is 10 (second value of
occurrences for daily demand), and I1 is 140 (total occurrences). This last value is locked in the formula. The
cursor is showing an easier way to fill out the probability column instead of filling it manually. The way to do it is
to select this second cell and drag down the fill handle (shown as a ‘+’ sign) until reaching the desired value. You
will find that Excel fills out the cells automatically.
Transcript
So now that I have that, the easiest way to copy it is to put your cursor in the corner of the cell. When you see the
plus sign appear, just double-click and it will fill the whole column for you.
Excel Sheet (7 of 14) - Slide 88
The slide shows the bottom of the Expected values data set. In addition, the screenshot shows the sum function
at the end of the probability column. To use this function write =SUM above the Probability column. Once written
=SUM the program automatically shows the syntax to get the sum of the Probability column values, which is
“=SUM(number1, number2, ...).” The intention is to check if the sum of all the probabilities is equal to one by
selecting the whole probability column.
Transcript
You want to make doubly sure that you have done nothing wrong. So, remember, all probabilities should sum
up to 1. So, if I sum this column, I should get a value of 1. Otherwise, I have left something out, right? So,
I'm going to double-check: I pick the entire column and, sure enough, it adds up to 1. So, I have accounted for
all the probabilities.
The slide intends to show how to use the Excel sumproduct function by writing =SUMPRODUCT in a cell to the
right of the probability column. Once written =SUMPRODUCT, the program automatically shows the syntax to
sum the products of two or more columns, which is “=SUMPRODUCT(array 1, [array 2], [array 3]...).” In this
case, the Daily Demand and Probability columns are selected and the sumproduct function appears as
"=SUMPRODUCT(A2:A21, C2:C21)."
Transcript
The slide shows a new cell called "Expected value (mean)" placed a few cells below the sum cell. Here, the
expected value formula is shown: E(X) = μ = x1 p(x1 ) + x2 p(x2 ) + ... + xk p(xk ).
Transcript
Now that I have done this, the expected value is simply the first daily demand multiplied by its probability of
occurrence, added to the next one multiplied by its probability, and so on. So, what we want to do is take
the values from this column and, one by one, multiply them by the corresponding value in the
probability column, and then sum as we go along. This is something Excel does for us automatically. It is
known as SUMPRODUCT. SUMPRODUCT takes two arrays, and in this case our first array is our daily demand, so
I'm going to pick all the values that show up in our daily demands and scroll back up, and the second array is the
probability of each of these daily demands occurring. So, I will do the same thing, close the parentheses, and press
return. So, therefore, our expected value, the mean demand, is 11.49, but this is only part of the answer.
The slide shows the outcome of the expected value, which is 11.49. In addition, a new column called "(x-mu)^2"
is created and placed next to the Probability column. Here, the slide intends to calculate the standard deviation
by using the following formula: $\sigma = \sqrt{(x_1 - \mu)^2 \times p(x_1) + (x_2 - \mu)^2 \times p(x_2) + \dots + (x_k - \mu)^2 \times p(x_k)}$
In addition, the slide intends to start filling out the (x-mu)^2 column. To calculate (x-mu)^2, take the first
value cell of the Daily Demand column, subtract the expected value (mean) from it, and square the result. The
formula for the first (x-mu)^2 value is =(A2-$I$4)^2, where A2 is 1 (first value of the Daily Demand column) and
$I$4 is 11.49 (mean).
Transcript
What is the variability of demand that we experience? For that we need to calculate the standard deviation. This
is the equation for the standard deviation. So, what I need for the standard deviation is first of all to calculate,
for each daily demand, X minus μ squared, and then multiply it by its probability. Let me start by
creating a column here that I will call X minus μ raised to the power of 2. So, for each
row, this would be this value minus the expected value, or the mean, and the mean is something that I'm going to
use as I go along. So, I'm going to lock its location by pressing F4 and then raise this to the power of 2.
The slide shows the outcome of the first (x-mu)^2 value, which is 109.9502041. In addition, the slide intends to
show how to fill out the rest of the (x-mu)^2 column. The second (x-mu)^2 value has
the formula =(A3-$I$4)^2, where A3 is 2 (second daily demand) and I4 is 11.49 (mean). This last value is locked
in the formula. The cursor is showing an easier way to fill out the (x-mu)^2 column instead of filling it manually.
The way to do it is to select this second cell and drag down the fill handle (shown as a ‘+’ sign) until reaching the
desired value. You will find that Excel fills out the cells automatically.
Transcript
So, if I now drag this down, you will see what happens. In the next cell, the location A3 has been updated, but
I4, where μ is, has remained locked, so it is repeated. So now that I know this is right, I can put my
cursor here and, when I see the crosshair, double-click.
Excel Sheet (12 of 14) - Slide 93
The slide shows the (x-mu)^2 column filled. In addition, a new cell called "Standard deviation" is created and
placed below the Expected value cell. The slide intends to show how to input a set of functions for calculating the
standard deviation. Next to the "Standard deviation" cell, the function "=SQRT(sumproduct(" is written and ready
to be filled by using the syntax “=SQRT(sumproduct(array 1, [array 2], [array 3]...).”
Transcript
So now I can calculate my standard deviation. And once again this is going to be a sum product of which
columns? X minus μ squared times the probability, and then the square root of the whole thing. So I'm going to say
square root of the sum product, where array 1 is X minus μ squared, following the formula, and array 2 is the
probabilities.
The slide shows how the =SQRT(SUMPRODUCT( function is filled out by selecting the (x-mu)^2 and Probability
columns. This cell has the formula "=SQRT(SUMPRODUCT(D2:D21, C2:C21))", where D2:D21 is the (x-mu)^2 data
and C2:C21 is the Probability data.
Transcript
The slide shows the outcome of the Standard deviation function, which is 6.205. In addition, one standard
deviation is between 17.69 (expected value + standard deviation) and 5.28 (expected value − standard
deviation).
Download the Expected Value excel file (Refer to Analysis - Second worksheet)
Transcript
So here we go: this is the square root of the sum of the products of the values in this column multiplied by this
column, summed up as we go along, exactly what we need here. So, when I press return, I see that the standard
deviation is 6.205. So, if I think about one standard deviation away, what is that? One standard deviation away is
this value plus 6.205 and this value minus 6.205. So, about 68% of the time the demand is between roughly
5 and 17. So, while it is good to know the average, the standard deviation tells you the variability that you
will notice around that average on a long-term basis.
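The Excel steps in this walkthrough map onto a few lines of Python. The sketch below is an illustration, not the course's method: the function name `expected_value_and_sd` is hypothetical, and because the 20-row demand worksheet is not reproduced in this text, the demonstration reuses the siblings counts from earlier in the lesson as sample input.

```python
import math

def expected_value_and_sd(values, occurrences):
    """Mirror the spreadsheet steps for a discrete random variable."""
    total = sum(occurrences)                          # like =SUM(...) over the Occurrences column
    probs = [n / total for n in occurrences]          # like =B2/$I$1 filled down
    mean = sum(v * p for v, p in zip(values, probs))  # like =SUMPRODUCT(values, probabilities)
    sq_dev = [(v - mean) ** 2 for v in values]        # like =(A2-$I$4)^2 filled down
    sd = math.sqrt(sum(d * p for d, p in zip(sq_dev, probs)))  # like =SQRT(SUMPRODUCT(...))
    return mean, sd

# Sample input: siblings counts from the earlier example (not the demand worksheet)
mean, sd = expected_value_and_sd([0, 1, 2, 3, 4], [3, 6, 5, 4, 2])
print(f"mean={mean:.2f}, sd={sd:.2f}")
print(f"one standard deviation range: {mean - sd:.2f} to {mean + sd:.2f}")
```

Passing the worksheet's Daily Demand and Occurrences columns as the two lists should follow the same steps as the spreadsheet.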
Lesson 2-5 Normal Distribution
A continuous random variable is not defined at specific values. Instead, it is defined over an interval of
values, and is represented by the area under a curve.
The probability of observing any single value is equal to 0, since the number of values which may be
assumed by the random variable is infinite.
Transcript
Continuous random variables are usually measurements, and they take on an infinite number of values, for
example, the exact temperature outside. This type of random variable is not defined at specific values. Instead,
it's defined over an interval of values, and its probability is represented by the area under a curve. In advanced
mathematics, this is known as an integral. The probability of observing any single value is equal to zero, since the
number of values which may be assumed by the random variable is infinite.
Properties of Continuous Probability Distribution - Slide 97
Variable “X,” could take any value over an interval of real numbers,
X: Water Temperature [0,100]
Then the probability function p(x) must satisfy the following: P(x) ≥ 0 for all x and the total area under the
curve is equal to 1.
Transcript
Consider a variable X, which could take on any value over an interval of real numbers. Let's consider water
temperature. Water freezes at zero degrees Celsius and boils at 100 degrees Celsius. So, if water temperature is
what we are recording, then X, our random variable, is water temperature, and it can take on any value between
zero and 100. Then the probability function, p(x), must satisfy the following. p(x) is greater than or equal to zero
for all values of X in that interval. In other words, no value has a negative probability. And the total area under the
curve is equal to one, which means collectively all possibilities are accounted for, and they sum up to one.
The slide shows a histogram, where the x-axis ranges from 0 to 20. In the histogram there is a bell-shaped curve
of normal distribution that spreads from 0 to 20, with a mean of 10.
Transcript
The normal distribution is one of the best-known distributions used for continuous random variables. But as you will
see as we go through this course, the normal distribution applies to more than continuous variables. Given
the right conditions, we can use this distribution to approximate distributions used for discrete variables as
well. That is why it is very important to learn about this distribution and how to use it. The normal distribution is a
bell-shaped curve, which is defined by its central point, the mean, and its standard deviation, sigma.
The slide shows two normal distribution density plots, where the x-axis is the value of the random variable and the y-axis is the density.
For the normal distribution curve on the left, the x-axis ranges from 460 to 540, and the y-axis ranges from 0 to
0.030. Here, the plot is concentrated in the center of 500 (i.e. mean= 500) and decreases on either side. For the
normal distribution curve on the right, the x-axis ranges from 264 to 336, and the y-axis ranges from 0 to 0.030.
This plot is concentrated in the center of 300 (i.e. mean=300) and decreases on either side. The two plots have
the same standard deviation of 15 which means they have same width and height.
Transcript
The mean, which is also the median, represents the center point of the distribution and the standard deviation
controls the shape. Assume you have the following distribution. Where the mean is 500 and the standard
deviation is 15. This curve is most peaked at the mean of 500. Now if we have a distribution with the same
standard deviation of 15, but mean of 300, the curve will look like this. The shape of the two curves looks exactly
alike, because they both have the same standard deviation. The second curve, however, peaks at 300, its mean.
Now imagine another normal curve. This time the mean is the same value of 500, but the standard deviation is 5.
The Normal Distribution (3 of 3) - Slide 100
The slide shows two normal distribution density plots, where the x-axis is the value of the random variable and the y-axis is the density.
For the normal distribution curve on the left, the x-axis ranges from 460 to 540, and the y-axis ranges from 0 to
0.030. Here, the plot is concentrated in the center of 500 (i.e. mean= 500) and decreases on either side. For the
normal distribution curve on the right, the x-axis ranges from 460 to 540, and the y-axis ranges from 0 to 0.08.
This plot is also concentrated in the center of 500 (i.e. mean=500) and decreases on either side. The two plots
have different standard deviations. The left graph's standard deviation (sd) is 15 and the right one is 5. Since the
sd of the second graph is smaller, the second plot is higher and slimmer.
Transcript
As you can see, both curves peak at 500, but the second curve's spread is a lot less than the first curve's. So, the
larger the standard deviation, the more spread out the values of the distribution are, as is the case for this curve.
Likewise, a smaller standard deviation means the values are less spread out, in which case the mean is
a much better representation of what one might observe.
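The effect of the standard deviation on the shape can be checked numerically. The following Python sketch is an added illustration using the standard library's `statistics.NormalDist` (available in Python 3.8 and later); the variable names are assumptions. It evaluates the height of each curve at its mean for the three cases described: mean 500 with standard deviation 15, mean 300 with standard deviation 15, and mean 500 with standard deviation 5.

```python
from statistics import NormalDist

# The three normal curves described in the lesson
wide = NormalDist(mu=500, sigma=15)
shifted = NormalDist(mu=300, sigma=15)
narrow = NormalDist(mu=500, sigma=5)

# Height of each density curve at its own mean (the peak)
peak_wide = wide.pdf(500)
peak_shifted = shifted.pdf(300)
peak_narrow = narrow.pdf(500)

print(f"sd=15, mean=500: peak height {peak_wide:.4f}")
print(f"sd=15, mean=300: peak height {peak_shifted:.4f}")
print(f"sd=5,  mean=500: peak height {peak_narrow:.4f}")
```

The two curves with the same standard deviation have the same peak height, so changing the mean only moves the curve. Dividing the standard deviation by three makes the peak three times as high, which is consistent with the taller, slimmer curve on the slide.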
The curve is symmetrical about its mean (which is also its median).
The tails of the normal extend to infinity in both directions.
The area under the normal curve to the right of the mean equals the area under the normal to the left of
the mean, and the area under each half is 0.5.
There is a bell-shaped curve spreading from 976 to 1024 in the x-axis and from 0 to 0.04 in the y-axis. The curve
has a symmetric normal distribution centered at the mean of 1000, which is the same as the median. Both the left
and right side area of the mean is 50%.
Transcript
Other properties of the normal curve are that the curve is symmetrical about its mean. That means the left and the
right side are mirror images of one another. Both tails of the distribution extend to infinity, getting closer and
closer to the horizontal axis, but never touching it. The left side goes to negative infinity, while the right side
goes to positive infinity. In spite of its infinite width, the area under the curve is one. Since the normal curve is
symmetrical, the area under the normal curve to the right of the mean equals the area under the curve to the left of
the mean. And each of these equals 50%. Together they make 100% or one.
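These area properties can be verified numerically. The sketch below is an added Python illustration using `statistics.NormalDist` (Python 3.8 or later), assuming the curve from the slide with a mean of 1000 and a standard deviation of 10. The total area is approximated with a midpoint Riemann sum over the mean plus or minus eight standard deviations, since the tails beyond that range contribute a negligible amount.

```python
from statistics import NormalDist

dist = NormalDist(mu=1000, sigma=10)

# Area to the left of the mean, and the remaining area to the right
left_half = dist.cdf(1000)
right_half = 1 - left_half

# Approximate the total area under the density curve from 920 to 1080
step = 0.01
n_steps = int(160 / step)
total_area = sum(dist.pdf(920 + (i + 0.5) * step) for i in range(n_steps)) * step

print(f"left half:  {left_half:.4f}")
print(f"right half: {right_half:.4f}")
print(f"total area: {total_area:.6f}")
```

Each half comes out as 0.5, and the summed area is one to within the precision of the approximation.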
Given a normal distribution with mean of 1000 and standard deviation of 10, we want to know: What is the
probability of random variable X to be:
P(995 ≤ X ≤ 1005) = ?
Transcript
Imagine we have a normal distribution with a mean of 1000 and a standard deviation of 10. We want to know
the probability of this random variable X being greater than or equal to 995 and less than or equal to 1005. To
answer this question, let's first see what we are asking in visual form.
Standard Normal Curve (2 of 5) - Slide 103
Given a normal distribution with mean of 1000 and standard deviation of 10, we want to know: What is the
probability of random variable X to be:
P(995 ≤ X ≤ 1005) = ?
The slide shows a bell-shaped curve spreading from 0 to 1032 in the x-axis and from 0 to 0.04 in the y-axis. The
curve has a symmetric normal distribution centered at the mean of 1000, and the highest point in the y-axis is at
0.04 density. The area under the curve from 995 to 1005 is filled in blue.
Transcript
The answer to our question is the blue-shaded area under this normal curve. To answer this mathematically, we
first find the equation for the curve and then take the integral under the curve between the two end points of 995
and 1005. This is definitely one way to solve the question asked, but there is another, more
expeditious way of solving it, and that is by understanding a special form of the normal curve known
as the standard normal curve.
Mean (μ) = 0
Standard Deviation (σ) = 1
The normal random variable of a standard normal distribution is called a standard score or a z-score.
Norm.Dist(X,Mean,SD,1)
The slide shows a standard normal distribution curve spreading from −3 to 3 in the x-axis and from 0.0 to 0.4 in
the y-axis. The y-axis is the density and the mean is 0.
Transcript
The standard normal distribution is a special case of the normal distribution, which has a mean of zero and a
standard deviation of one. The normal random variable of a standard normal distribution is called a standard score,
or more commonly the z-score. The z-score tells us how many standard deviations the value of interest is
above or below the average. We saw the z-score in an earlier lesson; it is calculated by taking the value of
interest minus the average of all values, divided by the standard deviation of all values. Now why bother with this?
Well, the area under the standard normal curve has already been calculated for all possible values of z. There are
normal tables to look at, and all statistical software packages have these values figured out.
So then, for any given normal curve, all we need to do is convert it to the standard normal curve instead of doing
calculations individually for each value of mean and standard deviation. The function NORM.DIST in Excel, with its
last argument set to 1, returns the cumulative probability, that is, the area under the normal curve to the left of
X. Make sure to watch the Excel illustration videos to
learn about this.
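To illustrate why standardizing works, the sketch below compares the cumulative probability computed on the original curve with the one computed on the standard normal curve after converting to a z-score. It is an added Python example using `statistics.NormalDist` (Python 3.8 or later), offered as a rough analogue of the spreadsheet call Norm.Dist(X,Mean,SD,1); the value X = 1005 and the variable names are assumptions chosen for illustration.

```python
from statistics import NormalDist

mean, sd = 1000, 10
x = 1005

# Cumulative probability P(X <= x) on the original normal curve
p_direct = NormalDist(mu=mean, sigma=sd).cdf(x)

# Standardize, then look up P(Z <= z) on the standard normal curve (mean 0, sd 1)
z = (x - mean) / sd
p_standard = NormalDist().cdf(z)

print(f"z = {z}")
print(f"P(X <= {x}) = {p_direct:.4f}")
print(f"P(Z <= {z}) = {p_standard:.4f}")
```

Both approaches give the same area, about 0.6915, which is why one table of standard normal values is enough for any mean and standard deviation.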
The slide shows two bell-shaped curves with symmetrical normal distribution. To the left, is the curve presented
in Slide 103 Standard Normal Curve (2 of 5). This curve spreads from 0 to 1032 in the x-axis and from 0 to 0.04
in the y-axis. It has a mean of 1000 and the highest point in the y-axis is at 0.04 density. The area under the
curve from 995 to 1005 is filled in blue. Here is some additional information:
P(995 ≤ X ≤ 1005) = ?
To the right, is the curve presented in Slide 104 Standard Normal Curve (3 of 5). This curve spreads from −3 to 3
in the x-axis and from 0.0 to 0.4 in the y-axis. It has a mean of 0 and the area under the curve from −0.5 to 0.5 is
filled in blue. Here is some additional information:
X = 995, z = (x − μ) ÷ σ = (995 − 1000) ÷ 10 = −0.5
In the slide there is also an arrow labeled "standardized" that goes from the left to the right curve.
Transcript
Mean = 1000
Standard Deviation = 10
The slide shows the bell-shaped curve with symmetrical normal distribution shown in Slide 104 Standard Normal
Curve (3 of 5). This curve spreads from −3 to 3 in the x-axis and from 0.0 to 0.4 in the y-axis. It has a mean of 0
and the area under the curve from −0.5 to 0.5 is filled in blue. Here is some additional information:
Transcript
Now back to our example. With a mean of 1000 and a standard deviation of 10, we want to know the
probability of the random variable being between 995 and 1005. We can convert the values to z-scores. Again, this
is called standardizing. For 995, z is calculated by subtracting 1000 from 995 and dividing the difference by 10,
which is negative 0.5, and for 1005, z is 0.5. So basically, asking what is the probability of X being between 995
and 1005 is the same as asking what is the probability of z being between negative 0.5 and positive 0.5. We can
answer this translated question a lot faster by either looking up the information in a normal table or using software
like Excel, which will give us 0.3829. In other words, for a normally distributed population with a mean of 1000 and
a standard deviation of 10, the probability of observing a value between 995 and 1005 is 38.29%. Again, to learn
how to use Excel to find these values, please watch the Excel tutorial videos.
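The 0.3829 can be reproduced as the difference of two cumulative probabilities. This is an added Python sketch using `statistics.NormalDist` (Python 3.8 or later), not the Excel procedure from the tutorial videos; the variable names are assumptions.

```python
from statistics import NormalDist

mean, sd = 1000, 10
low, high = 995, 1005

# Standardize both end points
z_low = (low - mean) / sd    # -0.5
z_high = (high - mean) / sd  # 0.5

# P(z_low <= Z <= z_high) = P(Z <= z_high) - P(Z <= z_low)
std_normal = NormalDist()
p_between = std_normal.cdf(z_high) - std_normal.cdf(z_low)

print(f"P({low} <= X <= {high}) = {p_between:.4f}")  # 0.3829
```

Subtracting the area to the left of the lower end point from the area to the left of the upper end point leaves exactly the shaded region between them.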
Standard Deviation - Slide 107
The slide shows three standard normal distribution density plots with a mean of 0. The first curve has 68% area
filled with blue from −1 to 1, which represents one standard deviation from the mean. The second curve has
95.45% area filled with blue from −2 to 2, which represents two standard deviations from the mean. The third
curve has 99.73% area filled with blue from −3 to 3, which represents three standard deviations from the mean.
Transcript
For most data sets, the majority of observations clump around the average, with the number of observations
decreasing the farther values are from the average in either direction. Standard deviation is the most common
measure of variation and tells us how the whole collection of values varies. For a normal distribution, we expect
about 68% of a population to be within one standard deviation of the mean, about 95% of observations to fall within
two standard deviations of the mean, and 99.7% to fall within three standard deviations of the mean. Observations
outside of three standard deviations are considered rare and are called outliers. So now let's practice.
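These percentages follow directly from the standard normal curve. The short Python sketch below is an added illustration using `statistics.NormalDist` (Python 3.8 or later); the dictionary name `within` is an assumption.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Area between -k and +k standard deviations, for k = 1, 2, 3
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} standard deviation(s): {p:.4f}")
```

The results, roughly 0.6827, 0.9545, and 0.9973, match the 68%, 95.45%, and 99.73% areas shown on the slide.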
The slide shows two normal distribution density plots: Curve A and Curve B. Both curves have the same mean of
50, which means both plots are centered at 50. Curve A spreads from 10 to 90 with a highest point at 0.025 in
the y-axis. Curve B spreads from 25 to 75 with a highest point at 0.05 in y-axis, which makes it taller and
slimmer.
Transcript
We have these two normal curves centered around the value of 50. Can you approximate the standard deviation
of curve A and curve B?
The slide shows the two plots shown in Slide 108 Let's Practice. In addition, there is a calculation about how far
the tail is from the mean. Curve A has a width of 95 minus 50, which spans three standard deviations from the
mean. This means the standard deviation is calculated as (95 − 50) ÷ 3, which is equal to 15. Curve B has a width
of 75 minus 50, which spans three standard deviations from the mean. In this case, it is calculated as
(75 − 50) ÷ 3, which is approximately 8.
Transcript
Look at curve A. The curve seems to go from the mean to about 95 before the tail becomes very thin. So, the value
95 is roughly three standard deviations above the mean. So that would be a width of 95 minus 50, which is 45, and
if that is about three standard deviations, then the standard deviation must be around 15. Now look at curve B. The
same approximation applies. The curve seems to get thin at about 75. So, take 75 minus 50 and divide it by three.
This distribution has a standard deviation of about eight.
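The eyeball estimate above is just a division. A small sketch of it follows, assuming (as the transcript does) that the tail thins out about three standard deviations from the mean; the function name `approx_sd` is hypothetical.

```python
def approx_sd(tail_value, mean, n_sd=3):
    """Rough standard deviation, assuming the tail thins out n_sd standard deviations from the mean."""
    return (tail_value - mean) / n_sd

sd_a = approx_sd(95, 50)  # curve A: (95 - 50) / 3
sd_b = approx_sd(75, 50)  # curve B: (75 - 50) / 3
print(sd_a, round(sd_b, 2))
```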
Normal Distribution as a Model - Slide 110
Transcript
While perfectly symmetrical curves may not exist in real-world applications, we often do encounter phenomena
in the real world which follow at least a near-normal distribution. This allows researchers to use the normal
distribution as a model for assessing probabilities associated with these real-world phenomena. Furthermore, as
you will learn later in the course, the normal curve allows us to use sample information in order to gain an
understanding of the population. The normal curve can also be used as a good approximation for some discrete
random variables and distributions. For this reason, it is the most important probability distribution in the field of
statistics.
Mean = 1000
Standard Deviation = 10
The slide shows the same information as Slide 106 Standard Normal Curve (5 of 5). There is a bell-shaped curve
with symmetrical normal distribution. The curve spreads from −3 to 3 in the x-axis and from 0.0 to 0.4 in the y-
axis. It has a mean of 0 and the area under the curve from −0.5 to 0.5 is filled in blue. Here is some additional
information:
Transcript
In Lesson 5, I showed you this slide where I found the probability of x being between 995 and 1,005 to be .3829,
when we have a distribution which has a mean of 1,000 and a standard deviation of 10. I asked you to watch the
Excel video to know exactly how I came up with this number. So, this is what I'm going to show you today.
The slide shows a new Excel sheet intending to know how to calculate the probability of a normal distribution. To
the left, there are two statistical inputs vertically shown: Mean and Standard deviation. These inputs have a value
of 1000 and 10, respectively. In addition, a couple cells below the Mean and Standard deviation columns, there is
a cell that says "p(995 ≤ x ≤ 1005)."
Download the Normal Distribution excel file (Refer to Mean_SD_P - First worksheet)
Transcript
Here we have a mean of 1,000 and a standard deviation of 10, and we are asking for the probability of x being
between 995 and 1,005. One of the things that is very helpful at the beginning, when you're trying to learn how to
work with the normal distribution, is to draw it.
Normal Distribution Excel Sheet (2 of 11) - Slide 113
The slide shows a sketch of a bell-shaped distribution curve with the area between 995 and 1005 shaded.
Transcript
The slide shows the curve with two arrows: one coming from the x-axis of 1005 and the other one coming from
995. Both arrows are pointing to the left.
Transcript
So, I'm going to show you what this looks like in picture, so you can visualize it, and that way you would
understand what the software is doing for you much better. So, if I can visualize this curve, and forgive me if this
is not perfect, because I'm not very good in drawing, but it would be a symmetrical curve, normal curve, that
looks like this, right? And it would have a center, a mean, that is at 1,000. And we are talking about what is the
probability of it being between 995 and 1,005? And if I can draw this in terms of probability and see the shaded
area, this is what we are talking about. So, this is 1,005. And this is 995. The way Excel works is that if I give it
the value of X to be 1,005, it's going to give me everything to the left of this. Same thing will happen if I give it
995. It will give me everything to the left of this. So, as you can see, if I take the larger area and subtract the
smaller area from it, I should be able to get the shaded area. So, let's just do that, and see what function I will
use.
The slide shows a new probability cell below the p(995 ≤ x ≤ 1005) cell that says "p(x ≤ 1005)." This new
probability cell is not computed yet.
Transcript
Normal Distribution Excel Sheet (5 of 11) - Slide 116
The slide shows a closeup of the cell next to the p(x ≤ 1005) cell where the normal distribution function will be
calculated. Once written =NOR, the program automatically unfolds eight options that say: NORM.DIST,
NORM.INV, NORM.S.DIST, and NORM.S.INV, among others. The first option (NORM.DIST) is selected in order
to compute the probability.
Transcript
The slide shows the syntax of the =NORM.DIST function, which is "=NORM.DIST(x, mean, standard_dev,
cumulative)." The values used are: the x value is 1005, the mean is 1000, and the standard deviation is 10. For
the cumulative value two options unfold: "True - cumulative distribution function" and "False - probability mass
function." The cursor is selecting the "True" option, which is represented by a #1 inside the function.
Transcript
So first, I'm going to look at the probability of x being less than or equal to 1,005. That is the larger area. And
the function I'm going to first show you is the simple function of norm.dist. You can see there are four functions
that come up that start with the word norm. These are all for the normal distribution. Here, I'm going to focus only
on norm.dist, so I'm going to tap on that. And you will see that the first argument it asks for is x. X here is 1,005;
that is your variable of interest. The mean here is 1,000. The standard deviation is 10, and the last argument asks
if you're looking for the cumulative distribution function. And the answer to that is yes. True. So, you can either
enter true, or you can put a value of 1; 1 gets translated to true, 0 gets translated to false. You will actually not
have any occasion in this class where you have to calculate the probability mass function, as it's called by Excel.
So, you always set the last argument to 1. So, put 1 there, close the parentheses, and press return. And what you
get is .6915, so what is that? That is basically this shaded area in yellow, 69%, right? Because everything to the
left of the mean is 50%. We are a little bit to the right of the mean, so it has to be more probability. So, we have
found .69 to be the yellow shaded area. So now I'm going to find this other area. That is the probability of x being
less than or equal to 995.
The slide shows the outcome of the normal distribution function where x is ≤ to 1005, which is 0.6915. In
addition, the bell-shaped distribution curve has two shaded areas: in yellow is the area between 995 and 1005,
and in green is the area to the left of 995.
Transcript
Normal Distribution Excel Sheet (8 of 11) - Slide 119
The slide shows a new probability cell below the p(x ≤ 1005) cell that says "p(x ≤ 995)." To the right of this new
cell the normal distribution function will be calculated. Once written =NORM.DIST the program automatically
shows the syntax of this function, which is “=NORM.DIST(x, mean, standard_dev, cumulative).” Here, the x value
is 995, the mean is 1000, the standard deviation is 10, and the cumulative is 1 (True).
Transcript
The slide shows the outcome of the normal distribution function where x is ≤ to 995, which is 0.30853.
Transcript
So, I will repeat the same thing. What is the probability of x being less than or equal to 995? And you can see,
Excel tries to be helpful with its suggestions; you don't need to use them, but they are there. So it's going to be
norm.dist, and again, I will pick this. And the only thing that changes now is that my x is 995; the mean is 1,000,
the standard deviation is 10, and the last argument is 1. So, there we go. The green area is .3085, so the shaded
area in yellow, which is what I'm looking for, is going to be the difference of the larger area minus the smaller
area, .38292, which is exactly what I showed you in our PowerPoint presentation.
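The same "larger area minus smaller area" calculation can be reproduced outside Excel. This is a minimal sketch, assuming Python's standard-library `statistics.NormalDist`, whose `cdf` method plays the role of =NORM.DIST(x, mean, standard_dev, 1); the variable names are illustrative only.

```python
from statistics import NormalDist

dist = NormalDist(mu=1000, sigma=10)

p_upper = dist.cdf(1005)  # area to the left of 1005, like =NORM.DIST(1005, 1000, 10, 1)
p_lower = dist.cdf(995)   # area to the left of 995, like =NORM.DIST(995, 1000, 10, 1)
p_between = p_upper - p_lower  # shaded area between 995 and 1005

print(round(p_upper, 4), round(p_lower, 4), round(p_between, 5))
```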
The slide shows the process of calculating the probability of x ≥ to 995 and ≤ to 1005. Here, the formula used is
"=B7 − B8", where B7 is 0.6915 (Probability of x ≤ to 1005), and B8 is 0.3085 (Probability of x ≤ to 995).
Transcript
The slide shows the outcome of the normal distribution function where x is ≥ to 995 and ≤ to 1005, which is
0.38292.
Download the Normal Distribution excel file (Refer to Norm_DIST - Second worksheet)
Transcript
Mean = 1000
Standard Deviation = 10
The slide shows the same information as Slide 111 Standard Normal Curve. There is a bell-shaped curve with
symmetrical normal distribution. The curve spreads from −3 to 3 in the x-axis and from 0.0 to 0.4 in the y-axis. It
has a mean of 0 and the area under the curve from −0.5 to 0.5 is filled in blue. Here is some additional
information:
Transcript
In the next video, I'm going to show you how to do it if we were to follow these steps verbatim. Now, the only
reason I'm showing you is because I want you to know how the functions work. In real life, obviously, you pick
whatever is best for you. I like to use what I showed you just now, because it's the fastest way of coming up
with the answer.
Standard Normal Curve (1 of 2) - Slide 124
Mean = 1000
Standard Deviation = 10
The slide shows the same information as Slide 123 Standard Normal Curve. There is a bell-shaped curve with
symmetrical normal distribution. The curve spreads from −3 to 3 in the x-axis and from 0.0 to 0.4 in the y-axis. It
has a mean of 0 and the area under the curve from −0.5 to 0.5 is filled in blue. Here is some additional
information:
On the right hand bottom of the slide, is the function NORM.S.DIST for calculating the standard normal
distribution. This function is highlighted.
Transcript
Going back to this example from Lesson 5 in Module 2, I'm going to show you how to use Excel to do the
process that we have shown in the slide. This will introduce you to a new function, norm.s.dist. So, what we
wanted to know was, if we have a mean of 1,000 and a standard deviation of 10 for a normal distribution,
what would be the chances of observing something at random that falls between 995 and 1,005. In this slide, I
first solve for the z, which means I standardize my normal distribution, which had a mean of 1,000 and a standard
deviation of 10, to the standard normal distribution, which has a mean of 0 and a standard deviation of 1.
Standard Normal Curve (2 of 2) - Slide 125
Mean = 1000
Standard Deviation = 10
The slide shows the same information as Slide 124 Standard Normal Curve (1 of 2). There is a bell-shaped curve
with symmetrical normal distribution. The curve spreads from −3 to 3 in the x-axis and from 0.0 to 0.4 in the y-
axis. It has a mean of 0 and the area under the curve from −0.5 to 0.5 is filled in blue. Here is some additional
information:
In this slide, the equations for calculating the z-score are highlighted.
Transcript
And that means using this equation. And that resulted in 1,005 being z of 0.5 and negative 0.5 for 995.
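The equation referred to here is the usual standardization, z = (x − mean) ÷ standard deviation. A minimal sketch of that step follows; the function name `z_score` is hypothetical and not part of the course material.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

mean, sd = 1000, 10
z_upper = z_score(1005, mean, sd)  # z for x = 1005
z_lower = z_score(995, mean, sd)   # z for x = 995
print(z_upper, z_lower)
```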
Excel Sheet (1 of 9) - Slide 126
To the left of the slide are two statistical inputs vertically shown: Mean and Standard deviation. These inputs have
a value of 1000 and 10, respectively. Below these values are two columns: x and z. Column x has the values
1005 and 995, and column z has the values 0.5 and −0.5. To the right, is a screenshot of Slide 125 Standard
Normal Curve (2 of 2). In the middle, a bell-shaped distribution curve with a shaded area between 995 and 1005
is shown. This curve has two arrows: one coming from the x value of 1005 and the other one coming from the x
value of 995. Both arrows are pointing to the left.
Download the Normal Distribution excel file (Refer to Norm_S_DIST_XZ - Third worksheet)
Transcript
So, basically, it's the same as before, except we now have a normal curve that looks like this: it has a mean of 0
and a standard deviation of 1, we are 0.5 away from the mean on the right side and 0.5 away on the left side, and
we are looking for this shaded area. So, this is 0.5 and negative 0.5. So, we will do the same thing. We will find
the area to the left of 0.5, and we will find the area to the left of negative 0.5; subtracting the two will give me
the answer that I'm looking for.
The slide shows two new cells placed below the x column. One of the cells says "p(x ≤ 0.5)" and the other says
"p(x ≤ −0.5)." The cell next to the p(x ≤ 0.5) cell is where the normal distribution function will be calculated. Once
written =NO, the program automatically unfolds eleven options that say: NOMINAL, NORM.DIST, NORM.INV,
NORM.S.DIST, and NORM.S.INV, among others. The fourth option (NORM.S.DIST) is selected in order to
compute the probability.
Transcript
So, to find those probabilities, I am going to say what is the probability of, technically speaking you should say z,
but I'm going to just say x less than or equal to 0.5. And then, I also want to know what is the probability of x less
than or equal to negative 0.5. And to solve for these, I'm going to use the norm.s.dist.
Once selecting =NORM.S.DIST, the program automatically shows the syntax of this function, which is
“=NORM.S.DIST(z, cumulative).” Here, the z value is 0.5. For the cumulative value two options unfold: "True -
cumulative distribution function" and "False - probability mass function." The cursor is selecting the "True" option,
which is represented by a #1 inside the function.
Transcript
And what you will see is that now this function has only two arguments. It only wants to know your z, and the
second one is cumulative, which we will always enter as 1. It is not asking you for the mean and it is not
asking you for the standard deviation, because it knows that you're using a standard normal distribution. So, I'm
going to give it the z of 0.5 and a 1, and it will, in turn, return the value of 0.69. And for this one, I'm going to
use norm.s.dist again.
Excel Sheet (4 of 9) - Slide 129
The slide shows the outcome of the normal distribution function where x is ≤ to 0.5, which is 0.6915. In addition,
the value of the normal distribution function when x is ≤ to −0.5 is also shown, which is 0.3085.
Transcript
Again, I'm going to click on that tab and this time I'm going to say negative 0.5 and a 1.
The slide shows a new probability cell below the p(x ≤ −0.5) cell called "p(−0.5 ≤ x ≤ 0.5)." The cell next to the
p(−0.5 ≤ x ≤ 0.5) cell is where the process of calculating the probability of x in and between −0.5 and 0.5 will
occur. The formula used in this process is "=B9 − B10", where B9 is 0.6915 (Probability of x ≤ to 0.5), and B10 is
0.3085 (Probability of x ≤ to −0.5).
Transcript
Excel Sheet (6 of 9) - Slide 131
The slide shows the outcome of the normal distribution function when x is in and between −0.5 and 0.5, which is
0.38292.
Transcript
So, ultimately, the probability of being between negative 0.5 standard deviations and 0.5 standard deviations is
simply the difference between the cumulative probability at z of 0.5 and the cumulative probability at z of
negative 0.5. And that would be 0.38, which is exactly what we have. This is exactly what we had before when I
used just norm.dist.
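To see that the standardized route and the direct route agree, here is a short sketch, again assuming Python's `statistics.NormalDist` in place of Excel. `NormalDist()` with no arguments is the standard normal (mean 0, standard deviation 1), so its `cdf` corresponds to =NORM.S.DIST(z, 1).

```python
from statistics import NormalDist

standard_normal = NormalDist()            # mean 0, standard deviation 1
original = NormalDist(mu=1000, sigma=10)  # the distribution from the example

# Standardized route: areas to the left of z = 0.5 and z = -0.5.
p_z = standard_normal.cdf(0.5) - standard_normal.cdf(-0.5)

# Direct route: areas to the left of x = 1005 and x = 995.
p_x = original.cdf(1005) - original.cdf(995)

print(round(p_z, 5), round(p_x, 5))
```

Both routes give the same shaded area; the only difference is whether the mean and standard deviation are supplied to the function or applied by hand beforehand.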
The slide shows the same information as the one presented in Slide 122 Normal Distribution Excel Sheet (11 of
11). To the left, the Mean, Standard deviation, and the three p cells are vertically listed. The list contains the
mean as 1000, the standard deviation as 10, the probability of x in and between 995 and 1005 as 0.3829, the
probability of x ≤ to 1005 as 0.6915, and the probability of x ≤ to 995 as 0.3085. In the middle of the slide is a
bell-shaped distribution curve with two shaded areas: in yellow is the area between 995 and 1005, and in green is
the area to the left of 995.
Transcript
The slide shows how the values of p(x ≤ 1005) and p(x ≤ 995) were deleted and a new formula will be added. To
the right of the p(x ≤ 1005) cell the normal distribution function will be calculated. Once written =NORM.DIST, the
program automatically shows the syntax of this function, which is “=NORM.DIST(x, mean, standard_dev,
cumulative).” Here, the x value is 1005, the mean is 1000, the standard deviation is 10, and the cumulative is 1
(True).
Transcript
So, what is the difference? In this one, I had to give, as you can see, I had to give the mean and the standard
deviation of my distribution. And it would automatically translate it back to standard normal and give me the
values.
The slide shows the same information as the one presented in Slide 131 Excel Sheet (6 of 9). Here, the values of
p(x ≤ 0.5) and p(x ≤ −0.5) were deleted and a new formula will be added. To the right of the p(x ≤ 0.5) cell the
normal distribution function will be calculated. Once written =NORM.S.DIST, the program automatically shows
the syntax of this function, which is “=NORM.S.DIST(z, cumulative).” Here, the z value is 0.5 and the cumulative
is 1 (True).
Download the Normal Distribution excel file (Refer to Norm_S_DIST - Fourth worksheet)
Transcript
Whereas, in here, I only had to give how many standard deviations away I am, and I don't need to enter the mean
and the standard deviation for the particular curve we had. So, I had to manually take an extra step by
calculating the z-score first. So, obviously, this is less efficient, but there are times when you will want to use
norm.s.dist. So, it's important for you to know the differences between the two functions.
X = 458
The slide shows a bell-shaped curve with symmetrical normal distribution. The curve spreads from −3 to 3 in the
x-axis and from 0.0 to 0.4 in the y-axis. It has a mean of 0 and the left area under the curve from −0.42 to the left
end of the curve is filled in blue.
Transcript
Excel Sheet (1 of 4) - Slide 136
The slide shows a new Excel sheet with three statistical inputs vertically shown: x, Mean, and Standard
deviation. These inputs have a value of 458, 500 and 100, respectively.
Download the SAT Example 1 excel file (Refer to x_Mean_SD - First worksheet)
Transcript
In this example, the person who has taken the SAT exam gets a score of 458, and this is an exam that has a
mean of 500 and a standard deviation of 100. So, as it turned out, this is a negative z value, and we had to do
some other things to read it off the table. But what we are finding out is what percentage of people have scored
less than 458; that is, what is the percentile for the test taker who takes the exam and gets a 458?
The slide shows a bell-shaped distribution curve with a mean of 500 and a shaded area from 458 to the left of the
curve. In addition, there is a new cell called "p(x ≤458)" which is placed below the statistic columns. The cell next
to the p(x ≤ 458) cell is where the normal distribution function will be calculated. Once written =NOR, the program
automatically unfolds eight options that say: NORM.DIST, NORM.INV, NORM.S.DIST, and NORM.S.INV, among
others. The first option (NORM.DIST) is selected in order to compute the probability.
Transcript
Again, picture-wise, it looks just like this. So, here's your mean of 500. This person has scored below the mean,
so at least you know that it will be less than the 50th percentile. So, we're looking for 458, right here, and this is
the area we're looking for. So, it actually is very straightforward when you're using Excel. All I have to do is ask
for the probability of x being less than or equal to 458, which is the percentage of people who scored below this
value.
After selecting =NORM.DIST, in the cell right to the p(x ≤ 458) cell, the program automatically shows the syntax
of this function, which is “=NORM.DIST(x, mean, standard_dev, cumulative).” Here, the formula is shown as
=NORM.DIST(H2, H3, H4, 1) where H2 is the x value of 458, H3 is the mean value of 500, H4 is the standard
deviation value of 100, and 1 is the "True - cumulative distribution function."
Transcript
The slide shows the outcome of the normal distribution function when x is ≤ to 458, which is 0.337243.
Download the SAT Example 1 excel file (Refer to x_Mean_SD_P - Second worksheet)
Transcript
It's just simply norm.dist, and our x is 458, our mean is 500, our standard deviation is 100, and the last value is 1.
So, what is the probability of somebody getting less than 458? It is about 33.7%.
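As a cross-check of the percentile, here is a minimal sketch assuming Python's `statistics.NormalDist`, with `cdf` standing in for =NORM.DIST(458, 500, 100, 1); the variable names are illustrative only.

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=100)

# Area to the left of 458, which is the share of test takers scoring below 458.
p_below_458 = sat.cdf(458)
print(f"{p_below_458:.6f}")
```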
The slide shows a bell-shaped histogram. The curve spreads from −3 to 3 in the x-axis and from 0.0 to 0.4 in the
y-axis. Its peak is at 0.4 on the y-axis and 0 at the x-axis. The area of the curve at the left of 1.35 on the x-axis is
filled with blue.
Transcript
This example was in our lesson six of module two and I showed you how to get the answers using a normal
table. But I am going to illustrate to you how to do the same thing using Excel.
Excel Sheet (1 of 4) - Slide 141
To the left of the slide is the screenshot of Slide 140 Example - Continued. To the right, two statistical inputs
(Mean and Standard deviation) are vertically shown with their corresponding values: 500 and 100, respectively.
Below these statistical input, there is a cell called "p(x ≥ 635)."
Download the SAT Example 2 excel file (Refer to Mean_SD - First worksheet)
Transcript
Basically, what we have here is a population where the mean is set at 500. This is for the SAT, and the standard
deviation is set at 100. We want to know the probability that someone will get a score of 635 or better.
The slide shows a bell-shaped distribution curve with a mean of 500 and a shaded area from 635 to the right of
the curve. Here, the slide intends to show how to calculate the normal distribution of p(x ≥ 635). Once written
=NORM.DIST, in the cell next to the p(x ≥ 635) cell, the program automatically shows the syntax of this function,
which is “=NORM.DIST(x, mean, standard_dev, cumulative).”
Transcript
The way to answer this is again best seen in a picture. Our curve looks like this, centered at 500, and we want to
know the probability of getting 635 or better. So, we are looking for this probability. What Excel gives us, just
like the normal table, is everything to the left. In order for me to know this, I need to find that value first and
then subtract it from 1. So now, to find the probability of the shaded area, I know that it is 1 minus everything to
the left of it. Now I'm just going to say norm.dist, so I'm going to pick this tab.
After selecting =NORM.DIST, the formula written for this function is =NORM.DIST(635, 500, 100, 1), where 635
is x, 500 is the mean, and 100 is the standard deviation. For the cumulative value, two options unfold: "True -
cumulative distribution function" and "False - probability mass function." The cursor is selecting the "True" option,
which is represented by a #1 inside the function.
Transcript
To the left of the slide is the screenshot of Slide 140 Example - Continued with the value 0.0885 circled in red.
To the right is the outcome of the normal distribution function when x is ≥ to 635, which is 0.088508.
Download the SAT Example 2 excel file (Refer to Mean_SD_P - Second worksheet)
Transcript
My x is 635, the mean for the SAT is 500, the standard deviation is 100, and the last argument is always 1.
Subtracting that from 1, the answer is about 8.85%, which is exactly what we had in the slides.
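The "1 minus everything to the left" step can be sketched as follows, assuming Python's `statistics.NormalDist` as a stand-in for Excel (the Excel equivalent would be 1 minus =NORM.DIST(635, 500, 100, 1)); the variable names are illustrative only.

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=100)

p_left = sat.cdf(635)       # area to the left of 635
p_at_least_635 = 1 - p_left  # area to the right: a score of 635 or better
print(f"{p_at_least_635:.6f}")
```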
For the SAT exam, what is the probability that a randomly selected test has a score between 490 and 550?
P(490 ≤ x ≤ 550)
P(−0.1 ≤ z ≤ 0.5)
The slide shows a bell-shaped histogram. The curve spreads from −3 to 3 in the x-axis and from 0.0 to 0.4 in the
y-axis. Its peak is at 0.4 on the y-axis and 0 at the x-axis. The area under the curve from −0.1 to 0.5 on the x-axis
is filled in blue.
Transcript
In this example, we are asking the question of what is the probability that we pick a test taker at random and that
person has scored somewhere between 490 and 550 on their exam.
Excel Sheet (1 of 4) - Slide 146
The slide shows a new Excel sheet with two statistical inputs vertically shown: Mean and Standard deviation.
These inputs have a value of 500 and 100, respectively. Below these statistical inputs, is a new cell called "p(490
≤ x ≤ 550)." To the right, is a bell-shaped distribution curve with a shaded area between 490 and 550. In the
graph, there are two arrows: one coming from 550 and the other coming from 490. Both arrows are pointing to
the left.
Download the SAT Example 3 excel file (Refer to Mean_SD - First worksheet)
Transcript
So, once again, let me show you this in a picture and then we'll use Excel to solve this problem. So, effectively,
what we are saying is that, given our curve which is centered at 500, we are looking for the probability that the
person we pick at random has a score that is 550 or less, but more than 490. So, we are looking for this shaded
area and, as before, what we can do to solve this problem is first use Excel to give us the value to the left of 550,
then use it again to find the value to the left of 490, and what remains once we subtract these two is the area we
are looking for. So, how do we do this? We use norm.dist.
The slide intends to show how to calculate the normal distribution of p(490 ≤ x ≤ 550). Once written
=NORM.DIST, in the cell next to the p(490 ≤ x ≤ 550) cell, the program automatically shows the syntax of this
function, which is “=NORM.DIST(x, mean, standard_dev, cumulative).” The formula written for this function is
=NORM.DIST(550, 500, 100, 1), where 550 is x, 500 is the mean, and 100 is the standard deviation. For the
cumulative value, two options unfold: "True - cumulative distribution function" and "False - probability mass
function." The cursor is selecting the "True" option, which is represented by a #1 inside the function.
Transcript
After writing the function =NORM.DIST(550, 500, 100, 1) the professor types in the same cell -NORM.DIST to
calculate the probability of x ≤ to 490. Here, 490 is x, 500 is the mean, 100 is the standard deviation, and 1 is the
"True - cumulative distribution function." In this cell the complete function looks like: "=NORM.DIST(550, 500,
100, 1)-NORM.DIST(490, 500, 100, 1)."
Transcript
Excel Sheet (4 of 4) - Slide 149
The slide shows the outcome of the normal distribution function when x is ≥ to 490 and ≤ to 550, which is
0.23129.
Download the SAT Example 3 excel file (Refer to Mean_SD_P - Second worksheet)
Transcript
So, here we will say it's equal to norm.dist. We pick that, and first we put the upper boundary, which is 550,
then 500 for the mean, 100 for the standard deviation, and a 1. Then we subtract from it the lower boundary,
which is norm.dist for a score of 490, mean of 500, standard deviation of 100, and a 1. Press return, and we see
that it's 23%. So, it's very quick once you know what you are doing with Excel and you understand how to
solve it.
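The single-cell formula =NORM.DIST(550, 500, 100, 1)-NORM.DIST(490, 500, 100, 1) can be mirrored with a small reusable helper. This is a sketch assuming Python's `statistics.NormalDist`; the helper name `prob_between` is hypothetical.

```python
from statistics import NormalDist

def prob_between(dist, low, high):
    """Area under the curve between low and high: left-of-high minus left-of-low."""
    return dist.cdf(high) - dist.cdf(low)

sat = NormalDist(mu=500, sigma=100)
p_490_to_550 = prob_between(sat, 490, 550)
print(f"{p_490_to_550:.5f}")
```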
What test score will place you in the top 95th percentile?
Average (μ) = 500
P(x ≤ X) = 0.95
The slide shows a bell-shaped curved histogram. The curve spreads from −3 to 3 in the x-axis and from 0.0 to
0.4 in the y-axis. Its peak is at 0.4 on the y-axis and 0 at the x-axis. The area of the curve at the left of 1.645 on
the x-axis is filled with blue color.
Transcript
So, in this example I'm going to solve the problem the other way, which means I know the percentile I'm shooting
for, and I need to know what score qualifies me for that. Let's say the school that I'm really interested in applying
to takes only test takers at or above the 95th percentile. So, I need to know what score would qualify me for that
95th percentile. So, again, let me show you what it means in pictures.
The slide shows a new Excel sheet with a bell-shaped distribution curve with a mean of 500. To the right of the
mean is a value named score which is shaded and has a proportion of 0.95.
Transcript
So, essentially, what we are saying is that for the test takers who are taking this test, where we know that the
average is 500, I need to know what score I need to get so that I will be in the 95th percentile. So, what I know
is that the area to the left of this score is .95, and I am asking what score would give me that area. And this can
be solved, and I'm going to show you both ways right here, with the function known as norm.inv or norm.s.inv.
Excel Sheet (2 of 9) - Slide 152
The slide shows a new cell called "Score at 95%" placed to the right of the bell-shaped curve. Here, the slide
intends to show how to calculate the NORM.INV function for the 95th percentile. Once written =NORM.INV, in
the cell next to the Score at 95% cell, the program automatically shows the syntax of this function, which is
“=NORM.INV(probability, mean, standard_dev).” The formula written for this function is =NORM.INV(.95, 500,
100), where 0.95 is the probability, 500 is the mean, and 100 is the standard deviation.
Transcript
So, let me show you both. What I want to know is the score at the 95th percentile, right? So, to answer this
question, I'm going to say norm.inv. It says that you're doing the inverse operation, so it returns the inverse of
the normal cumulative distribution. So, I'm going to press tab, and it asks for probability. So, it asks what
probability you are at; I'm at the 95th percentile. The mean for the test is 500, the standard deviation is 100, and
that's it. It has three arguments.
The slide shows the outcome of the score at 95%, which is 664.4854.
Download the SAT Example 4 excel file (Refer to Score - First worksheet)
Transcript
If I press return, I get a score of 664.48. And you know an SAT score is an integer value, so that means that I
have to get a score of 665 to be at the 95th percentile. So, this is one option for doing it. Let me show you the
way that we solved it in the slides for Lesson 6, and that will show you the other function that is available.
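The inverse lookup can be cross-checked outside Excel. In this minimal sketch, Python's `statistics.NormalDist.inv_cdf` is assumed as the counterpart of =NORM.INV(.95, 500, 100), and rounding up to a whole score follows the transcript's reasoning that scores are integers.

```python
import math
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=100)

cutoff = sat.inv_cdf(0.95)         # score with 95% of the area to its left
needed_score = math.ceil(cutoff)   # smallest whole-number score at or above the cutoff
print(round(cutoff, 4), needed_score)
```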
The slide shows a new cell called "z-score" placed a few cells below the Score at 95% cell. Here, the slide
intends to show how to calculate the NORM.S.INV function for the 95th percentile. Once written =NORM.S.INV,
in the cell next to the z-score cell, the program automatically shows the syntax of this function, which is
“=NORM.S.INV(probability).” The formula written for this function is =NORM.S.INV(.95).
Transcript
So here I want to know what the z-score is for the 95th percentile. And if I want to know the z-score, I'm going
to say norm.s.inv. Remember, the z-score is always about how many standard deviations away you are from the
mean, based on a standard normal distribution. So, all it asks for is the probability. It doesn't need to know the
mean. It doesn't need to know the standard deviation, because it's a standard normal. So, it knows the mean is 0
and the standard deviation is 1. So, I will put .95 here, and it says, roughly, you have to be about 1.645 standard
deviations above the mean.
The slide shows a new cell called "Score" placed below the z-score cell. Here, the slide intends to show how to
calculate the score value. The formula written for this function is =500+(H6*100), where 500 is the mean, H6 is
1.6448 (z-score), and 100 is the standard deviation.
Transcript
The slide shows the outcome of the score, which is 664.4854. This value is the same as the Score at 95% value.
Transcript
So, what does that mean for my score? It basically means that you start at 500, which is where you are if you're
at the mean, but you're above the mean. By how much? By this z value multiplied by the standard deviation, which
is 100. So, when I press return, I get the same value, 664.485, exactly the same value. So, does it matter
which one you use?
The slide shows a red square around the z-score and the score cells, along with the values calculated, with an
arrow pointing at the word NORM.S.INV written on the screenshot. Next to the Score at 95% value cell, the word
NORM.INV is also written.
The slide shows a red circle around the score value (664.4854) to indicate that the score obtained with
NORM.S.INV is the same as the one obtained with NORM.INV.
Download the SAT Example 4 excel file (Refer to Z_Score - Second worksheet)
Transcript
You will get exactly the same answers; however, if you use norm.inverse, you get the value immediately without
doing any other calculations yourself, whereas if you do it this way, then you have to take two steps:
we used norm.s.inverse to get the z-score, and then used it in the equation,
which gives us this value.
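The two routes can also be compared outside Excel. The following is a minimal sketch, assuming Python's standard-library statistics.NormalDist as a stand-in for NORM.INV and NORM.S.INV; the variable names are illustrative and are not part of the lesson's workbook.

```python
from statistics import NormalDist

MEAN, SD = 500, 100  # SAT section parameters used in the lesson

# One step, analogous to =NORM.INV(0.95, 500, 100)
score_direct = NormalDist(MEAN, SD).inv_cdf(0.95)

# Two steps, analogous to =NORM.S.INV(0.95) followed by =500+(z*100)
z = NormalDist().inv_cdf(0.95)
score_two_step = MEAN + z * SD

print(round(z, 4), round(score_direct, 4), round(score_two_step, 4))
```

Both routes should agree to within floating-point rounding, which mirrors the point made in the transcript.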
149
Lesson 2-6: Standard Normal Distribution Table
The standard normal distribution is a special case of the normal distribution that has a:
Mean of zero
Standard deviation of one
$p(x) = \dfrac{e^{-(x-\mu)^2 / (2\sigma^2)}}{\sigma\sqrt{2\pi}}$
where e = 2.71828 and π = 3.14159
In the slide there is a bell-shaped curved histogram. The curve spreads from −3 to 3 in the x-axis and from 0.0 to
0.4 in the y-axis for density. Its peak is 0.4 on the y-axis and 0 on the x-axis. Here, there is a vertical red line at
the x-axis equals to 0.
Transcript
The standard normal distribution is a special case of the normal distribution. It has a mean of zero and a
standard deviation of one. The density function for this graph is given by this equation, where e is the constant
2.71828, and pi is 3.14. And, in the case of a standard normal distribution, μ is 0, and σ is 1.
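As a check on the density function described above, here is a minimal sketch in Python. The function name normal_density is hypothetical; statistics.NormalDist is used only to compare against the hand-written formula.

```python
import math
from statistics import NormalDist

def normal_density(x, mu=0.0, sigma=1.0):
    """Normal density: exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))

# Height of the standard normal curve at its peak (x = 0)
peak = normal_density(0.0)
print(round(peak, 4))

# The hand-written formula should match the library's density
print(normal_density(1.0), NormalDist().pdf(1.0))
```

The peak comes out just under 0.4, consistent with the curve shown on the slide.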
150
Standard Normal Distribution (2 of 2) - Slide 161
$p(x) = \dfrac{e^{-(x-\mu)^2 / (2\sigma^2)}}{\sigma\sqrt{2\pi}}$
The slide shows a bell-shaped curved histogram. The curve spreads from −3 to 3 in the x-axis and from 0.0 to
0.4 in the y-axis for density. Its peak is 0.4 on the y-axis and 0 on the x-axis. The area under the curve from −0.7
to 0.7 on the x-axis is filled with blue color. This shaded area represents the probability of a random variable.
Transcript
The shaded area in this graph represents the probability of a random variable falling between two given values.
This probability is found by solving for this shaded area. The area under this curve has been solved for the
standard normal distribution, and we have access to it through a standard normal table also known as z-table.
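The shaded area on the slide, between −0.7 and 0.7, can be computed rather than read from a table. This is a minimal sketch, assuming Python's statistics.NormalDist; its cdf method plays the role of the z-table.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Area between two points = area to the left of the upper point
# minus area to the left of the lower point.
area = std_normal.cdf(0.7) - std_normal.cdf(-0.7)
print(round(area, 4))
```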
151
Standard normal table
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.50000 0.50399 0.50798 0.51197 0.51595 0.51994 0.52392 0.52790 0.53188 0.53586
0.1 0.53983 0.54380 0.54776 0.55172 0.55567 0.55966 0.56360 0.56749 0.57142 0.57535
0.2 0.57926 0.58317 0.58706 0.59095 0.59483 0.59871 0.60257 0.60642 0.61026 0.61409
0.3 0.61791 0.62172 0.62552 0.62930 0.63307 0.63683 0.64058 0.64431 0.64803 0.65173
0.4 0.65542 0.65910 0.66276 0.66640 0.67003 0.67364 0.67724 0.68082 0.68439 0.68793
0.5 0.69146 0.69497 0.69847 0.70194 0.70540 0.70884 0.71226 0.71566 0.71904 0.72240
0.6 0.72575 0.72907 0.73237 0.73565 0.73891 0.74215 0.74537 0.74857 0.75175 0.75490
0.7 0.75804 0.76115 0.76424 0.76730 0.77035 0.77337 0.77637 0.77935 0.78230 0.78524
0.8 0.78814 0.79103 0.79389 0.79673 0.79955 0.80234 0.80511 0.80785 0.81057 0.81327
0.9 0.81594 0.81859 0.82121 0.82381 0.82639 0.82894 0.83147 0.83398 0.83646 0.83891
1 0.84134 0.84375 0.84614 0.84849 0.85083 0.85314 0.85543 0.85769 0.85993 0.86214
1.1 0.86433 0.86650 0.86864 0.87076 0.87286 0.87493 0.87698 0.87900 0.88100 0.88298
1.2 0.88493 0.88686 0.88877 0.89065 0.89251 0.89435 0.89617 0.89796 0.89973 0.90147
1.3 0.90320 0.90490 0.90658 0.90824 0.90988 0.91149 0.91308 0.91466 0.91621 0.91774
1.4 0.91924 0.92073 0.92220 0.92364 0.92507 0.92647 0.92785 0.92922 0.93056 0.93189
1.5 0.93319 0.93448 0.93574 0.93699 0.93822 0.93943 0.94062 0.94179 0.94295 0.94408
1.6 0.94520 0.94630 0.94738 0.94845 0.94950 0.95053 0.95154 0.95254 0.95352 0.95449
1.7 0.95543 0.95637 0.95728 0.95818 0.95907 0.95994 0.96080 0.96164 0.96246 0.96327
1.8 0.96407 0.96485 0.96562 0.96638 0.96712 0.96784 0.96856 0.96926 0.96995 0.97062
1.9 0.97128 0.97193 0.97257 0.97320 0.97381 0.97441 0.97500 0.97558 0.97615 0.97670
2 0.97725 0.97778 0.97831 0.97882 0.97932 0.97982 0.98030 0.98077 0.98124 0.98169
2.1 0.98214 0.98257 0.98300 0.98341 0.98382 0.98422 0.98461 0.98500 0.98537 0.98574
2.2 0.98610 0.98645 0.98679 0.98713 0.98745 0.98778 0.98809 0.98840 0.98870 0.98899
2.3 0.98928 0.98956 0.98983 0.99010 0.99036 0.99061 0.99086 0.99111 0.99134 0.99158
2.4 0.99180 0.99202 0.99224 0.99245 0.99266 0.99286 0.99305 0.99324 0.99343 0.99361
2.5 0.99379 0.99396 0.99413 0.99430 0.99446 0.99461 0.99477 0.99492 0.99506 0.99520
2.6 0.99534 0.99547 0.99560 0.99573 0.99585 0.99598 0.99609 0.99621 0.99632 0.99643
2.7 0.99653 0.99664 0.99674 0.99683 0.99693 0.99702 0.99711 0.99720 0.99728 0.99736
2.8 0.99744 0.99752 0.99760 0.99767 0.99774 0.99781 0.99788 0.99795 0.99801 0.99807
2.9 0.99813 0.99819 0.99825 0.99831 0.99836 0.99841 0.99846 0.99851 0.99856 0.99861
3 0.99865 0.99869 0.99874 0.99878 0.99882 0.99886 0.99889 0.99893 0.99896 0.99900
153
The slide shows a bell-shaped curved histogram. The curve spreads from −3 to 3 in the x-axis with a mean of 0.
The area of the curve at the left of 0.9 on the x-axis is filled with blue color. This shaded area represents the
probability of a random variable of x less than a certain value.
Transcript
Here is an example of z. Although, in this class, we will be working on large datasets using Excel, it is still
important that you understand the concept of the normal table and know how to read it. Normal tables come in
several layouts. To know what is being displayed, you have to look at the legend. For this table, the
value represents the probability of being less than z, so, basically, everything to the left of z. Using this table,
we will use the z-score to answer questions for any normal distribution. I will use this table to show you how to
answer some possible research questions. To learn how to use Excel, please watch the Excel videos.
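To make the table layout concrete, the sketch below reproduces single cells of the standard normal table. It assumes Python's statistics.NormalDist; the helper name table_value is hypothetical. The row heading supplies z to one decimal place and the column heading supplies the second decimal place.

```python
from statistics import NormalDist

def table_value(row, column):
    """Cumulative probability P(Z <= row + column), rounded to 5 places like the table."""
    return round(NormalDist().cdf(row + column), 5)

# Row 0.0, column 0.00: half of the area lies to the left of the mean
print(table_value(0.0, 0.00))

# Row 0.9, column 0.00: the area to the left of z = 0.9
print(table_value(0.9, 0.00))
```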
The previous version of the standardized test SAT that high school students take had been designed so that
each of its sections (for example, math) has a mean of 500 and a standard deviation of 100. If someone
scores a 635, how well has this person done?
z = (value of interest − average of all values) ÷ standard deviation of all values = (635 − 500) ÷ 100 = 1.35
Transcript
The previous version of the standardized test, SAT, which high school students take, was designed so that
each of its sections, for example, math, has a mean of 500 and a standard deviation of 100. Let's say that
someone you know gets a score of 635. How well has this person done? First, we need to find the z-score. This
will tell us how many standard deviations this person is from the mean, and that is 635 − 500 divided by the
standard deviation of 100. This person's test score is above the mean by 1.35 standard deviations. This is the
person's z-score, and we can now see in what percentile this person will fall. So, let's look at the normal table to
find this value.
154
Example - Solved (1 of 3) - Slide 164
P(z ≤ 1.35)
The slide shows the table presented in Slide 162 Standard Normal Table. Here, row 1.3 and column 0.05 are
selected. The cross number between the row and column is 0.9115, which is the probability for 1.35.
Transcript
For this particular person, who is 1.35 standard deviations above the mean, to find the proportion of test-takers
who have scores less than 635, we are looking for P of z less than or equal to 1.35. Now let's go to the table. For
z of 1.35, first, we go down to the row that has the value of 1.3 and then go to the column which has the heading of 0.05.
The slide shows the table presented in Slide 162 Standard Normal Table. Row 1.3 and column 0.05 are selected,
and the cross number between the row and column is 0.9115. Here, this number is highlighted in yellow.
155
Transcript
The value from the table is 0.9115, which is the probability of getting a score less than 635.
Z = 1.35
P(z ≤ 1.35)
This slide shows the table used in Slide 162 Standard Normal Table with the number 0.9115 highlighted and the
respective standard normal distribution curve for cumulative probability. The bell-shaped curved histogram
spreads from −3 to 3 in the x-axis and from 0.0 to 0.4 in the y-axis. Its peak is at 0.4 on the y-axis and 0 on the x-
axis. The area of the curve at the left of 1.35 on the x-axis is filled with blue color.
Transcript
The shaded area under the curve represents the probability of being less than or equal to 1.35 standard
deviations above the mean, so this particular test-taker is at about the 91st percentile. As you can see, the
z-score, in this case 1.35, is placing this particular test-taker among all the people who took this test.
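The z-score and percentile for a score of 635 can be reproduced in a few lines. This is a minimal sketch, assuming Python's statistics.NormalDist in place of the table lookup; the variable names are illustrative.

```python
from statistics import NormalDist

mean, sd, score = 500, 100, 635

# z = (value of interest - mean) / standard deviation
z = (score - mean) / sd

# Area to the left of z on the standard normal curve
p_below = NormalDist().cdf(z)
print(z, round(p_below, 4))
```

The result should agree with the table value of 0.9115 to four decimal places.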
156
Example - Continued - Slide 167
The slide shows a bell-shaped curved histogram, which spreads from −3 to 3 in the x-axis and from 0.0 to 0.4 in
the y-axis. Its peak is at 0.4 on the y-axis and 0 on the x-axis. The area of the curve at the left of 1.35 on the x-
axis is filled with blue color. The shaded area is marked as P(z ≤ 1.35) and the rest area is marked as P(z ≥
1.35).
Transcript
One might ask the question in the other direction. That is, "What is the probability of getting a score greater than
635?" The value of interest, the test score, is still 635, but now we are looking for the probability of finding
people who will score higher than this value. This is simply 1 minus the probability of getting a score less than or
equal to 635. This is because the total area under the curve is equal to 1. The blue, shaded area represents the
probability of z being less than or equal to 1.35, and the white area is the complement, which is the probability of z
being greater than or equal to 1.35 for all those test-takers that score higher than 635. In this case, it is 1 −
0.9115 or about 0.0885.
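The complement rule can be sketched the same way, again assuming Python's statistics.NormalDist as a substitute for the table; the variable names are illustrative.

```python
from statistics import NormalDist

p_below = NormalDist().cdf(1.35)   # area to the left of z = 1.35
p_above = 1 - p_below              # total area under the curve is 1
print(round(p_above, 4))
```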
157
Negative Z-Score (1 of 2) - Slide 168
X = 458
The slide shows a bell-shaped curved histogram, which spreads from −3 to 3 in the x-axis and from 0.0 to 0.4 in
the y-axis. Its peak is at 0.4 on the y-axis and 0 on the x-axis. The area of the curve at the left of −0.42 on the x-
axis is filled with blue color.
Transcript
Now, let's see how we can use the table if you get a negative z-score. Using the same example, our test-taker
now gets a score of 458 on a test, which has a mean of 500 and standard deviation of 100. The z-score for this
test-taker is a negative 0.42. The negative sign means that this test-taker is to the left of the mean, below
average, as shown in the graph. The blue, shaded area represents the scores below 458, and the white, the
larger area, shows all those who scored higher, so what percentile would this test-taker be?
158
z = −0.42 P(z ≤ −0.42)
The slide shows two bell-shaped curves of standard normal distribution. Both curves spread from −3 to 3 in the
x-axis and from 0.0 to 0.4 in the y-axis. Its peak is at 0.4 on the y-axis and 0 on the x-axis. The curve on the left,
has a shaded area left to −0.42 and it is marked as P(z ≤ −0.42). The curve on the right, has a shaded area right
to 0.42 and it is marked as P(z ≥ 0.42).
Transcript
The table that I'm using has all positive values, but knowing the properties of the normal distribution, we can
work out the probability for a negative z-value just as well. Remember that one of the properties of the normal
distribution is that it's symmetrical, which implies that the probability of finding a test score less than or equal
to 0.42 standard deviations below the mean is the same as the probability of finding a test score greater than
or equal to 0.42 standard deviations above the mean. Here's the probability of finding a score less than or equal
to negative 0.42 standard deviations from the mean, and here's the probability of finding a test score higher
than positive 0.42 standard deviations above the mean. The two shaded areas are exactly the same due to the
symmetry of the normal curve. We know how to use the table for positive z, so let's look at the table.
The slide shows a table and its respective graph. The table is the one presented in Slide 162 Standard Normal
Table. Here, row 0.4 and column 0.02 are selected. The cross number between the row and column is 0.6628,
which is the probability for 0.42. The graph is the standard normal distribution with cumulative probability from the
table. The bell-shaped curved histogram spreads from −3 to 3 in the x-axis and from 0.0 to 0.4 in the y-axis. Its
peak is at 0.4 on the y-axis and 0 on the x-axis. The area of the curve at the right of 0.42 on the x-axis is filled
with blue color. The shaded area is marked as P(z ≥ 0.42) and the rest area is marked as P(z ≤ 0.42).
159
Transcript
Here's the part of the table that pertains to our positive z of 0.42. We first go to the row with the heading of 0.4
and then go to the column with the heading of 0.02, and where these two intersect represents the probability of z
being less than or equal to 0.42, which is 0.6628. This probability represents the white area under the curve.
In other words, the probability of z being less than or equal
to 0.42 is 0.6628. Thus, the probability of z being equal to or greater than 0.42 is then 1 − 0.6628 or 0.3372, and
that is the blue, shaded area.
The slide shows two bell-shaped curves of standard normal distribution. Both curves spread from −3 to 3 in the
x-axis and from 0.0 to 0.4 in the y-axis. Its peak is at 0.4 on the y-axis and 0 on the x-axis. The curve on the left,
has a shaded area left to −0.42 and it is marked as P(z ≤ −0.42). The curve on the right, has a shaded area right
to 0.42 and it is marked as P(z ≥ 0.42). This curve has a red arrow pointing to the left of 0.42.
Transcript
So, we just found that the probability of z being equal to or greater than 0.42 is 0.3372. Also, because of
symmetry, we know that the probability of z being less than or equal to negative 0.42 is the same as the
probability of z being more than positive 0.42, which means 33.72% of test-takers have scores below 458, while
66.28% have higher scores. So, whenever you have a negative z-value, find the probability that corresponds to
the positive z. What you get from the table is the area to the left of the positive value. To get the blue shaded
area, subtract that value from 1. We did this for our example to get 0.3372.
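The symmetry argument for a negative z-score can be checked numerically. This is a minimal sketch, assuming Python's statistics.NormalDist; unlike the printed table, its cdf accepts negative values directly, so both routes can be compared.

```python
from statistics import NormalDist

std_normal = NormalDist()
z = (458 - 500) / 100  # negative z-score for a score of 458

# Table method: look up the positive z, then subtract from 1
via_symmetry = 1 - std_normal.cdf(abs(z))

# Direct method: area to the left of the negative z
direct = std_normal.cdf(z)

print(round(via_symmetry, 4), round(direct, 4))
```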
160
Probability Between Two Values (1 of 5) - Slide 172
For the SAT exam, what is the probability that a randomly selected test has a score between 490 and 550?
P(490 ≤ x ≤ 550)
P(−0.1 ≤ z ≤ 0.5)
The slide shows a bell-shaped curved histogram, which spreads from −3 to 3 in the x-axis and from 0.0 to 0.4 in
the y-axis. Its peak is at 0.4 on the y-axis and 0 on the x-axis. The area of the curve from −0.1 to 0.5 is filled with
blue color.
Transcript
Now let's see how we can use the table to find the probability of a value falling between two points. Here we are
going to look at the probability of finding a score between 490 and 550, given that the population mean is 500
and the standard deviation is 100. So, we first will find the z-scores for the two points, and, thus, we're looking
for the probability of z being between negative 0.1 and positive 0.5, and that would be this blue, shaded area
under the curve.
161
The slide shows three bell-shaped curves of standard normal distribution. The three curves spread from −3 to 3 in
the x-axis and from 0.0 to 0.4 in the y-axis. Their peak is at 0.4 on the y-axis and 0 on the x-axis. The curve on
the left has a shaded area to the left of 0.5 and it is marked as P(z ≤ 0.5). The curve in the middle has a shaded
area to the left of −0.1 and it is marked as P(z ≤ −0.1). The curve on the right has a shaded area from −0.1 to 0.5
and it is marked as P(−0.1 ≤ z ≤ 0.5). The area shaded in the left curve minus the area shaded in the middle
curve equals the area shaded in the right curve.
Transcript
To find the probability of z being between negative 0.1 and positive 0.5, we first need to find the probability of z
being less than or equal to 0.5, then subtract from it probability of z being less than or equal to negative 0.1.
The slide shows the table presented in Slide 162 Standard Normal Table. In the table, the row 0.5 and column
0.00 are selected. The cross number between the row and column is 0.6915, which is highlighted. To the left of
the table, there is the respective bell-shaped curve of standard normal distribution for this value. This curve
spreads from −3 to 3 in the x-axis and from 0.0 to 0.4 in the y-axis. Its peak is at 0.4 on the y-axis and 0 on the x-
axis. The area of the curve left to 0.5 is filled with blue color.
Transcript
The probability of z being less than or equal to 0.5 comes from the table at the row with the heading 0.5 and the
column 0.00, which gives you 0.6915.
162
Probability Between Two Values (4 of 5) - Slide 175
The slide shows the table presented in Slide 162 Standard Normal Table. In the table, the row 0.1 and column
0.00 are selected. The cross number between the row and column is 0.5398, which is highlighted. To the left of
the table, there is the respective bell-shaped curve of standard normal distribution for this value. This curve
spreads from −3 to 3 in the x-axis and from 0.0 to 0.4 in the y-axis. Its peak is at 0.4 on the y-axis and 0 on the x-
axis. The area of the curve left to −0.1 is filled with blue color.
Transcript
The probability of z being less than or equal to negative 0.1 can be solved the same way that we solved for negative
z-values. Read the probability from the table as if the z is a positive score, thus 0.1. Go to the row heading of 0.1
and column heading of 0.00, and we see that we get the value of 0.5398. But recall that for a negative z-value,
the probability that we read from the table is actually for the white, unshaded area. Therefore, the area of
interest to us, the blue, shaded area, is 1 − 0.5398 or 0.4602.
The slide shows three bell-shaped curves of standard normal distribution. The three curves spread from −3 to 3
in the x-axis and from 0.0 to 0.4 in the y-axis. Their peak is at 0.4 on the y-axis and 0 on the x-axis. The curve on
the left, has a shaded area to the left of 0.5 and it is marked as P(z ≤ 0.5) = 0.6915. The curve in the middle, has
a shaded area to the left of −0.1 and it is marked as P(z ≤ −0.1) = 0.4602. The curve on the right, has a shaded
area from −0.1 to 0.5 and it is marked as P(−0.1 ≤ z ≤ 0.5) = 0.6915 − 0.4602 = 0.2313. This means that the area
shaded in the left curve minus the area shaded in the middle curve equals to the area shaded in the right curve.
Transcript
Now we are ready to calculate the probability of z being between negative 0.1 and positive 0.5, and that is
0.6915 − 0.4602, which is 0.2313.
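The "between two values" calculation can be wrapped in a small helper. This is a minimal sketch, assuming Python's statistics.NormalDist; the function name prob_between is hypothetical and works directly on test scores rather than on z-scores.

```python
from statistics import NormalDist

def prob_between(low, high, mean, sd):
    """P(low <= X <= high) for a normal variable with the given mean and sd."""
    dist = NormalDist(mean, sd)
    # Area left of the upper value minus area left of the lower value
    return dist.cdf(high) - dist.cdf(low)

p = prob_between(490, 550, 500, 100)
print(round(p, 4))
```

The result should match the table-based answer of 0.2313 to four decimal places.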
What is the probability of finding someone who has a score between 600 and 650?
The slide shows a closeup of the table presented in Slide 162 Standard Normal Table. Here, the rows that range
from 0.00 to 1.5, and the columns that range from 0.00 to 0.07 are shown.
Transcript
Now let's practice. Given the average score of 500 and standard deviation 100, what is the probability of finding
someone who has scored between 600 and 650? The values you need are in this partial normal table that is
displayed here.
164
Let's Practice - Solved - Slide 178
P(600 ≤ x ≤ 650) = ?
The slide shows the table presented in Slide 162 Standard Normal Table. In the table, the cross numbers
between row 1 and column 0.00 is 0.8413 and between row 1.5 and column 0.00 is 0.9332. These numbers are
highlighted in red and green, respectively.
Transcript
To answer this question, we will follow the same principle as we just learned. Find the z-value for 600 and 650,
which will be 1 and 1.5. Then, using the normal table, we can find the probability of z being less than or equal to
1.5 to be 0.9332, we highlighted here in green, and probability of z being less than or equal to 1 to be 0.8413,
highlighted in pink. Subtracting these two gives us 0.0919, thus probability of finding someone who scores
between 600 and 650 is 0.0919.
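The practice problem can be verified with the same subtraction, this time on the z-scores. This is a minimal sketch, assuming Python's statistics.NormalDist. Note that the table answer of 0.0919 subtracts two values already rounded to four decimal places, so an unrounded calculation may differ in the last digit.

```python
from statistics import NormalDist

mean, sd = 500, 100
std_normal = NormalDist()

z_low = (600 - mean) / sd    # 1.0
z_high = (650 - mean) / sd   # 1.5

# Unrounded areas, then their difference
p_exact = std_normal.cdf(z_high) - std_normal.cdf(z_low)

# Table-style calculation: round each area to 4 places before subtracting
p_table = round(std_normal.cdf(z_high), 4) - round(std_normal.cdf(z_low), 4)

print(round(p_exact, 5), round(p_table, 4))
```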
165
What test score will place you in the top 95th percentile?
P(x ≤ X) = 0.95
The slide shows a bell-shaped curved histogram, which spreads from −3 to 3 in the x-axis and from 0.0 to 0.4 in
the y-axis. Its peak is at 0.4 on the y-axis and 0 on the x-axis. The area of the curve at the left of 1.645 on the x-
axis is filled with blue color.
Transcript
Sometimes you know the percentile that you're interested in but wonder how many standard deviations from the
mean would give you that percentile. Let's say that we would like to be in the 95th percentile for the test and
want to know what score would give us that. Given that the average of the test is 500, and
that the standard deviation is 100, we are looking for a test score for which the probability of x
being less than or equal to this value is 0.95, which is this. To answer this question, we have to solve for z, where
the area to the left of it is 0.95.
P(x ≤ X) = 0.95
The slide shows the table presented in Slide 162 Standard Normal Table. In the table, the cross numbers
between row 1.6 and column 0.04 is 0.9495 and between row 1.6 and column 0.05 is 0.9505. These numbers
are highlighted. To the left of the table, there is the respective bell-shaped curve of standard normal distribution
for these values. The curve spreads from −3 to 3 in the x-axis and from 0.0 to 0.4 in the y-axis. Its peak is at 0.4
on the y-axis and 0 at the x-axis. The area of the curve left to 1.645 is filled with blue color.
What is the Z Value? (3 of 4) - Slide 181
The slide shows the table presented in Slide 162 Standard Normal Table. This table is presented twice. The table
on the left, shows the cross number between row 1.6 and column 0.04 (0.9495) highlighted. This table is marked
as z = 1.64. The table on the right, shows the cross number between row 1.6 and column 0.05 (0.9505)
highlighted. This table is marked as z = 1.65. In the middle of the tables it says z = 1.645.
Transcript
To solve for z, we can look at the normal table and look for the value that comes closest to 0.95. Here we find two
values, 0.9495 and 0.9505, and the value of 0.95 is right in the middle of these two values, so we can interpolate
for the value of z. The z-value for 0.9495 is 1.64 and for 0.9505 is 1.65. Therefore, for 0.95, we can interpolate that
the z is 1.645.
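The step of reading a z-value that lies between two table entries is a linear interpolation. This is a minimal sketch in Python; the helper name interpolate_z is hypothetical, and statistics.NormalDist is used only to compare against an unrounded inverse.

```python
from statistics import NormalDist

def interpolate_z(target, z_low, p_low, z_high, p_high):
    """Linearly interpolate the z-value whose cumulative probability is target."""
    fraction = (target - p_low) / (p_high - p_low)
    return z_low + fraction * (z_high - z_low)

# Table entries from the slide: P(z <= 1.64) = 0.9495 and P(z <= 1.65) = 0.9505
z_table = interpolate_z(0.95, 1.64, 0.9495, 1.65, 0.9505)

# Unrounded value for comparison, analogous to =NORM.S.INV(0.95)
z_exact = NormalDist().inv_cdf(0.95)

print(round(z_table, 4), round(z_exact, 4))
```

Because 0.95 sits halfway between the two table entries, the interpolated z lands halfway between 1.64 and 1.65.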
What test score will place you in the top 95th percentile?
P(x ≤ X) = 0.95
x = 500 + (1.645 × 100) = 664.5
The slide shows a bell-shaped curved histogram, which spreads from −3 to 3 in the x-axis and from 0.0 to 0.4 in
the y-axis. Its peak is at 0.4 on the y-axis and 0 on the x-axis. The area of the curve at the left of 1.645 on the x-
axis is filled with blue color. The function standard deviation "z = 1.645" is below the x-axis and has an arrow
pointing to the 1.645 value on the x-axis.
Transcript
So now we know that to be in the 95th percentile, the test score must be 1.645 standard deviations
above the mean. And that is simply the mean, 500, plus 1.645 times the standard deviation of 100, which will be
a score of 664.5. Thus, anyone with this score will be at the 95th percentile among all test-takers. In this lesson,
you learned how to use the standard normal distribution to answer questions about any normal distribution. By
using the table, we don't have to actually solve for the area under the curve, and that is a big timesaver.
168