MAT 211 CourseGuide - Lecture Notes - Spring - 2022
Offered by
Department of Physical Sciences
School of Engineering, Technology and Sciences
Important definitions
Data, elements, variable, observations, raw data, qualitative data, quantitative data, scales of measurement
population, random sample, census, sample survey, cross-sectional data, time-series data, Computer and
statistical analysis, glossary.
Textbook: Anderson D.R., Sweeney, D.J. and Thomas A.W. (2011), Statistics for Business and Economics
(11th Edition), South-Western, A Division of Thomson Learning.
Raw data – Data collected by survey, census, etc.; also known as ungrouped data.
Note: We always start with raw data. We have to process it or make data summaries using various statistical
techniques (we will learn these in Chapters 2-3).
Random sample: A subset of the target population. The selected set will vary from draw to draw.
Sample survey: Method for collecting data about the random sample.
For the purpose of statistical analysis, distinguishing between time-series data and cross-sectional data is
meaningful.
Time series data – Data collected over several time periods, for example exchange rates, interest rates,
gross national product (GNP), gross domestic product (GDP) and many others. These sorts of data are
meaningful with respect to time.
Note that in this course most of the data will be considered as cross-sectional data.
Statistical analyses generally involve a large amount of data, which is why analysis frequently uses computer
software. Several very useful packages are available: Minitab, the R language, MATLAB, Excel, Stata,
SPSS, and many others.
Scales of measurement
Before analysis, the scale of the selected variables has to be defined, especially when we do our analysis
with statistical packages such as the R language, SPSS, Minitab, Stata, or even Excel. We have to assign a
scale to each of the variables involved in our analysis.
There are four kinds of scale: nominal, ordinal, interval, and ratio
Nominal scale – Variables like name, ID, address and cell # are on this scale. Arithmetic analysis is not
possible.
Ordinal scale – Qualitative data like test performance (excellent, good, poor, etc.) or quality of food (good
or bad). Ordering is possible, so some analysis is possible.
Interval scale – Has the properties of ordinal data, and intervals between values are meaningful. Example:
scores for 5 students. The ordinal concept applies, and the difference between any two students' scores is
meaningful.
Ratio scale – Has the properties of interval data; in addition, ratios of the data values are meaningful.
Example: scores for 5 students. The interval concept applies, and the ratio of any two students' scores is
meaningful.
HW: Text
Aim of presentation of raw data; Tabular form of raw data (e.g. summarizing qualitative and quantitative
data).
The aim of the presentation of raw data is to turn a large and complicated set of raw data into a more
compact and meaningful form. Usually, one can summarize the raw data
(a) in tabular form,
(b) in graphical form, and
(c) finally numerically, such as by measures of central tendency, measures of dispersion and others.
Under the tabular and the graphical form, we will learn frequency distribution (grouping data), bar graphs,
histograms, stem-leaf display method and others.
Presentation of data can be found in annual reports, newspaper articles and research studies. Everyone is
exposed to those types of presentations. Hence, it is important to understand how they are prepared and
how they should be interpreted.
The plan of this lecture is to introduce the tabular methods, which are commonly used to summarize both
the qualitative and the quantitative data.
Summary
There are 6 students whose performance is excellent, 5 students whose performance is good, and so on.
In percentages, 40% of students are excellent, 33% are good, and so on.
[Figures: bar graph of frequencies and pie chart of test performances — Excellent 40%, Good 33%, Poor 27%]
Data Summary: Our analysis shows that test performances were observed excellent 40%, good 33% and poor
27%.
90 88 78
87 69 93
56 78 57
67 85 46
95 59 89
We know these data are quantitative. Processing these kinds of data differs a little from processing
qualitative data. Proceed as follows:
Find the lowest and highest values of the given raw data set. Here L = 46 and H = 95. Assume the # of
classes K = 5. Thus, the size of each class is c = (H − L)/K = (95 − 46)/5 = 9.8, rounded up to 10.
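The class width and frequency counts can be checked with a short Python sketch. The boundary convention (upper limits included, first class closed on both ends) is an assumption chosen to match the frequency table used later in these notes:

```python
import math

# Raw test scores from the lecture (n = 15)
scores = [90, 88, 78, 87, 69, 93, 56, 78, 57, 67, 85, 46, 95, 59, 89]

L, H = min(scores), max(scores)            # L = 46, H = 95
K = 5                                      # number of classes
c = math.ceil((H - L) / K)                 # (95 - 46)/5 = 9.8, rounded up to 10

edges = [L + i * c for i in range(K + 1)]  # 46, 56, 66, 76, 86, 96
counts = [0] * K
for x in scores:
    for j in range(K):
        # first class includes its lower limit; every class includes its upper limit
        low_ok = x >= edges[j] if j == 0 else x > edges[j]
        if low_ok and x <= edges[j + 1]:
            counts[j] += 1
            break

print(c, counts)   # 10 [2, 2, 2, 3, 6]
```

The counts reproduce the frequency table used for this data set later in the notes.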
Tabular Summary
Summary
There are 6 students who scored 86 to 96, 3 students who scored 76 to 86, and so on. 40% of students
scored 86 to 96, 20% scored 76 to 86, and so on. 6 students scored below 76, 4 students scored below 66,
and so on.
HW: Text
Recall the table data from Lecture 2; the frequency table above underlies the two charts below.
[Figures: histogram of scores by class (46-56, 56-66, 66-76, 76-86, 86-96) and cumulative frequency ogive of the same data]
Data Summary
Our analysis shows that 6 students' scores were observed between 86 and 96, 2 students' scores between
46 and 56, and so on.
9 students' scores were observed below 86, 6 students' scores were observed below 76, and so on.
Line chart - Time plots of the stock indices (We need a time series data)
Stem-and-leaf display of the 15 test scores:
Stem | Leaf
4    | 6
5    | 679
6    | 79
7    | 88
8    | 5789
9    | 035
Total n = 15
Summary: There are 4 students whose scores range from 85 to 89, and so on.
HW: Text
So far we have focused on tabular and graphical methods for one variable at a time. Often we need tabular
and graphical summaries for two variables at a time.
The tabular method (cross-tabulation) and the graphical method (scatter diagram) are two such methods for
drawing conclusions from two qualitative and/or quantitative variables.
Tabular Method-Cross-tabulation:
Problem-1:
Consider the following two variables: Quality rating and meal price($) for 10 restaurants. Data are as
follows:
Quality rating: good, very good, good, excellent, very good, good, very good, very good, very good,
good
Meal price ($): 18,22,28,38,33,28,19,11,23,13.
Make a tabular summary (cross-table) and a data summary.
Y = Meal price ($)
X = Quality   10-20    20-30    30-40    Total
Good          || (2)   || (2)    (0)       4
Very good     || (2)   || (2)   | (1)      5
Excellent      (0)      (0)     | (1)      1
Total           4        4        2      n = 10
Data summary:
We see that there are 2 restaurants whose food quality is very good and whose meal prices range from $20 to
$30, 1 restaurant whose food quality is excellent, 4 restaurants whose meal prices range from $10 to $20, and so on.
When reading a scatter diagram, look for:
Strength
Shape – linear, curved, etc.
Direction – positive or negative
Presence of outliers
Problem -2:
Now consider the following two variables: # of commercials and total sales for 5 sound equipment stores.
Data are as follows:
Data summary: A positive relationship exists between the # of commercials and total sales for the 5 sound
equipment stores.
[Scatter diagram: Sales (10-60) on the vertical axis vs # of commercials (0-6) on the horizontal axis]
Figure: Scatter diagram of Sales and # of commercials for 5 sound equipment stores
HW: Text
Ex: 31, 33-36, pp.60-61
We will learn several numerical measures that provide a data summary using numeric formulas.
(1) Measures of average: simple mean, weighted mean, median, mode, quartiles, percentiles
Definition of average: It is a single central value that represents the whole set of data. Different
measures of averages are: simple mean, weighted mean, median, mode, quartiles, percentiles.
We will learn the above measures for the raw data and grouped data.
For example, for a set of monthly starting salaries of 5 graduates: 3450, 3550, 3550, 3480, 3355.
Define X - monthly starting salaries of 5 graduates. Here X̄ = (Σᵢ Xᵢ)/n = 17385/5 = 3477.
Data Summary:
Mean = 3477 means that, on average, graduates' monthly starting salary is about $3477.
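A quick check of this mean, as a minimal Python sketch:

```python
# Monthly starting salaries of the 5 graduates (from the lecture)
salaries = [3450, 3550, 3550, 3480, 3355]
mean = sum(salaries) / len(salaries)
print(mean)   # 3477.0
```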
75th percentile is known as 3rd quartile and it is denoted by p75. p50 is also known as 2nd quartile (Q2).
Thus, there are 3 quartiles: These are p25 (Q1), p50 (Q2) and p75(Q3).
Calculation of percentiles:
Consider the following data: 3450, 3550, 3550, 3480, 3355, 3490
Sort the data to calculate percentiles: 3355 3450 3480 3490 3550 3550
For Q2: i = (pn)/100 = (50×6)/100 = 3. Since i is an integer, Q2 is the average of the 3rd and 4th observations
of the sorted data. Thus, Q2 = (3480+3490)/2 = 3485.
For Q3: i = (pn)/100 = (75×6)/100 = 4.5. Rounding up gives position 5. Thus, Q3 = 3550.
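The percentile rule used above (compute i = pn/100; if i is an integer, average positions i and i+1, otherwise round up) can be written as a small function. Note this is the textbook rule, not the interpolation that libraries such as NumPy use by default:

```python
import math

def percentile(data, p):
    """Textbook percentile rule: index i = p*n/100."""
    s = sorted(data)
    i = p * len(s) / 100
    if i == int(i):                   # integer index: average positions i and i+1
        i = int(i)
        return (s[i - 1] + s[i]) / 2
    return s[math.ceil(i) - 1]        # otherwise round up to the next position

salaries = [3450, 3550, 3550, 3480, 3355, 3490]
print(percentile(salaries, 25),       # Q1 = 3450
      percentile(salaries, 50),       # Q2 (median) = 3485.0
      percentile(salaries, 75))       # Q3 = 3550
```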
Data Summary:
Median = 3485 means that 50% of graduates' monthly starting salaries are below $3485 and the
remaining 50% are above $3485.
First quartile = 3450 means that 25% of graduates' monthly starting salaries are below $3450 and
the remaining 75% are above $3450.
Third quartile = 3550 means that 75% of graduates' monthly starting salaries are below $3550
and the remaining 25% are above $3550.
Mode: It is the value that occurs with greatest frequency. Denoted by M0.
(2) 3450, 3550, 3550, 3480, 3450 - here M0 is both 3450 and 3550 (each occurs twice).
Data Summary:
Mode = 3450 means that the most common monthly starting salary among graduates is $3450.
Recall the concept of average (Ref. Lecture 4). For example, suppose we have the following 2 sets of
raw data:
Ask a question - does any difference exist between each observation and the average value?
Suppose X is the score on CT1 (class test 1) and, for example, the calculated average score is 15.
The next investigation is to see the difference between each student's marks and the average marks.
If the difference is 0, the student's score and the average score are the same.
If the difference has a positive (negative) sign, the student's score is greater (lower) than
the average score.
How can we measure the variation of a data set? Various measures (or formulas) are available to detect
variation. These are:
1. Range, R = H − L, where H is the highest and L the lowest value of the data set
2. Inter-quartile range, IQR = p75 − p25, where p75 and p25 are the 75th and 25th percentiles
3. Variance (denoted σₓ²), calculated by σₓ² = (1/(n−1)) Σᵢ(xᵢ − x̄)².
4. *Standard Deviation (denoted σₓ), calculated by
σₓ = sqrt((1/(n−1)) Σᵢ(xᵢ − x̄)²). That means σₓ = SD = sqrt(variance).
Note: Measures of variation cannot be negative. The minimum is 0, which indicates that all students got
the same score.
X (𝑥𝑖 − 𝑥̅ )2
3450 729
3550 5329
3550 5329
3480 9
3355 14884
Here variance σₓ² = (1/(n−1)) Σᵢ(xᵢ − x̄)² = 26280/4 = 6570 and SD = sqrt(variance) = sqrt(6570) ≈ 81.06$.
Note: Variance is hard to interpret because its unit is squared. For example, if mean = $3477
then variance = 6570 $². Taking the square root of the variance removes this problem (returning to the
original unit of the data); the result is the standard deviation (SD).
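The variance and SD computation above can be checked with a short sketch:

```python
import math

salaries = [3450, 3550, 3550, 3480, 3355]
n = len(salaries)
mean = sum(salaries) / n                                # 3477.0
var = sum((x - mean) ** 2 for x in salaries) / (n - 1)  # 26280/4 = 6570.0
sd = math.sqrt(var)                                     # about 81.06
print(var, round(sd, 2))
```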
Coefficient of variation: CV = (SD/mean) × 100%. It expresses the SD as a percentage of the mean, which
allows comparison of variation across data sets measured in different units.
Requirements:
Syllabus:
1) Lecture 1- Lecture 5
2) Related Text book questions
/Good Luck /
So far we have focused on calculating all measures of average and variation for ungrouped (raw) data. Sometimes
only grouped data (a frequency table) is available. In this situation, the ungrouped (raw data) formulas do not apply
directly. Proceed as follows:
X       Frequency (# of students) fᵢ, i = 1, 2, …, 5
46-56 2
56-66 2
66-76 2
76-86 3
86-96 6
Total n =15
Data Summary: SD = 15.02 indicates that students' scores vary by about 15 points around the average score of 77.
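The grouped-data mean and SD quoted here can be reproduced from the class midpoints (midpoint = (LCL + UCL)/2, giving 51, 61, 71, 81, 91) and the frequencies in the table:

```python
import math

# Grouped scores: class midpoints and frequencies from the table above
mids  = [51, 61, 71, 81, 91]
freqs = [2, 2, 2, 3, 6]
n = sum(freqs)                                                        # 15

mean = sum(f * m for f, m in zip(freqs, mids)) / n                    # 77.0
var  = sum(f * (m - mean) ** 2 for f, m in zip(freqs, mids)) / (n - 1)
sd   = math.sqrt(var)                                                 # about 15.02
print(mean, round(sd, 2))
```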
We can get a general impression of skewness by drawing a histogram. To understand the concept of
skewness, consider the following 3 histograms:
There are two types of skewness. These are (1) positively skewed or skewed to the right (2) negatively
skewed or skewed to the left.
Note that the normal/symmetric frequency curve is known as non-skewed curve (skewness is absent).
Definition: It gives us idea about the direction of variation of a raw data set.
Figure 1 shows that most students have poor performance: most students score below the
average value.
Figure 2 shows that most students have average performance: most students score near (a little above or
below) the average value.
Figure 3 shows that most students have good performance: most students score above the
average value.
Measure of skewness
To detect whether skewness is present or not in a set of raw data, we will use the most commonly used
formula, Karl Pearson's coefficient of skewness (Pearson is often called the father of modern statistics). It is defined as
SK = 3(mean-median)/SD
Suppose X – test score. Let mean = 15, median (50th percentile or 2nd quartile) = 17 and SD =3.
Here SK = -2.00.
Data summary: SK = -2.00 means that the test scores are negatively skewed: most students
score above the mean of 15.
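The calculation above, as a one-line sketch:

```python
def pearson_skewness(mean, median, sd):
    # Karl Pearson's coefficient: SK = 3(mean - median)/SD
    return 3 * (mean - median) / sd

print(pearson_skewness(15, 17, 3))   # -2.0
```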
Kurtosis
Suppose a distribution is symmetric; the next question is about the central peak: is it high and sharp, or
short and broad?
Pearson (1905) described kurtosis in comparison with the normal distribution and used the phrases leptokurtic,
platykurtic and mesokurtic to describe different distributions.
If the distribution has more values in the tails and a higher peak, it is leptokurtic. Such a curve,
classically likened to a leaping kangaroo, has long tails and is peaked up in the center.
If there are fewer values in the tails, more in the shoulders and fewer in the peak, it is platykurtic.
A platykurtic curve, like a platypus, has a short tail and is flat-topped.
Review-Lecture 1–Lecture 8
The following data are obtained on a variable X, the cpu time in seconds required to run a program using a
statistical package:
b) Break these data into 6 classes and construct a frequency, relative frequency and
cumulative frequency table and interpret the tables using non-technical languages.
c) Using the frequency table, calculate sample mean and sample standard deviation and interpret these two
measures.
d) Construct a histogram. Construct also the cumulative frequency ogive and use this ogive to approximate the
50th percentile, the first quartile and the third quartile.
Solution: Denote X - cpu time in seconds required to run a program using a statistical package.
(Please note that answers of the above questions can vary, please check your works very carefully).
Stem leaf
0 2
1 5
2 6
3 1488
4 114556669
5 22568
6 1128
7 179
8 01
Interpretation: Table 1 shows that 9 programs need 4.1 to 4.9 seconds to run, 5 programs need 5.2
to 5.8 seconds, and so on.
Table 2
X (Classes)    Frequency (fi)
__________________
0.2–1.5 1
1.5-2.8 1
2.8-4.1 6
4.1-5.4 10
5.4-6.7 6
6.7-8.1 6
__________________
n =30
Table 3
X (Classes)    Relative frequency (rfi = fi/n)
_________________________________
0.2–1.5        0.03
1.5-2.8        0.03
2.8-4.1        0.20
4.1-5.4        0.33
5.4-6.7        0.20
6.7-8.1        0.20
_________________________________
Σ rfi ≈ 1 (the rounded values sum to 0.99)
Table 4
X (Classes)    Cumulative frequency
_______________________________________
0.2–1.5        1
1.5-2.8        2
2.8-4.1        8
4.1-5.4        18
5.4-6.7        24
6.7-8.1        30
_______________________________________
Interpretation: Table 2 shows that 10 programs need 4.1 to 5.4 seconds, Table 3 shows that 33 percent of
programs need 4.1 to 5.4 seconds, and Table 4 shows that 18 programs need at most 5.4 seconds, and
so on.
Interpretation:
Mean = 5.0 seconds means that, on average, a program needs approximately 5 seconds to run.
Median = 4.75 seconds means that 50% programs to run need less than 4.75 seconds and rest of 50% need
more than 4.75 seconds.
Standard deviation = 1.859 seconds means that run times vary by about 1.859 seconds around the 5-second mean; not every program took 5 seconds to run.
Q1 = 4.02 seconds means that 25% of programs need less than 4.02 seconds to run and the remaining 75%
need more than 4.02 seconds.
Q3 = 6.12 seconds means that 75% of programs need less than 6.12 seconds to run and the remaining 25%
need more than 6.12 seconds.
Formulae
Mean - Ungrouped data:
Formula: x̄ = (1/n) Σᵢ xᵢ, where Σ is the summation sign.
Mean - Grouped data:
Formula: x̄ = (1/n) Σᵢ fᵢmᵢ, where fᵢ is the frequency of the ith class, mᵢ is the midpoint of the ith
class, midpoint = (LCL + UCL) of the ith class / 2, and k is the total no. of classes.
To obtain the median of an ungrouped data set, arrange the data in ascending order (smallest value to largest
value). If n is odd, the median is at position (n+1)/2 in the ordered list. If n is even, the median is the mean of
the values at positions (n/2) and (n/2)+1 in the ordered list.
Median - Grouped data:
Formula: Mₑ = l_Mₑ + ((n/2 − F_Mₑ₋₁)/f_Mₑ) × c, where l_Mₑ is the LCL of the median class, F_Mₑ₋₁ is the cf below the
median class, f_Mₑ is the frequency of the median class and c is the size of the median class.
To calculate percentile for a small set of data, arrange the data in ascending order.
Compute an index i, where i = (p/100)n, p is the percentile of interest and n is the total no. of observations.
If i is not an integer, round up. The next integer greater than i denote the position of the pth percentile.
If i is an integer, the pth percentile is the mean of the value of the positions i and i + 1.
Standard deviation - Ungrouped data:
Formula: σₓ = sqrt((1/(n−1)) Σᵢ(xᵢ − x̄)²), where xᵢ are the raw data and x̄ = (1/n) Σᵢ xᵢ.
Standard deviation - Grouped data:
Formula: σₓ = sqrt((1/(n−1)) Σᵢ fᵢ(mᵢ − x̄)²), where fᵢ is the frequency of the ith class, mᵢ is the mid-point
of the ith class interval and x̄ = (1/n) Σᵢ fᵢmᵢ.
d) Histogram of X
[Histogram omitted: # of programs (0-9) on the vertical axis vs CPU-time classes 0.2-1.5, 1.5-2.8, 2.8-4.1, 4.1-5.4, 5.4-6.7, 6.7-8.1]
Ogive of X:
Do by yourself
e) Sample skewness
skewness(X) = 0.4034
Formula: SK = 3(x̄ − Mₑ)/σ, where −3 ≤ SK ≤ 3, i.e. skewness can range from -3 to +3. Interpretation
of SK:
A value near -3, such as – 2.78 indicates considerable negative skewness.
Interpretation: SK = 0.4034 indicates slight positive skewness: a few programs need considerably more than about 5 seconds to run.
Random experiment, random variable, sample space, events (simple event, compound event), counting
rules, combinations, permutations, tree diagram, probability defined on events
Introduction
We finished the first important part of the course (known as data summary) and sat for CT1. Now we
move to the second very important part of the course, namely "Chance Theory". It is also known as
"Probability Theory". We use the word "chance" or "probability" frequently in real life. For
example:
(i) What is the chance of getting grade-A for the course MAT 211?
(ii) What is the chance that sales will decrease if we increase the price of a commodity?
(iii) What is the chance that a new investment will be profitable?
To understand, consider a situation where we ask the following question to 3 students:
What is the chance of getting grade-A for the course MAT 211?
Let’s explain their predicted values under the chance theory. What we can observe:
Student 1 is 95% confident he/she is getting grade A. That means past experience tells us out of 100
students, 95 students had grade A.
Student 2 is 100% confident he/she is getting grade A. All the students got grade A.
Student 3 is only 10% confident (less confident) he/she is getting grade A. Only 10 students out of 100
students got grade A.
How is this calculated?
Recall the relative frequency method, where relative frequency = frequency/n, and apply it. We will get
the answer. Suppose n = 100 and the # of students who got grade A is 95 (the frequency); then the probability is 0.95.
To calculate the probability of an event (recall that in the previous example one possible event is grade A), we have
to be very familiar with the following terms:
Random experiment – A process that generates well-defined outcomes. Events are also known as outcomes.
Random variable – Denoted r.v.; it describes the outcome we are interested in among all possible
outcomes. In the previous example, grade A is the outcome of interest.
Sample Space: It is very important; without it, the chance of an event cannot be calculated.
It is denoted by S and is the set of all possible outcomes of a random experiment (recall: a set is a
collection of objects). Sometimes it is not easy to enumerate (note that to get the idea of
S, we have to practice a lot!).
Our tools are knowledge of the experiment, the tree diagram (a wonderful method) and the counting rules
(permutation, combination). We will use the combination approach most of the time; however, the
permutation approach will also be used.
Random experiment 1:
Toss a fair coin. S = {H,T}, H-head and T-tail. If E – head, then P(H) = ½ = 0.5 and P(T) = ½ = 0.5.
Random experiment 2:
Select a part for inspection, S = {defective, non-defective}.
Random experiment 3:
Conduct a sales call, S = {purchase, no purchase}.
Random experiment 4:
Roll a fair die, S = {1,2,3,4,5,6}.
Random experiment 5:
Play a game, S = {win, lose, tie}.
Note that in the sample space S, the possible events are read with "or", not "and". It is impossible to
get H and T in a single toss; winning and losing the same game is likewise impossible (realize it!)
Important concepts
Mutually exclusive events – Two events that cannot occur simultaneously. Toss a
coin: H and T cannot both occur in a single random experiment. It is written as P(H ∩ T) = 0.
If P(A ∩ B) ≠ 0, the events are mutually inclusive. Toss two coins (or one coin two times): H and T can both occur in
this random experiment. For example, P(one H and one T) = 0.50, where S = {HH, HT, TH, TT}.
Equally likely events – Two events that have an equal chance of occurring. Toss a coin: P(H) = P(T) = 0.5.
Tree diagram – It is a technique to make a summary of all possible events of a random experiment
graphically.
Combination – It allows one to count the number of experimental outcomes when the experiment involves
selecting n objects from a set of N objects.
For example, if we want to select 5 students from a group of 10 students (where order is not important), then
NCn = N!/(n!(N−n)!) = 10!/(5!(10−5)!) = 252 possible ways the students can be selected; here S = 252.
Permutation – It allows one to count the number of experimental outcomes when n objects are to be
selected from a set of N objects, where the order of selection is important.
For example, if we want to select 5 students from a group of 10 students (where order is important), then
NPn = N!/(N−n)! = 10!/(10−5)! = 30240 possible ways the students can be selected; here S = 30240.
Ex: 1. How many ways can three items be selected from a group of six items? Use the letters A, B, C,
D, E and F to identify the items and list each of the different combinations of three items.
Solution: S = 6C3 = 6!/(3!(6−3)!) = 20 possible ways the letters can be selected. Some examples: ABC, ABD, ABE,
ABF, ……. DEF.
Ex: 2. How many permutations of three items can be selected from a group of six items? Use the letters A,
B, C, D, E and F to identify the items and list each of the different permutations of items B, D and F.
Solution: S = 6P3 = 6!/(6−3)! = 120 possible ways the letters can be selected.
Different permutations of items B, D and F: BDF, BFD, DBF, DFB, FDB, FBD, 6 outcomes.
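Both counting rules are available in Python's standard library, and itertools can list the outcomes themselves:

```python
from math import comb, perm
from itertools import combinations, permutations

items = "ABCDEF"
print(comb(6, 3), perm(6, 3))      # 20 120
print(comb(10, 5), perm(10, 5))    # 252 30240

# List the 20 combinations of three items
combos = ["".join(c) for c in combinations(items, 3)]
print(combos[:3], combos[-1])      # ['ABC', 'ABD', 'ABE'] DEF

# The 6 permutations of B, D and F
print(["".join(p) for p in permutations("BDF")])
```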
Solution: S = {E1, E2, E3}. Here P(E1) = 20/50=0.40, P(E2) = 13/50=0.26, P(E3) = 17/50=0.34
Ex:4: A decision maker subjectively assigned the following probabilities to the four outcomes of an
experiment: P(E1) = 0.10, P(E2) = 0.15, P(E3) = 0.40 and P(E4) = 0.20. Are these probability assignments
valid? Explain
Solution: S = {E1, E2, E3, E4}. Here p(E1) + p(E2) + p(E3) + p(E4) = 0.10 + 0.15 + 0.40 + 0.20 = 0.85 ≠ 1,
so the probability assignments are not valid: for any random experiment we require P(S) = 1.
Ex:5: Suppose that the manager of a large apartment complex provided the following probability estimates
about the number of vacancies that will exist next month. Find the probability of each of the following events:
Vacancies: 0 1 2 3 4 5
Probability: 0.05 0.15 0.35 0.25 0.10 0.10
a. No vacancies
b. At least four vacancies
c. Two or fewer vacancies
Solution: S={0,1,2,3,4,5}
a. p(0) = 0.05
b. p(4) + p(5) = 0.10 + 0.10 = 0.20
c. p(0) + p(1) + p(2) = 0.05 + 0.15 + 0.35 = 0.55
Participants
Activity Male Female
Bicycle riding 22.2 21.0
Camping 25.6 24.3
Exercise walking 28.7 57.7
Exercising with equipment 20.4 24.4
Swimming 26.4 34.4
a. For a randomly selected female, estimate the probability of participation in each of the sports
activities
b. For a randomly selected male, estimate the probability of participation in each of the sports
activities
c. For a randomly selected person, what is the probability the person participates to exercise walking?
d. Suppose you just happen to see an exercise walker going by. What is the probability the walker is
a woman? What is the probability the walker is a man?
Solution: S = {Br, C, EW, EE,S}, where Br - Bicycle riding, C- Camping, EW- Exercise walking, EE-
Exercising with equipment and S – Swimming.
a. For a randomly selected female, use the female total 161.8: P(Br) = 21.0/161.8 = 0.13, P(C) = 24.3/161.8 = 0.15,
P(EW) = 57.7/161.8 = 0.36, P(EE) = 24.4/161.8 = 0.15, P(S) = 34.4/161.8 = 0.21.
b. For a randomly selected male, use the male total 123.3: P(Br) = 22.2/123.3 = 0.18, P(C) = 25.6/123.3 = 0.21,
P(EW) = 28.7/123.3 = 0.23, P(EE) = 20.4/123.3 = 0.17, P(S) = 26.4/123.3 = 0.21.
c. The person can be male or female; the grand total is 123.3 + 161.8 = 285.1. Thus P(EW) = P(Male EW) + P(Female EW)
= (28.7 + 57.7)/285.1 = 86.4/285.1 = 0.30 = 30%.
d. We have to consider the exercise-walker population. Thus, P(woman/EW) = 57.7/(28.7 + 57.7) =
57.7/86.4 = 0.67 = 67%. P(man/EW) = 28.7/(28.7 + 57.7) = 28.7/86.4 = 0.33 = 33%.
HW: Textbook
Basic relationships of probability (addition law, complement law, conditional law, multiplication law)
Addition Law
Suppose we have two events A and B (A, B ⊆ S). The chance of A or B occurring is written as
P(A ∪ B) = P(A) + P(B) − P(A ∩ B), if the two events are not mutually exclusive.
P(A ∪ B) = P(A) + P(B), if the two events are mutually exclusive.
Problem 1
Consider the case of a small assembly plant with 50 employees. On occasion, some of the workers
fail to meet the performance standards by completing work late or assembling a defective product. At the
end of a performance evaluation period, the production manager found that 5 of the 50 workers completed
work late, 6 of the 50 workers assembled a defective product and 2 of the 50 workers both completed work
late and assembled a defective product. If one employee is selected randomly, what is the probability
that the worker completed work late or assembled a defective product?
Solution: Let L - work is completed late, D - assembled product is defective. Total employees S = 50.
We have to find P(L ∪ D). We know P(L ∪ D) = P(L) + P(D) − P(L ∩ D) = (5/50) + (6/50) − (2/50) = 0.10 + 0.12 −
0.04 = 0.18 = 18%.
The chance is 18% that the worker completed work late or assembled a defective product.
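The same addition-law calculation, as a sketch:

```python
# Problem 1: 50 employees; L = completed late, D = defective, both = 2
n = 50
p_late, p_def, p_both = 5 / n, 6 / n, 2 / n
p_union = p_late + p_def - p_both   # addition law: P(L u D)
print(round(p_union, 2))            # 0.18
```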
Problem 2
A telephone survey to determine viewer response to a new television show obtained the following data
(i) What is the chance that the viewer will rate the new show as average or better?
(ii) What is the chance that the viewer will rate the new show as average or worse?
The viewer will rate the new show as average or better, chance is 76%.
The viewer will rate the new show as average or worse, chance is 46%.
Complement Law
Suppose we have one event A; then the chance of not getting event A is defined as
P(not A) = 1 − P(A).
Keyword: not
Recall Problem 1
(i) What is the chance that the randomly selected worker did not complete work late?
(ii) If one employee is selected randomly, what is the probability that the worker neither completed
work late nor assembled a defective product?
The chance is 90% that the randomly selected worker did not complete work late (1 − 0.10 = 0.90).
The chance is 82% that the worker neither completed work late nor assembled a defective
product (1 − 0.18 = 0.82).
Conditional Law
Suppose we have two events A and B (A, B ⊆ S); the chance of getting A when B is known to have occurred (or B when A
is known) is defined as P(A|B) = P(A ∩ B)/P(B) and P(B|A) = P(A ∩ B)/P(A).
Roll a die. What is the chance of getting the die will show
(i) 2
(ii) Even number
(iii) 2 or even number
(iv) Not 2
(v) 2 given that die will show even number
(vi) 2 given that die will show odd number
Solution: S = {1,2,3,4,5,6}. (i) P(2) = 1/6 (ii) P(even number) = 3/6 (iii) P(2 ∪ even number) = (1/6) + (3/6) − (1/6) = 3/6 (iv) P(not 2) = 1 − 1/6 = 5/6 (v) P(2|even) = 1/3, since the conditional sample space is {2,4,6} (vi) P(2|odd) = 0, since the conditional sample space is {1,3,5}.
Observe carefully that (i) to (iv) are unconditional probabilities, while (v) and (vi) are conditional probabilities.
To calculate (i) to (iv) we used the unconditional sample space, whereas for (v) and (vi) we used the
conditional sample space, where the given condition restricts the roll to even or odd numbers.
Multiplication law
Suppose we have two events A and B (A, B ⊆ S); the chance of getting A and B together is defined as
P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B).
Problem
Consider the situation of the promotion status of male and female officers of a major metropolitan police
force in the eastern United States. The force consists of 1200 officers: 960 men and 240 women. Over the
past two years, 324 officers on the police force received promotions. The specific breakdown of promotions
for male and female officers is shown in the following Table
Table: Promotion status of police officers over the past two years
b) P(Men) = 0.80, P(Women) = 0.20, P(Promoted) = 0.27, P(Not Promoted) = 0.73, these are known as
marginal probabilities.
c) P(Promoted/Men) = 288/960 = 0.30.
e) P(Male/Promotion) = 288/324 ≈ 0.89.
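These marginal and conditional probabilities can be checked from the counts given in the text; the 288 promoted men come from part c), and 324 − 288 = 36 promoted women follows from the promotion total:

```python
# Counts from the problem: 1200 officers, 960 men, 240 women, 324 promoted
men, women = 960, 240
promoted, promoted_men = 324, 288
promoted_women = promoted - promoted_men    # 36, implied by the totals
total = men + women                         # 1200

p_men = men / total                         # marginal: 0.80
p_promoted = promoted / total               # marginal: 0.27
p_prom_given_men = promoted_men / men       # conditional: 0.30
p_men_given_prom = promoted_men / promoted  # conditional: about 0.89
print(p_men, p_promoted, p_prom_given_men, round(p_men_given_prom, 3))
```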
Requirements:
Syllabus:
3) Lecture 8-Lecture 10
4) Related Text book questions
/Good Luck/
The normal distribution was introduced by Abraham de Moivre, a French-born mathematician, in 1733. Its form or
shape is given in the following:
The mathematical equation depends upon two parameters, the mean (μ) and the standard deviation (σ), as follows:
f(x) = (1/(σ√(2π))) e^(−(1/2)((x−μ)/σ)²), −∞ < x < ∞,
where μ = mean of the normal variable, σ = SD of the normal variable (μ and σ determine the location and
shape of the normal probability distribution), and π and e are mathematical constants whose values are
approximately 3.14 and 2.718 respectively.
By notation X ~ N(μ, σ), read as X is normally distributed with mean μ and standard deviation σ.
It is true that once μ and σ are specified, the normal curve is completely determined.
A random variable that has a normal distribution with a mean of zero and standard deviation of one is said
to have a standard normal probability distribution.
The letter Z is commonly used to designate this particular normal random variable.
For the standard normal probability distribution, areas under the normal curve have been computed and are
available in tables that can be used in computing probabilities.
The reason for discussing the standard normal distribution so extensively is that probabilities for all normal
distributions are computed by using the standard normal distribution. That is, when we have a normal
distribution with any mean and standard deviation, we answer probability questions about the distribution
by first converting to the standard normal distribution. Then we can use Table and the appropriate Z values
to find the desired probabilities.
Problem
According to a survey, subscribers to The Wall Street Journal Interactive Edition spend an average
of 27 hours per week using the computer at work. Assume the normal distribution applies and that the
standard deviation is 8 hours.
a) What is the probability a randomly selected subscriber spends less than 11 hours using the computer
at work?
b) What percentage of the subscribers spends more than 40 hours per week using the computer at
work?
c) A person is classified as a heavy user if he or she is in the upper 20% in terms of hours of usage.
How many hours must a subscriber use the computer in order to be classified as a heavy user?
Solution
Denote X = No. of hours per week using the computer at work, where X~N(27, SD = 8)
c) Find X when the upper-tail probability p = 0.20. Thus, X = μ + Zσ = 27 + 8Z. From the standard normal table,
the Z-value cutting off the upper 20% is Z ≈ 0.84, so X ≈ 27 + 8(0.84) ≈ 33.7 hours.
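All three parts can be checked with Python's built-in NormalDist (no table needed); the numbers are a sketch to compare against table look-ups:

```python
from statistics import NormalDist

X = NormalDist(mu=27, sigma=8)   # hours per week, X ~ N(27, 8)

p_a = X.cdf(11)                  # a) P(X < 11): Z = -2, about 0.023
p_b = 1 - X.cdf(40)              # b) P(X > 40): Z = 1.625, about 0.052
x_c = X.inv_cdf(0.80)            # c) upper-20% cutoff: about 33.7 hours
print(round(p_a, 4), round(p_b, 4), round(x_c, 1))
```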
Requirements:
Syllabus:
Our next step is how to collect random samples from the target population and how to summarize the collected
raw data in effective ways so that general readers can understand it clearly.
Generally, there are two ways the required information may be obtained:
a) Census survey and
b) Sample survey.
The total count of all units of the population for a certain characteristic is known as complete enumeration,
also termed a census survey.
The money, manpower and time required for carrying out complete enumeration are generally large, and
there are many situations where complete enumeration is not possible. Thus, sample enumeration, or a sample
survey, is used: a random part of the population is selected using a table of random numbers (e.g. see Text,
p.269), constructed from the digits 0, 1, …, 9.
Suppose the monthly pocket money (TK/-) given to each of the 50 School of Business students at IUB as
follows:
1100 1500 8900 4500 2700 3800 3000 6700 2600 3600
7500 7900 4600 2000 2400 1300 8500 6500 6200 5800
6000 6800 9200 3800 1200 8000 7100 8600 8700 6300
7600 7700 2600 7800 2000 9000 7300 8400 1700 2500
5700 5300 5500 1700 3700 5400 2400 4000 1200 7300
To draw a random sample of size 10 from a population of size 50, first of all, need to identify the 50 units
of the population with the numbers 1 to 50.
1100(01) 1500(02) 8900(03) 4500(04) 2700(05) 3800(06) 3000(07) 6700(08) 2600(09) 3600(10)
7500(11) 7900(12) 4600(13) 2000(14) 2400(15) 1300(16) 8500(17) 6500(18) 6200(19) 5800(20)
6000(21) 6800(22) 9200(23) 3800(24) 1200(25) 8000(26) 7100(27) 8600(28) 8700(29) 6300(30)
7600(31) 7700(32) 2600(33) 7800(34) 2000(35) 9000(36) 7300(37) 8400(38) 1700(39) 2500(40)
5700(41) 5300(42) 5500(43) 1700(44) 3700(45) 5400(46) 2400(47) 4000(48) 1200(49) 7300(50)
Then, in the given random number table, starting with the first number and moving row-wise (or column-wise, or diagonally), pick out the numbers in pairs, one by one, ignoring those numbers which are greater than 50 or have already been drawn.
# Selected (row-wise) monthly pocket money (TK/-) of 10 students out of 50: 7100, 2400, 3700, 7500, 1500, 2000, 6500, 3000, 1700, 7600
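The same draw can be done in software. In the sketch below, Python's random.sample plays the role of the random-number table: it draws 10 distinct units without replacement from the 50 pocket-money values above.

```python
import random

# The 50 monthly pocket-money values from the example above (TK/-)
population = [
    1100, 1500, 8900, 4500, 2700, 3800, 3000, 6700, 2600, 3600,
    7500, 7900, 4600, 2000, 2400, 1300, 8500, 6500, 6200, 5800,
    6000, 6800, 9200, 3800, 1200, 8000, 7100, 8600, 8700, 6300,
    7600, 7700, 2600, 7800, 2000, 9000, 7300, 8400, 1700, 2500,
    5700, 5300, 5500, 1700, 3700, 5400, 2400, 4000, 1200, 7300,
]

# Draw 10 distinct units without replacement
sample = random.sample(population, k=10)
print(sample)
```

Each run gives a different random sample, which is exactly the "set to set will vary for each draw" behaviour described earlier.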
HW:
Calculate the mean and standard deviation of the 10 students' monthly pocket money (use the formula and a scientific calculator).
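As a check on the HW, the sample mean and sample standard deviation of the 10 row-wise selected values can be computed with Python's statistics module (stdev divides by n − 1):

```python
from statistics import mean, stdev

# The 10 row-wise selected pocket-money values (TK/-)
sample = [7100, 2400, 3700, 7500, 1500, 2000, 6500, 3000, 1700, 7600]

x_bar = mean(sample)   # sample mean
s = stdev(sample)      # sample SD (n - 1 divisor)
print(x_bar, round(s, 2))
```

The mean comes out to TK 4300 and the sample SD to about TK 2568.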
The sample statistic is calculated from the sample data, and the population parameter is inferred (or estimated) from this sample statistic. In other words, statistics are calculated; parameters are estimated.
Point Estimate – a single best value. For example, the mean and SD of total marks for a course for IUB students are point estimates because each is a single value.
The point estimate varies from sample to sample and will generally differ from the population parameter because of sampling error. There is no way to know how close it is to the actual parameter. For this reason, statisticians prefer to give an interval estimate (confidence interval), which is a range of values used to estimate the parameter.
A confidence interval is an interval estimate with a specific level of confidence. The level of confidence is the probability that the interval estimate will contain the parameter; it equals 1 − α, and an area of 1 − α lies within the confidence interval.
Problem
Suppose the total marks for a course of 35 randomly selected IUB students are normally distributed with mean 78 and SD 9. Find 90%, 95% and 99% confidence intervals for the population mean μ. Make a summary based on your findings.
Solution:
We are given X~N(78, 9), where X denotes the total marks for a course of the randomly selected IUB students and n = 35.
𝑥̅ − Z_{α/2}·σ̂x/√n < μ < 𝑥̅ + Z_{α/2}·σ̂x/√n
Here 𝑥̅ = 78, σ̂x = 9, n = 35, α = 1 − 0.90 = 0.10, α/2 = 0.05 and Z_{α/2} = Z_{0.05} = 1.65
Thus,
78 − 1.65·(9/√35) < μ < 78 + 1.65·(9/√35)
Summary: Based on our findings, we are 90% confident that the population mean ranges from 75.5 to 80.5.
𝑥̅ − Z_{α/2}·σ̂x/√n < μ < 𝑥̅ + Z_{α/2}·σ̂x/√n
Here 𝑥̅ = 78, σ̂x = 9, n = 35, α = 1 − 0.95 = 0.05, α/2 = 0.025 and Z_{α/2} = Z_{0.025} = 1.96
Thus,
78 − 1.96·(9/√35) < μ < 78 + 1.96·(9/√35)
Summary:
Based on our findings, we are 95% confident that the population mean ranges from 75.02 to 80.98.
𝑥̅ − Z_{α/2}·σ̂x/√n < μ < 𝑥̅ + Z_{α/2}·σ̂x/√n
Here 𝑥̅ = 78, σ̂x = 9, n = 35, α = 1 − 0.99 = 0.01, α/2 = 0.005 and Z_{α/2} = Z_{0.005} = 2.58
Thus,
78 − 2.58·(9/√35) < μ < 78 + 2.58·(9/√35)
Summary:
Based on our findings, we are 99% confident that the population mean ranges from 74.08 to 81.92.
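The three intervals differ only in the z-value. A minimal Python sketch using the table z-values from the notes (1.65, 1.96, 2.58):

```python
from math import sqrt

def z_interval(x_bar, sd, n, z):
    """Two-sided z confidence interval for the population mean."""
    margin = z * sd / sqrt(n)
    return x_bar - margin, x_bar + margin

# z-values used in the notes: 1.65 (90%), 1.96 (95%), 2.58 (99%)
for z in (1.65, 1.96, 2.58):
    lo, hi = z_interval(78, 9, 35, z)
    print(round(lo, 2), round(hi, 2))
```

The printed limits reproduce the three hand-computed intervals above.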
Practice problems
1. In an effort to estimate the mean amount spent per customer for dinner at a major Atlanta restaurant, data were collected for a sample of 49 customers over a three-week period. Assume a population standard deviation of $2.50.
a. At a 95% confidence level, what is the margin of error?
b. If the sample mean is $22.6, what is the 95% confidence interval for the population mean?
(Solve it)
2. A machine fills bags of popcorn; the weight of a bag is known to be normally distributed with mean weight 14.1 oz and SD 0.3 oz. If a sample of 40 bags is taken, what is a 95% confidence interval for the population mean μ?
Guideline:
a) X – weight of bags. Here n = 40, 𝑥̅ = 14.1, σ̂x = 0.3, α = 1 − 0.95 = 0.05, α/2 = 0.025 and Z_{α/2} = Z_{0.025} = 1.96
(Solve it)
3. The National Quality Research Center at the University of Michigan provides a quarterly measure of
consumer opinions about products and services (The Wall Street Journal, February 18, 2013). A survey of
40 restaurants in the Fast Food/ Pizza group showed a sample mean customer satisfaction index of 71. Past
data indicate that the population standard deviation of the index has been relatively stable with σ = 5.
Guideline:
4. The undergraduate GPA for students admitted to the top graduate business schools is 3.37. Assume this
estimate is based on a sample of 120 students admitted to the top schools. Using past years' data, the
population standard deviation can be assumed known with σ = 0.28. What is the 95% confidence interval
estimate of the mean undergraduate GPA for students admitted to the top graduate business schools?
Guideline:
HW: Text,
When the sample size is less than 30, i.e. n < 30, the standardized sample mean has a Student's t distribution. The Student's t distribution was developed by William S. Gosset, who worked at a brewery in Dublin, Ireland. His employer would not allow him to publish under his own name, so he used the pseudonym "Student".
𝑥̅ − t_{α/2,(n−1)}·σ̂x/√n < μ < 𝑥̅ + t_{α/2,(n−1)}·σ̂x/√n
Problem
Suppose we are given the sample heights of 20 IUB students, where 𝑥̅ = 67.3", SD = 3.6" and the distribution is symmetric. Develop a 95% confidence interval for μ and make a summary based on your findings.
Solution:
We are given X~N(67.3, 3.6), where X denotes the heights of 20 randomly selected IUB students and n = 20.
𝑥̅ − t_{α/2,(n−1)}·σ̂x/√n < μ < 𝑥̅ + t_{α/2,(n−1)}·σ̂x/√n
Here 𝑥̅ = 67.3, σ̂x = 3.6, n = 20, α = 1 − 0.95 = 0.05, α/2 = 0.025 and t_{α/2,(n−1)} = t_{0.025,19} = 2.093
Thus,
67.3 − 2.093·(3.6/√20) < μ < 67.3 + 2.093·(3.6/√20)
Summary: Based on our findings, we are 95% confident that the population mean ranges from 65.61 to 68.98.
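A quick numeric check of this t-interval in Python; the critical value t_{0.025,19} = 2.093 is the t-table value used above:

```python
from math import sqrt

# Summary statistics from the problem; 2.093 = t_{0.025, 19} from the t table
x_bar, s, n, t = 67.3, 3.6, 20, 2.093

margin = t * s / sqrt(n)          # half-width of the interval
lo, hi = x_bar - margin, x_bar + margin
print(round(lo, 2), round(hi, 2))
```
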
1. The International Air Transport Association surveys business travelers to develop quality ratings for
transatlantic gateway airports. The maximum possible rating is 10. Suppose a simple random sample of 25
business travelers is selected and each traveler is asked to provide a rating for the Miami International
Airport. The ratings obtained from the sample of 25 business travelers follow.
6, 4, 6, 8, 7, 7, 6, 3, 3, 8, 10, 4, 8, 7, 8, 7, 5, 9, 5, 8, 4, 3, 8, 5, 5
Develop a 95% confidence interval estimate of the population mean rating for Miami.
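For practice problem 1, the sample mean and SD can be computed from the raw ratings; assuming the table value t_{0.025,24} = 2.064 for 24 degrees of freedom, a Python sketch is:

```python
from math import sqrt
from statistics import mean, stdev

ratings = [6, 4, 6, 8, 7, 7, 6, 3, 3, 8, 10, 4, 8,
           7, 8, 7, 5, 9, 5, 8, 4, 3, 8, 5, 5]

x_bar = mean(ratings)
s = stdev(ratings)                 # sample SD (n - 1 divisor)
t = 2.064                          # t_{0.025, 24} from the t table (assumed)
margin = t * s / sqrt(len(ratings))
print(round(x_bar, 2), round(x_bar - margin, 2), round(x_bar + margin, 2))
```
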
3. A machine fills bags of popcorn; the weight of a bag is known to be normally distributed with mean weight 10.5 oz and SD 0.8 oz. If a sample of 10 bags is taken, what is a 90% confidence interval for the population mean μ?
We have learned that estimates of population means can be made from sample means, and confidence
intervals can be constructed to better describe those estimates. Similarly, we can estimate a population
standard deviation from a sample standard deviation, and when the original population is normally
distributed, we can construct confidence intervals of the standard deviation as well.
Variances and standard deviations are a very different type of measure than an average, so we can expect
some major differences in the way estimates are made.
We know that the population variance formula, when used on a sample, does not give an unbiased estimate
of the population variance. In fact, it tends to underestimate the actual population variance. For that reason,
there are two formulas for variance, one for a population and one for a sample. The sample variance formula
is an unbiased estimator of the population variance.
Also, both the variance and the standard deviation are nonnegative numbers. Since neither can take a negative value, the normal distribution cannot be the distribution of a variance or a standard deviation. It can be shown that if the original population of data is normally distributed, then the expression (n−1)S²/σ² has a chi-square distribution with n−1 degrees of freedom.
The chi-square distribution of the quantity (n−1)S²/σ² allows us to construct confidence intervals for the variance and the standard deviation (when the original population of data is normally distributed).
(n − 1)S²/χ²_{α/2} < σ² < (n − 1)S²/χ²_{1−α/2}
where the χ² values are based on a chi-square distribution with n − 1 degrees of freedom and 1 − α is the confidence coefficient (for details see Text, p.440).
√((n − 1)S²/χ²_{α/2}) < σ < √((n − 1)S²/χ²_{1−α/2})
where the χ² values are based on a chi-square distribution with n − 1 degrees of freedom and 1 − α is the confidence coefficient (for details see Text, p.440).
Problem-1
A statistician chooses 27 randomly selected dates and when examining the occupancy records of a particular
motel for those dates, finds a standard deviation of 5.86 rooms rented. If the number of rooms rented is
normally distributed, find the 95% confidence interval for the population standard deviation of the number
of rooms rented.
Solution:
Here n = 27 and S = 5.86. Thus,
√((27 − 1)·5.86²/41.923) < σ < √((27 − 1)·5.86²/13.844)
Summary: Based on our findings, we are 95% confident that the population standard deviation ranges from 4.615 to 8.031.
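A numeric check in Python; the chi-square critical values for 26 degrees of freedom (41.923 and 13.844) are the table values used above:

```python
from math import sqrt

n, s = 27, 5.86
chi2_upper = 41.923   # chi^2_{0.025, 26} from the table
chi2_lower = 13.844   # chi^2_{0.975, 26} from the table

# CI for the population SD: take square roots of the variance limits
lo = sqrt((n - 1) * s**2 / chi2_upper)
hi = sqrt((n - 1) * s**2 / chi2_lower)
print(round(lo, 3), round(hi, 3))
```

Squaring the two limits gives the variance interval used in Problem-2 below it.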
Problem-2
A statistician chooses 27 randomly selected dates and when examining the occupancy records of a particular
motel for those dates, finds a standard deviation of 5.86 rooms rented. If the number of rooms rented is
normally distributed, find the 95% confidence interval for the population variance of the number of rooms
rented.
Solution:
Here n = 27 and S = 5.86. Thus,
(27 − 1)·5.86²/41.923 < σ² < (27 − 1)·5.86²/13.844
Summary: Based on our findings, we are 95% confident that the population variance ranges from 21.297 to 64.492.
Practice problems
1. The variance in drug weights is critical in the pharmaceutical industry. For a specific drug, with
weights measured in grams, a sample of 18 units provided a sample variance of s2=0.36.
a. Construct a 90% confidence interval estimate of the population variance for the weight of this
drug.
b. Construct a 90% confidence interval estimate of the population standard deviation.
2. The daily car rental rates for a sample of eight cities follow:
Atlanta 69
Chicago 72
Dallas 75
New Orleans 67
Phoenix 62
Pittsburgh 65
San Francisco 61
Seattle 59
a. Compute the sample variance and the sample standard deviation for these data.
b. What is the 95% confidence interval estimate of the variance of car rental rates for the population?
c. What is the 90% confidence interval estimate of the standard deviation for the population?
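Part (a) of problem 2 can be checked with the statistics module (variance and stdev both use the n − 1 divisor):

```python
from statistics import stdev, variance

rates = [69, 72, 75, 67, 62, 65, 61, 59]   # daily car rental rates ($)

s2 = variance(rates)   # sample variance
s = stdev(rates)       # sample standard deviation
print(round(s2, 2), round(s, 2))
```

With s² in hand, parts (b) and (c) follow the chi-square interval formulas above, using critical values for 7 degrees of freedom.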
For interval estimation about two population means and standard deviations, see Text, Chapter 10.
Lecture 19
Tests of hypothesis
In general, we do not know the true value of population parameters (mean, proportion, variance,
SD and others). They must be estimated based on random samples. However, we do have
hypotheses about what the true values are.
The major purpose of hypothesis testing is to choose between two competing hypotheses about the value of a population parameter. One of them, called the null hypothesis and denoted by H0, is a tentative assumption about the parameter. We then need to define another hypothesis, called the alternative hypothesis, which is the opposite of H0. It is denoted by Ha or H1.
Both the null and alternative hypothesis should be stated before any statistical test of significance
is conducted.
In general, it is most convenient to always have the null hypothesis contain an equal sign, e.g. H0: μ = μ0, H0: μ ≤ μ0 or H0: μ ≥ μ0.
In general, hypothesis tests about the value of the population mean take one of the following three forms:
i) H0: μ = μ0 vs. H1: μ ≠ μ0
ii) H0: μ ≥ μ0 vs. H1: μ < μ0
iii) H0: μ ≤ μ0 vs. H1: μ > μ0
Problem 1
The manager of an automobile dealership is considering a new bonus plan designed to increase
sales volume. Currently, the mean sales volume is 14 automobiles per month. The manager wants
to conduct a research study to see whether the new bonus plan increases sales volume. To collect
data on the plan, a sample of sales personnel will be allowed to sell under the new bonus plan for
a 1-month period. Define the null and the alternative hypotheses.
Problem 2
The manager of an automobile dealership is considering a new bonus plan designed to increase
sales volume. Currently, the mean sales volume is 14 automobiles per month. The manager wants
to conduct a research study to see whether the new bonus plan decreases sales volume. To collect
data on the plan, a sample of sales personnel will be allowed to sell under the new bonus plan for
a 1-month period. Define the null and the alternative hypotheses.
Problem 3
The manager of an automobile dealership is considering a new bonus plan designed to increase
sales volume. Currently, the mean sales volume is 14 automobiles per month. The manager wants
to conduct a research study to see whether the new bonus plan changes sales volume. To collect
data on the plan, a sample of sales personnel will be allowed to sell under the new bonus plan for
a 1-month period. Define the null and the alternative hypotheses.
2. Specify the level of significance, α, which defines unlikely values of the sample statistic if the null hypothesis is true. It is selected by the researcher at the start. The common values of α are 0.01, 0.05 and 0.10, with 0.05 the most common.
3. Select the test statistic (a quantity calculated using the sample values that is used to perform the
hypothesis test) that will be used to test the hypothesis.
4. Use α to determine the critical value (a boundary value that separates the critical region from the non-critical, or acceptance, region, based upon the given risk level α) for the test statistic, and state the rejection rule for H0.
The critical region (CR), or rejection region (RR), is the set of test-statistic values for which H0 is rejected. The non-critical region, or acceptance region (AR), is the set of test-statistic values for which H0 is not rejected.
5. Collect the sample data and compute the value of the test statistic.
6. Use the value of the test statistic and the rejection rule to determine whether to reject H 0.
The p-value is the probability, when H0 is true, of obtaining a sample result that is at least as unlikely as the one observed. More clearly, the p-value is a measure of the likelihood of the sample results when H0 is assumed to be true. The smaller the p-value, the less likely it is that the sample results came from a situation where H0 is true. It is often called the observed level of significance. The user can then compare the p-value to α and draw a hypothesis-test conclusion without referring to a statistical table.
Problem-4
Individuals filing federal income tax returns prior to March 31 had an average refund of $1056. Consider the population of last-minute filers who mail their returns during the last 5 days of the income tax period, typically April 10 to April 15. A researcher suggests that one of the reasons individuals wait until the last 5 days to file their returns is that on average those individuals have a lower refund than early filers.
a) Develop appropriate hypotheses such that rejection of null hypothesis will support the
researcher’s argument.
b. Using 5% level of significance, what is the critical value for the test statistic and what is the
rejection rule?
Solution
Denote X – individuals' federal income tax refunds. Here n = 400, 𝑥̅ = $910 and σ̂x = $1600.
(a) H0: μ ≥ 1056 vs. Ha: μ < 1056, so that rejecting H0 supports the researcher's argument.
(b) Since n > 30, we choose the z-statistic. The critical value of the z-statistic at the 5% level of significance, found from the z table, is −1.645; reject H0 if z < −1.645.
Test statistic: z = (𝑥̅ − μ0)/(σ̂x/√n) = (910 − 1056)/(1600/√400) = −146/80 = −1.825. Since −1.825 < −1.645, we reject H0.
(d) Conclusion
Thus, we are 95% confident that we may reject the null hypothesis and accept the alternative hypothesis. More clearly, based on the sample evidence, it may be concluded that the researcher's claim is true: individuals filing federal income tax returns between April 10 and April 15 had an average refund lower than $1056.
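The test-statistic computation can be sketched in Python (the critical value −1.645 is the table value used above):

```python
from math import sqrt

# Lower-tail z test, H0: mu >= 1056 vs Ha: mu < 1056
x_bar, mu0, s, n = 910, 1056, 1600, 400

z = (x_bar - mu0) / (s / sqrt(n))   # standardized sample mean
reject = z < -1.645                 # rejection rule at the 5% level
print(round(z, 3), reject)
```
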
Problem-5
Individuals filing federal income tax returns prior to March 31 had an average refund of $1056. Consider the population of last-minute filers who mail their returns during the last 5 days of the income tax period, typically April 10 to April 15. A researcher suggests that one of the reasons individuals wait until the last 5 days to file their returns is that on average those individuals have a greater refund than early filers.
a) Develop appropriate hypotheses such that rejection of null hypothesis will support the
researcher's argument.
b. Using 5% level of significance, what is the critical value for the test statistic and what is the
rejection rule?
c. For a sample of 400 individuals who filed a return between April 10 and April 15, the sample
mean refund was $910 and the sample standard deviation was $1600. Compute the value of the
test statistic.
Solution
Denote X – individuals' federal income tax refunds. Here n = 400, 𝑥̅ = $910 and σ̂x = $1600.
(a) H0: μ ≤ 1056 vs. Ha: μ > 1056, so that rejecting H0 supports the researcher's argument.
(b) Since n > 30, we choose the z-statistic. The critical value of the z-statistic at the 5% level of significance, found from the z table, is 1.645; reject H0 if z > 1.645.
(c) z = (910 − 1056)/(1600/√400) = −146/80 = −1.825. Since −1.825 < 1.645, we cannot reject H0.
Thus, we are 95% confident that we may accept the null hypothesis and reject the alternative hypothesis. More clearly, based on the sample evidence, it may be concluded that the researcher's claim is false: individuals filing federal income tax returns between April 10 and April 15 had an average refund no greater than $1056.
Problem- 6
Individuals filing federal income tax returns prior to March 31 had an average refund of $1056. Consider the population of last-minute filers who mail their returns during the last 5 days of the income tax period, typically April 10 to April 15. A researcher suggests that one of the reasons individuals wait until the last 5 days to file their returns is that on average those individuals have a different refund than early filers.
a) Develop appropriate hypotheses such that rejection of null hypothesis will support the
researcher's argument.
b. Using 5% level of significance, what is the critical value for the test statistic and what is the
rejection rule?
c. For a sample of 400 individuals who filed a return between April 10 and April 15, the sample
mean refund was $910 and the sample standard deviation was $1600. Compute the value of the
test statistic.
Solution
Denote X – individuals' federal income tax refunds. Here n = 400, 𝑥̅ = $910 and σ̂x = $1600.
(a) H0: μ = 1056 vs. Ha: μ ≠ 1056 (a two-tailed test).
(b) Since n > 30, we choose the z-statistic. The critical values of the z-statistic at the 5% level of significance, found from the z table, are ±1.96; reject H0 if z < −1.96 or z > 1.96.
(c) z = (910 − 1056)/(1600/√400) = −146/80 = −1.825. Since −1.96 < −1.825 < 1.96, we cannot reject H0.
(d) Conclusion
Thus, we are 95% confident that we may accept the null hypothesis and reject the alternative hypothesis. More clearly, based on the sample evidence, it may be concluded that the researcher's claim is false: the average refund of individuals filing federal income tax returns between April 10 and April 15 has not changed from $1056.
Practice problem
The Edison Electric Institute has published figures on the annual number of kilowatt-hours expended by various home appliances. It is claimed that a vacuum cleaner expends an average of 46 kilowatt-hours per year. If a random sample of 42 homes included in a planned study indicates that vacuum cleaners expend an average of 42 kilowatt-hours per year with a SD of 11.9 kilowatt-hours, does this suggest at the 0.10 level of significance that vacuum cleaners expend, on average, less than 46 kilowatt-hours annually? Assume the population of kilowatt-hours to be normal.
Guideline
X – number of kilowatt-hours expended on vacuum cleaners per home. Here n = 42, 𝑥̅ = 42 and SD = 11.9.
Since n > 30, choose the z-statistic. The critical value of the z-statistic at the 0.10 level of significance (lower tail), found from the z table, is −1.282.
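A Python check of this guideline; NormalDist also gives the lower-tail p-value, so the conclusion can be read off without a table:

```python
from math import sqrt
from statistics import NormalDist

# Lower-tail z test, H0: mu >= 46 vs Ha: mu < 46
x_bar, mu0, s, n = 42, 46, 11.9, 42

z = (x_bar - mu0) / (s / sqrt(n))
p_value = NormalDist().cdf(z)          # lower-tail p-value
print(round(z, 2), round(p_value, 4))
```

Since the p-value is well below 0.10, the data suggest vacuum cleaners expend less than 46 kilowatt-hours annually on average.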
Problem-7
Individuals filing federal income tax returns prior to March 31 had an average refund of $1056. Consider the population of last-minute filers who mail their returns during the last 5 days of the income tax period, typically April 10 to April 15. A researcher suggests that one of the reasons individuals wait until the last 5 days to file their returns is that on average those individuals have a lower refund than early filers.
a) Develop appropriate hypotheses such that rejection of null hypothesis will support the
researcher's argument.
b. Using 5% level of significance, what is the critical value for the test statistic and what is the
rejection rule?
c. For a sample of 10 individuals who filed a return between April 10 and April 15, the sample
mean refund was $910 and the sample standard deviation was $1600. Compute the value of the
test statistic.
Solution
Denote X – individuals' federal income tax refunds. Here n = 10, 𝑥̅ = $910 and σ̂x = $1600.
(a) H0: μ ≥ 1056 vs. Ha: μ < 1056.
(b) Since n < 30, we choose the t-statistic. The critical value of the t-statistic at the 5% level of significance with 9 df, found from the t table, is −1.833; reject H0 if t < −1.833.
(c) t = (910 − 1056)/(1600/√10) = −0.289. Since −0.289 > −1.833, we cannot reject H0.
(d) Conclusion
Thus, we are 95% confident that we may accept the null hypothesis and reject the alternative hypothesis. More clearly, based on the sample evidence, it may be concluded that the researcher's claim is not supported: there is no evidence that the average refund of individuals filing federal income tax returns between April 10 and April 15 is lower than $1056.
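The small-sample computation can be checked in Python; the critical value −1.833 = −t_{0.05,9} is the table value used above:

```python
from math import sqrt

# Lower-tail t test with n = 10, H0: mu >= 1056 vs Ha: mu < 1056
x_bar, mu0, s, n = 910, 1056, 1600, 10

t = (x_bar - mu0) / (s / sqrt(n))
reject = t < -1.833                 # -t_{0.05, 9} from the t table
print(round(t, 3), reject)
```
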
Practice problem
Joan's Nursery specializes in custom-designed landscaping for residential areas. The estimated labor cost associated with a particular landscaping proposal is based on the number of plantings of trees, shrubs and so on to be used for the project. For cost-estimating purposes, managers use 2 hours of labor time for the planting of a medium-size tree. Actual times from a sample of 10 plantings during the past month follow (time in hours): 1.9, 1.7, 2.8, 2.4, 2.6, 2.5, 2.8, 3.2, 1.6 and 2.5. Using the 0.05 level of significance, test whether the mean tree-planting time exceeds 2 hours.
Guideline:
X – tree-planting time. Here n = 10, 𝑥̅ = 2.4 and SD = 0.52 (use a calculator to find these).
Since n < 30, choose the t-statistic. The critical value of the t-statistic at the 5% level of significance with 9 df, found from the t table, is 1.833.
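Starting from the raw planting times, the summary statistics and the t-statistic can be computed directly:

```python
from math import sqrt
from statistics import mean, stdev

times = [1.9, 1.7, 2.8, 2.4, 2.6, 2.5, 2.8, 3.2, 1.6, 2.5]  # hours

x_bar, s, n = mean(times), stdev(times), len(times)
t = (x_bar - 2) / (s / sqrt(n))     # H0: mu <= 2 vs Ha: mu > 2
print(round(x_bar, 1), round(s, 2), round(t, 2))
```

The t-statistic exceeds the table critical value, so the sample suggests the mean planting time exceeds 2 hours.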
Test Statistic:
χ² = (n − 1)s²/σ₀², where σ₀² is the hypothesized value of the population variance.
Problem 8
A Fortune study found that the variance in the number of vehicles owned or leased by subscribers to Fortune magazine is 0.94. Assume a sample of 12 subscribers to another magazine provided the following data on the number of vehicles owned or leased: 2, 1, 2, 0, 3, 2, 2, 1, 2, 1, 0 and 1.
a. Compute the sample variance in the number of vehicles owned or leased by the 12 subscribers.
b. Test the hypothesis H0: σ² = 0.94 to determine whether the variance in the number of vehicles owned or leased by subscribers of the other magazine differs from σ² = 0.94 for Fortune. Using a 0.05 level of significance, what is your conclusion?
Solution
Denote X – the number of vehicles owned or leased by a subscriber. Here n = 12 and the sample variance is s² = 0.81.
Note that the alternative is two-sided, so there are two rejection regions, in the lower and the upper tails of the sampling distribution.
Test statistic: the χ²-statistic. With H0: σ² = 0.94, the value of the χ² statistic is computed as (n − 1)s²/σ₀² = (11 × 0.81)/0.94 = 9.478.
The critical values of the χ² statistic at the 5% level of significance are χ²_{0.975} and χ²_{0.025}. Using 11 degrees of freedom, the critical values found from the χ² table are χ²_{0.975} = 3.815 and χ²_{0.025} = 21.920, respectively.
Thus, since 3.815 < 9.478 < 21.920, we are 95% confident that we may accept the null hypothesis. More clearly, based on the sample evidence, it may be concluded that the variance in the number of vehicles owned or leased by subscribers of the other magazine does not differ from the claim for Fortune.
Practice problem
Home mortgage interest rates for 30-year fixed-rate loans vary throughout the country. During the summer of 2000, data from various parts of the country suggested that the SD of the interest rates was 0.096; the corresponding variance in interest rates would be 0.0092. Consider a follow-up study in the summer of 2003. The interest rates for 30-year fixed-rate loans at a sample of 20 lending institutions had a sample SD of 0.114. Conduct a hypothesis test of H0: σ² = 0.0092 to see whether the sample data indicate that the variability in interest rates has increased. Using the 0.01 level of significance, what is your conclusion?
Guideline
X – home mortgage interest rates for 30-year fixed-rate loans. Here n = 20, sample SD = 0.114, population SD = σ = 0.096, population variance = σ² = 0.0092 and α = 0.01.
Test statistic: the χ²-statistic. With H0: σ² ≤ 0.0092 vs. Ha: σ² > 0.0092, the value of the χ² statistic is computed as (n − 1)s²/σ₀² = (19 × 0.114²)/0.0092 = 26.83.
Decision:
Since 26.83 is less than the upper-tail critical value χ²_{0.01} = 36.191 (19 df), we are 99% confident that we may accept the null hypothesis. More clearly, based on the sample evidence, it may be concluded that the variability in interest rates has not increased.
Population Mean (μ) Test:
i) H0: μ = 5 vs. H1: μ ≠ 5; ii) H0: μ ≥ 5 vs. H1: μ < 5; iii) H0: μ ≤ 5 vs. H1: μ > 5
Test statistic (large sample, n > 30): Z_cal = (𝑥̅ − μ_{H0})/(σ̂x/√n)
Population Proportion (P) Test:
i) H0: P = 0.6 vs. H1: P ≠ 0.6; ii) H0: P ≥ 0.6 vs. H1: P < 0.6; iii) H0: P ≤ 0.6 vs. H1: P > 0.6
Test statistic: Z_cal = (p̂ − P_{H0})/√(P_{H0}(1 − P_{H0})/n)
Population SD (σ) Test:
i) H0: σ = 1.5 vs. H1: σ ≠ 1.5; ii) H0: σ ≥ 1.5 vs. H1: σ < 1.5; iii) H0: σ ≤ 1.5 vs. H1: σ > 1.5
Test statistic: χ²_cal = (n − 1)S²x/σ²_{H0}, where S²x is the sample variance.
Tests of two population means, two proportions and two standard deviations (applications from real data):
Two Means: i) H0: μ1 = μ2 vs. H1: μ1 ≠ μ2; ii) H0: μ1 ≥ μ2 vs. H1: μ1 < μ2; iii) H0: μ1 ≤ μ2 vs. H1: μ1 > μ2
Two Proportions: i) H0: P1 = P2 vs. H1: P1 ≠ P2; ii) H0: P1 ≥ P2 vs. H1: P1 < P2; iii) H0: P1 ≤ P2 vs. H1: P1 > P2
Two SDs: i) H0: σ1 = σ2 vs. H1: σ1 ≠ σ2; ii) H0: σ1 ≥ σ2 vs. H1: σ1 < σ2; iii) H0: σ1 ≤ σ2 vs. H1: σ1 > σ2
Note: the tests in (i) are two-sided (two-tailed); the others are one-sided (lower- or upper-tail) tests.
Correlation Analysis
1) Scatter Diagram – to get a rough idea of the relationship between two variables.
2) Correlation coefficient (r_xy) – indicates the strength and direction of the linear relationship between two variables.
Let’s consider the following problem to understand it very clearly!
Problem
Find the relationship between two variables and make a summary based on your findings.
Solution:
Denote x – No. of TV commercials and y – Total sales, since it is plausible that sales depend on the number of commercials.
Draw a scatter diagram to see what sort of relation exists between x and y.
[Scatter diagram: Total Sales (0–70) on the y-axis vs. No. of TV Commercials (0–6) on the x-axis]
Summary: We see that a positive relation exists between the number of TV commercials and total sales.
To quantify how strongly x and y are related, we apply the following formula (known as the correlation coefficient), defined as
r_xy = cov(x, y)/(σx·σy)
where cov(x, y) = (1/(n − 1))·Σᵢ(xᵢ − 𝑥̅)(yᵢ − 𝑦̅)
σx = √((1/(n − 1))·Σᵢ(xᵢ − 𝑥̅)²)
σy = √((1/(n − 1))·Σᵢ(yᵢ − 𝑦̅)²)
𝑥̅ = (Σᵢ xᵢ)/n and 𝑦̅ = (Σᵢ yᵢ)/n
Make the following calculation table (for details see Textbook, pp.115-116) to find r_xy:
No. of TV Commercials (x) | Total Sales (y) | (xᵢ − 3)² | (yᵢ − 51)² | (xᵢ − 3)(yᵢ − 51)
2 50 1 1 1
5 57 4 36 12
1 41 4 100 20
3 54 0 9 0
4 54 1 9 3
1 38 4 169 26
5 63 4 144 24
3 48 0 9 0
4 59 1 64 8
2 46 1 25 5
Total 30 510 20 566 99
Here cov(x, y) = 99/9 = 11, σx = √(20/9) = 1.49 and σy = √(566/9) = 7.93, so
r_xy = cov(x, y)/(σx·σy) = 11/(1.49 × 7.93) = 0.9310
Summary
We see that r_xy = 0.93, which indicates a strong positive linear relationship: as the number of TV commercials increases, total sales tend to increase as well.
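The correlation can be computed directly from the column totals in the table; note that exact arithmetic gives r ≈ 0.9305 (the 0.9310 above comes from rounding σx and σy to two decimals):

```python
from math import sqrt

x = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]            # No. of TV commercials
y = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]  # Total sales

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # = 99
sxx = sum((xi - x_bar) ** 2 for xi in x)                        # = 20
syy = sum((yi - y_bar) ** 2 for yi in y)                        # = 566
r = sxy / sqrt(sxx * syy)    # the (n-1) factors cancel in this ratio
print(round(r, 4))
```
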
Regression Analysis
Fitting a model: yᵢ = α + βxᵢ + eᵢ, where
α = constant (intercept)
β = regression coefficient of y on x
e = random error
There are two parameters, α and β. These two will be estimated from the random sample data.
Using the Ordinary Least Squares method, we find the estimated values of α and β:
𝛼̂ = 𝑦̅ − 𝛽̂ 𝑥̅
𝛽̂ = Σᵢ(xᵢ − 𝑥̅)(yᵢ − 𝑦̅) / Σᵢ(xᵢ − 𝑥̅)²
Estimated model of y on x:
ŷᵢ = 𝛼̂ + 𝛽̂xᵢ, i = 1, 2, …, n
Prediction or Forecasting
ŷp = 𝛼̂ + 𝛽̂xp
Solution:
Model: yᵢ = α + βxᵢ + eᵢ, where
y = Total sales
x = No. of commercials
α = constant
β = regression coefficient of y on x
e = random error
The two parameters α and β will be estimated from the random sample data on y and x.
Calculation table
No. of TV Commercials (x) | Total Sales (y) | (xᵢ − 3)² | (yᵢ − 51)² | (xᵢ − 3)(yᵢ − 51)
2 50 1 1 1
5 57 4 36 12
1 41 4 100 20
3 54 0 9 0
4 54 1 9 3
1 38 4 169 26
5 63 4 144 24
3 48 0 9 0
4 59 1 64 8
2 46 1 25 5
Total 30 510 20 566 99
𝛼̂ = 𝑦̅ − 𝛽̂ 𝑥̅
𝛽̂ = Σᵢ(xᵢ − 𝑥̅)(yᵢ − 𝑦̅) / Σᵢ(xᵢ − 𝑥̅)²
𝛽̂ = 99/20 = 4.95 and 𝛼̂ = 𝑦̅ − 𝛽̂𝑥̅ = 51 − (4.95 × 3) = 36.15
Summary
𝛼̂ = 36.15 means that if there are no commercials (i.e. x = 0), then expected sales are about $36.15.
𝛽̂ = 4.95 means that each additional TV commercial is expected to increase total sales by about 4.95.
Thus ŷp = 36.15 + (4.95 × 5) = $60.9.
So when there are 5 commercials in a week, the company can expect total sales of about $60.9.
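The least-squares fit and the forecast at x = 5 can be reproduced in a few lines of Python:

```python
x = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]            # No. of TV commercials
y = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]  # Total sales

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# OLS slope and intercept
beta = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
       sum((xi - x_bar) ** 2 for xi in x)
alpha = y_bar - beta * x_bar

y_pred = alpha + beta * 5          # forecast at x = 5 commercials
print(round(beta, 2), round(alpha, 2), round(y_pred, 1))
```
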
HW: Text