Math 101 Course Notes
Almost daily we apply statistical concepts in our lives. For example, to start the day you
turn on the shower and let it run for a few moments. Then you put your hand in the
shower to sample the temperature and decide to add more hot water or more cold water,
or conclude that the temperature is just right and enter the shower. As a second example,
you are at the grocery store looking to buy a frozen pizza. One of the pizza makers has a
stand, and they offer a small wedge of their pizza. After sampling the pizza, you decide
whether to purchase the pizza or not. In both the shower and pizza examples, you make a
decision and select a course of action based on a sample.
Definition of Statistics
• in its plural sense, statistics is a set of numerical data (e.g., annual GNP/GDP,
quarterly/monthly sales of a company, weekly/daily peso-dollar exchange rate)
• in its singular sense, Statistics is that branch of science which deals with the
collection, presentation, organization, analysis, and interpretation of data
Example In order to estimate the true proportion of students at a certain college who
smoke cigarettes, the administration polled a sample of 200 students and
determined that the proportion of students from the sample who smoke
cigarettes is 0.12. Identify the parameter and the statistic.
In 1662 John Graunt published an article “Natural and Political Observations Made upon
Bills of Mortality.” His “observations” were the result of his study and analysis of a
weekly church publication called “Bill of Mortality,” which listed births, christenings,
and deaths and their causes. This analysis and interpretation of social and political data
are thought to mark the start of statistics.
Fields of Statistics
a. Statistical Theory or Mathematical Statistics - deals with the development and
exposition of theories that serve as bases of statistical methods.
Below are some more examples of inferential statistics.
a) To examine the performance of the country’s financial system, we can use inferential
statistics to arrive at conclusions that apply to the entire economy using data gathered
from a sample of companies or businesses in the country.
b) To determine if reforestation is effective, we can take a representative portion of
denuded forests and use inferential statistics to draw conclusions about the effect of
reforestation on all denuded forests.
Classification of Variables
1. Qualitative (or Categorical) vs. Quantitative
Qualitative variable a variable that yields categorical responses (e.g.,
political affiliation, occupation, marital status)
Continuous variable a variable which can assume the infinitely many values
corresponding to a line interval
In 1824, a newspaper called the Harrisburg Pennsylvanian conducted the first public
opinion poll, asking residents of Wilmington, Delaware which presidential candidate they
intended to vote for. The Pennsylvanian published the results, predicting that Andrew
Jackson would win. Jackson won the popular vote, but failed to win a majority of
electoral votes, and Congress picked John Quincy Adams to be president.
Levels of Measurement
1. Nominal Level (or Classificatory Scale)
Examples:
Sex M-Male F-Female
Marital status 1-Single 2-Married 3-Widowed 4-Separated
2. Ordinal Level
The ordinal level of measurement contains the properties of the nominal level,
and in addition, the numbers assigned to categories of any variable may be ranked
or ordered in some low-to-high manner.
Examples:
Teaching ratings 1-poor 2-fair 3-good 4-excellent
Year level 1-1st yr 2-2nd yr 3-3rd yr 4-4th yr
3. Interval Level
The interval level is that which has the properties of the nominal and ordinal
levels, and in addition, the distances between any two numbers on the scale are of
known sizes. An interval scale must have a common and constant unit of
measurement. Furthermore, the unit of measurement is arbitrary and there is no
“true zero” point.
Examples:
IQ
Temperature (in Celsius)
4. Ratio Level
The ratio level of measurement contains all the properties of the interval level,
and in addition, it has a “true zero” point.
Examples:
Age (in years)
No. of correct answers in an exam
Exercise: Identify the population under study and variable/s of interest.
a) The Office of Admissions is studying the relationship between the score in the
entrance examination during application and the general weighted average upon
graduation among graduates of the university from 2000 to 2005.
b) The research division of a certain pharmaceutical company is investigating the
effectiveness of a new diet pill in reducing weight on female adults.
c) The Department of Health is interested in determining the percentage of children
below 12 years old infected by the Hepatitis B virus in Metro Manila in 2006.
Heart disease is the most common cause of death in industrialized nations. In the US and
Canada, nearly 30% of deaths yearly are due to heart disease, mainly heart attacks. Does
regular aspirin intake reduce deaths from heart attacks? Harvard Medical School
conducted a landmark study to investigate. The 22,000 male physicians participating in
the study regularly took either (assignment was determined by flipping a coin) an aspirin
or a placebo (a pill with no active ingredient). Of those who took aspirin, 0.9% had heart
attacks during the study. Of those who took the placebo, 1.7% had heart attacks, nearly
twice as many.
Can you conclude that it’s beneficial for people to take aspirin regularly? Or could the
observed difference be explained by how it was decided which people would receive aspirin
and which would receive the placebo? For instance, might those who took aspirin have had
better results merely because they were healthier (or had a better diet or exercised more
regularly), on average, than those who took the placebo?
A TV exit poll used to project the election outcome reported that 53.1% of a sample of
3889 voters said they had voted for candidate A. Was this sufficient evidence to project
A as the winner, even though the information came from such a small portion
of the more than 9.5 million voters?
If candidate A were actually going to lose the election, what’s the chance that he/she
would be supported by 53.1% of the exit poll voters? If the chance were extremely small,
we’d feel comfortable making the inference that A’s election was supported by a majority
of all 9.5 million voters.
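The chance asked about above can be approximated with a short calculation. The sketch below is only an illustration under stated assumptions: the exit poll is treated as a simple random sample, candidate A's true support is set at exactly 50% (the boundary case of not winning a majority), and the normal approximation to the sample proportion is used. The function name is hypothetical.

```python
import math

def upper_tail_probability(sample_share, n, true_share=0.5):
    """Approximate P(sample proportion >= sample_share) when the true
    population proportion is true_share (normal approximation)."""
    standard_error = math.sqrt(true_share * (1 - true_share) / n)
    z = (sample_share - true_share) / standard_error
    # Upper-tail area of the standard normal distribution.
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p

# Assumed boundary case: exactly half of all voters support A.
z, p = upper_tail_probability(sample_share=0.531, n=3889)
print(f"z = {z:.2f}, approximate chance = {p:.6f}")
```

Under these assumptions the chance is far below one in a thousand, which is the kind of "extremely small" chance the paragraph refers to.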
Chapter 2
Collection and Presentation of Data
2.1 PRELIMINARIES
Classification of Data
We now enumerate some agencies where a researcher can avail of primary data.
a) Central Bank is a primary source of data on banking and finance.
b) National Statistics Office is a primary source of data on population, housing,
and establishments.
c) Pulse Asia is a primary source of data on opinions or sentiments of the people
on current issues.
d) Bureau of Agricultural Statistics is a primary source of data on agriculture and
livestock.
2. External vs Internal
a. Internal data - information that relates to the operations and functions of the
organization collecting the data
b. External data - information that relates to some activity outside the organization
collecting the data
Example The sales data of SM is internal data for SM but external data for
any other organization such as Robinson’s.
2.2 DATA COLLECTION METHODS
1. Survey method - questions are asked to obtain information, either through a
self-administered questionnaire or a personal (or phone) interview
a) Pulse Asia conducted a sample survey on voter response to political ads in the
May 2013 election. Its respondents were selected registered voters who intend to
vote in the 2013 election.
b) The Department of Energy regularly conducts the Household Energy
Consumption Survey to measure the level and pattern of energy consumption at
the national and regional levels.
c) The Food and Nutrition Research Institute regularly conducts the National
Nutrition Survey, which generates data on malnutrition, the prevalence of anemia,
Vitamin A and iodine deficiencies, and the nutrient intake and adequacy of household
members.
2. Experimental method - a scientific investigation conducted under controlled
situations where treatments are applied and their effects are measured on the
response of interest to the experimenter. This is an excellent method of collecting
data for causation studies. If properly designed and executed, experiments will
reveal, with a good deal of accuracy, the effect of a change in one variable on
another variable.
3. Observation method - makes possible the recording of behavior but only at the
time of occurrence (e.g., observing reactions to a particular stimulus, traffic count,
behavior of animals in wildlife or newborn babies in nursery).
Advantages:
• Observation is superior to the survey method in collecting data on
nonverbal behavior. In a survey, the researcher may encounter all sorts of
difficulties such as deliberate denial or memory failure. On the other
hand, an observer can make field notes that record the salient features of
the behavior, or may even record behavior in its totality via videotape.
• Observation is superior to experiment in the sense that behavior takes
place in its natural environment. However, the presence of an observer
may alter the true behavior of the subjects.
• The observer is able to conduct his study in the subject’s natural
environment, and is thus usually able to study over a much longer time
period than with either survey or experiment.
Disadvantages:
• Data collected using observation method are difficult to analyze.
Measurements in observational studies take the form of the observer’s
qualitative perceptions rather than the quantitative measures often used in
survey research or experimentation.
• There are certain characteristics of interest that cannot be observed, such as
opinions and beliefs. Also, there are certain activities that subjects will
refuse to have observed.
• For field studies that are conducted in the natural environment, the
observer might find it difficult to enter settings such as
secret environments or private companies.
Possible sources:
a) The National Statistics Office is a major collector of data for both private and
government needs. It provides the public with basic data on various subject
matters such as household income and expenditure, housing, education, health,
employment, and others.
b) The National Statistical Coordination Board compiles data necessary for the
computation of the gross national product, gross domestic product, consumer
price index, and other indices.
c) The Department of Health is responsible for health statistics like prevalence of
diseases among infants and pregnant women, morbidity rates, family planning
methods, etc.
d) The Social Weather Stations keeps a record of poll results, social issues, and
others.
e) Theses of graduate students contain data used in their statistical inquiry.
In a case-control study, “cases” who have a particular attribute or condition are compared
to “controls” who do not. The idea is to compare the cases and controls to see how they
differ on an explanatory variable of interest. In medical settings, the cases usually are
individuals who have been diagnosed with a particular disease. Researchers then identify
a group of controls who are as similar as possible to cases, except that they don’t have the
disease. For example, samples of male heart attack patients (cases) and other male
hospital patients (controls) were compared with respect to the extent of baldness.
Clinical trials are experiments that study the effectiveness of medical treatments on actual
patients.
Definition Census or complete enumeration is the process of gathering information
from every unit in the population.
• not always possible to get timely, accurate and economical data
• costly, if the number of units in the population is too large
Definition Survey sampling is the process of obtaining information from the units in
the selected sample.
2.3 PROBABILITY AND NON-PROBABILITY SAMPLING
Convenience sampling
Thus, in conducting the survey, the researchers sought the assistance of doctors with
private clinics. When a patient consults one of these doctors and has AIDS, the social
scientists would interview this patient in return for a free-of-charge consultation. With
this method, the sample will include persons who consulted one of the appointed
physicians and volunteered to participate in the study to avail of the free consultation.
a) A researcher may use a particular district, province, or city to be the sample cluster in
representing their population of interest. For instance, the researcher can identify a
specific district of Quezon City whose households have the same profile in terms of the
socio-economic characteristics as the households in the whole Quezon City.
b) For a study that aims to predict the senatorial winners in the national election, a
researcher may include in the sample the provinces that have voted for the actual winners
in a series of past senatorial elections.
We give an example of a government study using purposive sampling.
The Producer’s Price Survey of NSO is a nationwide undertaking intended to provide the
price data needed in the computation of the Producer’s Price Index for manufacturing. To
select the items included in the sample, NSO used purposive sampling by using a set of
criteria to identify the commodities for the market basket. Some of the criteria are: (i) the
commodity has a relatively high market share; (ii) the commodity was available in the
market in the base year; and (iii) the current production and the market share of the
commodity have been stable during the last three years, based on the
NSO Annual Survey of Establishment reports.
A researcher wishes to study the people’s views on birth control. The researcher believes
that a person’s views on birth control and his religion are related. Census results showed
that 70% of the people in the population are Catholics, 20% are Protestants, and 10% are
Muslims. The researcher then selects a sample reflecting the same proportions to
represent the three groupings. If there should be 200 respondents in the sample, then the
quotas set for the groups are as follows: (i) Catholics - 70% of 200 = 140,
(ii) Protestants - 20% of 200 = 40, and (iii) Muslims - 10% of 200 = 20. This is quota
sampling and not stratified sampling if the researcher leaves the selection of the 140
Catholics, 40 Protestants, and 20 Muslims to the discretion of the interviewers.
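The quota arithmetic in this example can be reproduced with a few lines of code. This is a minimal sketch, not part of the original notes; the function name is hypothetical, and the whole-number division works cleanly here only because each percentage of 200 is a whole number.

```python
def quota_allocation(sample_size, percent_by_group):
    """Split sample_size across groups using whole-number percentages."""
    return {
        group: sample_size * percent // 100
        for group, percent in percent_by_group.items()
    }

# Census percentages from the example above.
quotas = quota_allocation(200, {"Catholics": 70, "Protestants": 20, "Muslims": 10})
print(quotas)
```

The code only sets the quotas. Who fills each quota is still left to the interviewers, which is what makes the design quota sampling rather than stratified sampling.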
Shortly after Bill Clinton became President of the United States, a television station in
Sacramento, California asked viewers to respond to the question, “Do you support the
President’s economic plan?” The next day, the results of a properly conducted study that
asked the same question were published in the newspaper.
During the Spanish Period, censuses of the Philippines were taken several times, but the
Americans doubted their accuracy. Immediately after establishing the civil government,
William H. Taft set a national census on March 2, 1903 and declared it Census Day.
Two years later, the statistics gathered in 1903 were published in book form. The output
consisted of four volumes, and highlighted population and livelihood statistics.
Based on this census, the total population of the Philippines in 1903 was 7.6 million.
Methods of Probability Sampling
To do this, we first list down all the 30 members of the organization and assign a
unique serial number, from 01 to 30, to each one of them.
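As an illustration only, the selection can also be carried out with a pseudo-random number generator in place of a printed table of random numbers. In the sketch below the function name, the sample size of 10, and the seed are assumptions made for the example.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct serial numbers from 1..population_size,
    each subset of that size being equally likely."""
    rng = random.Random(seed)
    serials = range(1, population_size + 1)
    return sorted(rng.sample(serials, sample_size))

# 30 members with serial numbers 01 to 30; sample size 10 is assumed here.
chosen = simple_random_sample(30, 10, seed=101)
print([f"{serial:02d}" for serial in chosen])
```

Because `random.sample` draws without replacement, no member can be selected twice.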
Table of Random Numbers
00-03 04-07 08-11 12-15 16-19 20-23 24-27 28-31 32-35
00 4103 5778 4099 4089 2236 1361 5612 5858 4155
01 7786 4358 4934 9335 3397 3345 1507 0814 0066
02 7654 7803 4234 2322 0129 3253 0275 6836 2185
03 9655 4260 5253 1509 3752 0033 0091 0905 1468
04 5696 1350 9977 7147 8347 7317 9233 8409 3032
05 0803 0281 0159 9634 6566 1766 4195 6427 9168
06 7686 4882 1689 5058 7234 0736 2745 1171 8456
07 4794 1204 6465 4569 3882 2388 2520 6216 0422
08 7037 8610 0584 6101 5070 8476 4118 0783 3639
09 8983 6597 2170 0685 7814 5426 5695 6792 7673
10 8960 3638 7791 1494 2158 0141 3176 2025 4677
11 5931 4049 3766 0345 5865 4833 8357 0211 0240
12 1202 5203 3956 6740 1958 1596 6633 2408 2446
13 6260 3898 8687 7694 1242 7541 8720 4938 9196
14 0364 3201 0251 5461 3231 2830 9935 0924 8650
15 4572 3577 2706 4717 2038 1440 9125 6479 3731
16 9291 4477 1367 6456 7869 0190 8694 6236 6131
17 2377 5010 6496 2096 2648 0015 1567 5608 6394
18 3254 5512 9426 4582 2983 4365 1314 3668 4344
19 4682 2050 9419 3621 3136 3683 3030 5798 8838
20 5057 5249 9688 3653 5955 4694 1707 7437 6956
21 9983 6640 7507 1631 6683 4144 3336 6913 6167
22 2329 1180 0219 5456 8229 0172 7285 6811 0659
23 0370 5889 8506 5009 6501 3894 2396 6676 6389
24 1813 3784 1475 9608 9697 4478 9921 5364 8896
25 0185 3219 8044 5119 5448 5960 4397 4139 9267
26 8811 7537 4068 2362 4012 3407 2482 5714 5588
27 5984 0989 2803 4479 6081 9657 4600 1828 9219
28 2035 8234 3506 3649 3511 1842 6078 7935 7862
29 0677 3199 0161 8660 9495 1640 6736 5648 2017
30 6343 9781 5862 7606 8359 6610 1028 4987 2845
31 8870 0077 6080 2682 4846 9842 4408 4693 6444
32 9373 5887 9700 9074 3647 9086 3264 9367 3325
33 2910 8091 5165 4562 2599 6184 8283 2732 8337
34 9122 4000 1643 5485 1897 9943 0010 2284 8130
Advantages
• The theory involved is much easier to understand than the theory behind other
sampling designs.
• Inferential methods are simple and easy.
Disadvantages
• The sample chosen may be widely spread, thus entailing high transportation costs.
• A population frame, or list, is needed.
• Less precise estimates result if the population is heterogeneous with respect to the
characteristic under study.
Below are some examples of simple random sampling.
Advantages
• It is easier to draw the sample and often easier to execute without mistakes than
simple random sampling.
• It is possible to select a sample in the field without a sampling frame.
• The systematic sample is spread more evenly over the population.
Disadvantages
• If periodic regularities are found in the list, a systematic sample may consist only
of similar types. (Example: Store sales over seven days of the week – estimating
total sales based on a systematic sample every Tuesday would be unwise.)
• Knowledge of the structure of the population is necessary for its most effective
use.
Example Suppose we wish to select a sample of farms to estimate the total farm
production. If we have a list of farms with their corresponding sizes in
square meters, we can arrange the farms first according to size before we
select our systematic sample.
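The farm example can be sketched in code as follows. The farm names and sizes are hypothetical, and the rule used (interval k = N // n with a random start among the first k units) is one common way of drawing a systematic sample, assumed here for illustration.

```python
import random

def systematic_sample(ordered_units, sample_size, seed=None):
    """Take a 1-in-k systematic sample from an already ordered list."""
    k = len(ordered_units) // sample_size       # sampling interval
    start = random.Random(seed).randrange(k)    # random start among the first k units
    return [ordered_units[start + i * k] for i in range(sample_size)]

# Hypothetical farm sizes in square meters (not from the notes).
farm_sizes = {
    "Farm A": 5200, "Farm B": 800, "Farm C": 15000, "Farm D": 2300,
    "Farm E": 9700, "Farm F": 1200, "Farm G": 30000, "Farm H": 4100,
    "Farm I": 650, "Farm J": 7600, "Farm K": 12500, "Farm L": 3300,
}

# Arrange the farms from smallest to largest before sampling.
ordered_farms = sorted(farm_sizes, key=farm_sizes.get)
sample = systematic_sample(ordered_farms, sample_size=4, seed=7)
print(sample)
```

Sorting by size first means the selected farms are spread across small, medium, and large farms, which is the reason for arranging the list before sampling.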
Stratified Sampling
Let us select a sample from the same population used in the previous Example but
this time we will use stratified sampling. The population has N=30 members of an
organization and the sample size is n=10 members. If the stratification variable is sex
then we would partition the population into two strata: (i) Stratum 1 – Males, and (ii)
Stratum 2 – Females. One way of allocating the 10 units in the sample is to distribute
them equally into the two strata. Thus, we will select n1=5 males and n2=5 females.
MALES FEMALES
01 Almeda, Joel         01 Abad, Melissa       12 Querido, Rose
02 Baluyot, Temy        02 Conlin, Juliet      13 Quiambao, Gina
03 Cruz, Raks           03 Corpuz, Joan        14 Quidayan, Candy
04 Fuentes, Mar         04 Dayrit, Erlyn       15 Santos, Emily
05 Lanuza, Jon          05 Diaz, Aurora        16 Tablante, Rita
06 Macasaet, Erwin      06 Foz, Vivian         17 Tolentino, Magda
07 Peña, Lito           07 Gomez, May          18 Tuason, Joy
08 Quebral, Joseph      08 Joson, Sonia        19 Zamora, Bea
09 Surla, Michael       09 La Pierre, Amy
10 Valdez, Ernie        10 Le, Diana
11 Venegas, Anthony     11 Macaibay, Macky
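With the two strata numbered as in the lists above (11 males, 19 females), the equal allocation of 5 units per stratum could be drawn as in the following sketch. The function name and the seed are assumptions, and a pseudo-random generator stands in for the table of random numbers.

```python
import random

def stratified_sample(stratum_sizes, allocation, seed=None):
    """Draw an independent simple random sample of serial numbers
    from each stratum, following the given allocation."""
    rng = random.Random(seed)
    return {
        stratum: sorted(rng.sample(range(1, size + 1), allocation[stratum]))
        for stratum, size in stratum_sizes.items()
    }

strata = {"Males": 11, "Females": 19}            # stratum sizes from the lists
equal_allocation = {"Males": 5, "Females": 5}    # n1 = 5, n2 = 5
selected = stratified_sample(strata, equal_allocation, seed=2024)
print(selected)
```

The serial numbers are drawn separately within each stratum, so serial 03 among the males and serial 03 among the females refer to different people.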
Advantages
• Stratification may produce a gain in precision in the estimates of characteristics of
the population
• It allows for more comprehensive data analysis since information is provided for
each stratum.
• It is administratively convenient.
Disadvantages
• A listing of the population for each stratum is needed.
• The stratification of the population may require additional prior information about
the population and its strata.
Suppose we want to get the opinion of business administration college students regarding
premarital sex. A good stratification variable is sex because the views of the males may
be very different from the views of the females. The population consists of N=500
business administration students and the sample size is n=50. Out of the 500, there are
300 female and 200 male students. The list of business administration students, together
with their respective sex, is available at the records section of the college, or at the Office
of the Registrar.
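The notes do not state here how the 50 sample units are split between the two strata. One possibility, assumed purely for illustration, is proportional allocation, where each stratum receives a share of the sample equal to its share of the population. The function name is hypothetical.

```python
def proportional_allocation(sample_size, stratum_sizes):
    """Allocate the sample to strata in proportion to stratum size."""
    total = sum(stratum_sizes.values())
    return {
        stratum: round(sample_size * size / total)
        for stratum, size in stratum_sizes.items()
    }

# N = 500 students: 300 female and 200 male; n = 50.
allocation = proportional_allocation(50, {"Females": 300, "Males": 200})
print(allocation)
```

In general, rounding can make the allocated sizes add up to slightly more or less than n, so the total should be checked. Here the shares are whole numbers and the check is trivial.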
The Business Expectations Survey is a nationwide survey, which the Bangko Sentral ng
Pilipinas conducts every semester. The survey provides information useful to policy
makers and monetary managers for their economic and financial policy planning. It
presents data on the general perceptions of the business sector on the current state of
business and the economic prospects for the succeeding semester, and it computes
indicators of economic activity.
In the 2000 BES, the sampling frame was the Securities and Exchange Commission list
of the Philippines’ Top 2000 Corporations. BSP stratified the firms in the list according
to the nine industry groups of the Philippine Standard Industry Classification. This allows
the representation of each industry group in the sample. BSP selected the sample of firms
from each industry group using systematic sampling.
Cluster Sampling
Clusters may be of equal or unequal size. When all of the clusters are of the same size,
the number of elements in a cluster will be denoted by M while the number of clusters in
the population will be denoted by N.
Sample-Selection Procedure
Step 1 Number the clusters from 1 to N.
Step 2 Select n numbers from 1 to N at random. The clusters corresponding to the
selected numbers form the sample of clusters.
Step 3 Observe all the elements in the sample of clusters.
Step 1: Decide on how to divide the population into non-overlapping clusters. In this
example, we will use the barangays as the clusters so that the elementary units are the
households but the sampling units are the barangays.
Step 2: Get a list of all barangays in Mandaluyong City. Number the barangays in the
list, consecutively from 1 to 27.
Step 3: Suppose we decide to include n=5 clusters in the study. Use the table of
random numbers to obtain 5 distinct numbers less than or equal to 27.
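A rough sketch of these steps is given below. The household counts per barangay are made up for the example, and a pseudo-random generator is used in place of the table of random numbers.

```python
import random

rng = random.Random(5)  # fixed seed only so that the run is repeatable

# Hypothetical number of households in each of the 27 barangays.
households = {barangay: rng.randint(200, 900) for barangay in range(1, 28)}

# Select n = 5 distinct barangay numbers from 1 to 27.
selected_barangays = sorted(rng.sample(range(1, 28), 5))

# In cluster sampling, every household in a selected barangay is observed.
total_observed = sum(households[b] for b in selected_barangays)

print(selected_barangays, total_observed)
```

Only the list of 27 barangays is needed to draw the sample. A list of households is required only inside the five selected barangays.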
Advantages
• A population list of elements is not needed; only a population list of clusters is
required. Listing cost is reduced.
• Transportation cost is reduced.
Disadvantages
• The costs and problems of statistical analysis are greater.
• Estimation procedures are more difficult.
The National Statistics Office conducts the Census of Agriculture and Fisheries to collect
data from all agricultural and fishing operators, and all households engaged in
agricultural and fishing activities.
However, due to budgetary constraints, NSO was only able to collect sample data for the
1991 CAF. NSO used cluster sampling, where the barangays served as the clusters. For
each city/municipality, NSO prepared a list of barangays arranged in descending order,
according to the total farm area in the whole barangay. From this list, NSO selected a
sample of barangays using systematic sampling. All agricultural and fishing operators
and all households engaged in agricultural and fishing activities in the selected barangays
were included in the study. In the end, NSO included a total of 5,997,427 operators and
households for this study.
Multistage Sampling
Advantages
• Listing cost is reduced.
• Transportation cost is reduced.
Disadvantages
• Estimation procedure is difficult, especially when the primary stage units are not
of the same size.
• Estimation procedure gets more complicated as the number of sampling stages
increases.
• The sampling procedure entails much planning before selection is done.
We now present an actual survey that used two-stage sampling in selecting the sample of
elements.
The Food and Nutrition Research Institute of the Department of Science and Technology
conducts the National Nutrition Survey every 5 years. This survey aims to determine the
prevalence of malnutrition and specific health problems in the country and to provide
data on food consumption and nutrient intake.
For this study, FNRI used 2-stage sampling to select a sample of individuals from each
province. The primary stage units are the barangays. The second stage units are the
individuals.
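To make the two stages concrete, here is a small sketch with a made-up province of 12 barangays and 40 individuals each. Simple random sampling at both stages is an assumption of the sketch; the notes do not say which selection method FNRI used at each stage. All names are hypothetical.

```python
import random

def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: sample clusters. Stage 2: sample elements within each
    selected cluster (simple random sampling at both stages)."""
    rng = random.Random(seed)
    chosen_clusters = rng.sample(sorted(clusters), n_clusters)
    return {
        cluster: rng.sample(clusters[cluster], min(n_per_cluster, len(clusters[cluster])))
        for cluster in chosen_clusters
    }

# Hypothetical province: 12 barangays, 40 individuals in each.
province = {
    f"Barangay {b}": [f"B{b}-P{i}" for i in range(1, 41)]
    for b in range(1, 13)
}

two_stage = two_stage_sample(province, n_clusters=3, n_per_cluster=8, seed=11)
for barangay, people in two_stage.items():
    print(barangay, people)
```

Unlike cluster sampling, only some of the individuals in each selected barangay are observed.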
We now present an actual survey that used three-stage sampling in selecting the sample
of elements.
The Department of Tourism conducts the Visitor Sample Survey every month. This
survey aims to collect data on the demographic profile, travel characteristics, and
preferences of foreign visitors and overseas Filipinos who visited the country, for tourism
development planning and policy-making purposes.
DoT selects the sample of visitors using three-stage sampling. The primary stage units are
the weeks of the month. The second-stage units are the weekly flights. The third-stage
units are the visitors.
For this monthly survey, DoT selects the week of the month using simple random
sampling. From the selected week, they select a sample of weekly flights. They perform
this using stratified random sampling. DoT stratified all the regular weekly international
flights leaving the different international airports in the Philippines according to country
market. It then selects a sample of flights from each country market using simple random
sampling. From the selected flights, DoT selects a sample of visitors using simple
random sampling.
Read on The Questionnaire (Optional)
• Strategies in Writing the Questions (Closed- vs. Open-ended questions)
• Pitfalls to Avoid in Wording Questions
• Ways to Avoid Irrelevant Questions
• Question Order
• Cover Letter/ Introduction
• Pretest
2.4 TABULAR AND GRAPHICAL PRESENTATION OF DATA
Textual Presentation
• data incorporated into a paragraph of text
Example
The 2013 Young Adult Fertility Study (YAFS 4) conducted by the Demographic
Research & Development Foundation and the University of the Philippines
Population Institute shows that 32 percent of young Filipinos between the ages 15 to
24 have had sex before marriage. Of these, 78 percent reported that their first sexual
encounter was unprotected: 84 percent among young women and 73 percent
among young men.

The same study also found that 7.3 percent have engaged in casual sex while 3.5
percent have had regular sex without emotional attachment (FUBU). Five
percent of young men disclosed having experienced sex with another man (MSM).
Among individuals who are either formally married or in a live-in arrangement, 3
percent said they ever had an extra-marital affair.

Regional difference in premarital sex prevalence shows the National Capital
Region (NCR) having the highest prevalence at 41 percent and ARMM, the
lowest (7.7 percent).

Findings from the 2013 Young Adult Fertility and Sexuality Study (YAFS 4)
…show that the levels of current drug use, drinking alcohol and smoking among
young people aged 15-24 have dropped considerably. The declining pattern is
found in the practices of both young men and women, as well as in younger and
older youth.

The percentage of young people who are “current smokers” declined from 20.9
percent in 2002 to 19.7 percent in 2013. Eleven years ago, 41 percent of young
Filipinos reported to be “current alcohol drinkers”. Now, 37 percent of young adults
are engaged in this behavior. But the most substantial decline is found in drug use.
Only 4 percent admitted to have ever used drugs in 2013, compared to almost 11
percent in 2002.

The National Capital Region has the highest level of youth smokers (27 percent)
while ARMM registered the lowest. Only 12 percent of young people in ARMM are
smokers.
Advantages
• gives emphasis to significant figures and comparisons
• simplest and most appropriate approach when there are only a few numbers to be
presented
Disadvantages
• when a large mass of quantitative data are included in a text or paragraph, the
presentation becomes almost incomprehensible
• written paragraphs can be tiresome to read especially if the same words are repeated so
many times
Tabular Presentation
• the systematic organization of data in rows and columns
Advantages
• more concise than textual presentation
• easy to understand
• facilitates comparisons & analysis of relationship among different categories
• presents data in greater detail than a graph
2. Box Head - the portion of the table that contains the column heads which describe the
data in each column, together with the needed classifying and qualifying spanner heads.
3. Stub - the portion of the table usually comprising the first column on the left, in which
the stubhead and row captions, together with the needed classifying and qualifying center
head and subheads are located. The stubhead describes the stub listing as a whole in
terms of the classification presented. The row caption is a descriptive title of the data on
the given line.
4. Field - main part of the table; contains the substance or the figures of one’s data
5. Source note - an exact citation of the source of data presented in the table (should
always be placed when the figures are not original)
Guidelines
• The title should be concise, written in telegraphic style, not in a complete sentence.
• Column labels should be precise. Stress differences rather than similarities between
adjacent columns. As much as possible, two or more adjacent columns should not begin
or end with the same phrase. This is frequently a signal that a spanner head is needed.
• The arrangement of lines in the stub depends on the nature of classification, purpose of
presentation or limitations of space.
• Categories should not overlap.
• The units of measure must be clearly stated.
• Show any relevant total, subtotals, percentages, etc.
• Indicate if the data were taken from another publication by including a source note.
• Tables should be self-explanatory, although they may be accompanied by a paragraph
that will provide an interpretation or direct attention to important figures.
Graphical Presentation
• a graph or chart is a device for showing numerical values or relationships in pictorial
form
Advantages
• main features and implications of a body of data can be grasped at a glance
• can attract attention and hold the reader’s interest
• simplifies concepts that would otherwise have been expressed in so many words
• can readily clarify data and frequently brings out hidden facts and relationships
Common Types of Graph
1. Line Chart - graphical presentation of data especially useful for showing trends over a
period of time.
2. Pie Chart - a circular graph that is useful in showing how a total quantity is distributed
among a group of categories. The “pieces of the pie” represent the proportions of the total
that fall into each category.
3. Bar Chart - consists of a series of rectangular bars where the length of the bar
represents the quantity or frequency for each category if the bars are arranged
horizontally. If the bars are arranged vertically, the height of the bar represents the
quantity.
4. Pictorial unit chart – a pictorial chart in which each symbol represents a definite and
uniform value
2.5 THE FREQUENCY DISTRIBUTION TABLE
Definition. The raw data is the set of data in its original form.
50 57 63 69 72 74 77 80 82 84 87
50 59 65 69 72 75 77 80 82 84 87
50 59 66 69 72 75 77 80 82 85 88
50 60 66 69 72 75 77 81 83 85 89
50 60 68 70 73 75 78 81 83 86 89
50 60 68 71 73 75 79 81 84 86 91
51 62 68 71 73 76 79 81 84 87 92
52 62 68 71 73 76 79 82 84 87 94
53 62 68 71 74 76 79 82 84 87 94
53 62 69 72 74 76 79 82 84 87 96
Suppose we have data on number of children of 50 married women using any modern
contraceptive method.
0 0 1 2 2 2 3 3 4 4
0 0 1 2 2 3 3 3 4 4
0 1 1 2 2 3 3 3 4 4
0 1 1 2 2 3 3 3 4 5
0 1 1 2 2 3 3 3 4 5
Since there are only 6 unique values in the data set, we use
single-value grouping.
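Single-value grouping simply counts how often each distinct value occurs. The following is a minimal Python sketch for the 50 observations above; the variable names are illustrative only.

```python
from collections import Counter

# Number of children of the 50 married women, row by row as listed above
children = (
    [0, 0, 1, 2, 2, 2, 3, 3, 4, 4]
    + [0, 0, 1, 2, 2, 3, 3, 3, 4, 4]
    + [0, 1, 1, 2, 2, 3, 3, 3, 4, 4]
    + [0, 1, 1, 2, 2, 3, 3, 3, 4, 5]
    + [0, 1, 1, 2, 2, 3, 3, 3, 4, 5]
)

freq = Counter(children)  # maps each distinct value to its frequency
for value in sorted(freq):
    print(value, freq[value])
```

The frequencies should add up to the number of observations, which is a quick check on the tally.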
In the construction of a frequency distribution, the various items of a series are classified
into groups. The frequency distribution table shows the number of items falling into
each group.
Definition of terms
1. Class interval - the numbers defining the class
2. Class limits - the end numbers of the class
3. Open-end class - a class that has no lower limit or upper limit
4. Class frequency - the number of observations falling in the class
5. Class size - the difference between the upper class limits of the class and the preceding
class; can also be computed as the difference between the lower class limits of the next
class and the class
2. Determine the approximate class size. Whenever possible, all classes should be of the
same size. The following steps can be used to determine the class size.
• Solve for the range, R = max – min.
• Compute C’ = R ÷ K.
• Round off C’ to the same number of decimal places as the original data. Call the
result C, and use C as the class size.
3. Determine the lowest class limit. The first class must include the smallest value in the
data set and must agree with the number of decimal places in the dataset.
4. Determine all class limits by adding the class size, C, to the limit of the previous class.
5. Tally the frequencies for each class. Sum the frequencies and check against the total
number of observations.
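The steps above can be sketched in Python for the 110 final grades. The number of classes K is treated as an input here; K = 10 is an assumption chosen because it reproduces the ten classes used for the grades in these notes, and all variable names are illustrative.

```python
# Final grades of the 110 students (the raw data listed at the start of this section)
grades = [int(x) for x in """
50 57 63 69 72 74 77 80 82 84 87
50 59 65 69 72 75 77 80 82 84 87
50 59 66 69 72 75 77 80 82 85 88
50 60 66 69 72 75 77 81 83 85 89
50 60 68 70 73 75 78 81 83 86 89
50 60 68 71 73 75 79 81 84 86 91
51 62 68 71 73 76 79 81 84 87 92
52 62 68 71 73 76 79 82 84 87 94
53 62 68 71 74 76 79 82 84 87 94
53 62 69 72 74 76 79 82 84 87 96
""".split()]

K = 10                           # assumed number of classes
R = max(grades) - min(grades)    # range
C = round(R / K)                 # C' = R / K, rounded to whole numbers like the data
lowest = min(grades)             # lowest class limit; the first class must contain the minimum

# Determine the class limits by adding C repeatedly, then tally each class
classes = []
lower = lowest
while lower <= max(grades):
    upper = lower + C - 1        # the data are whole numbers, so classes do not overlap
    f = sum(lower <= x <= upper for x in grades)
    classes.append((lower, upper, f))
    lower += C

for lo, hi, f in classes:
    print(f"{lo}-{hi}: {f}")

# Check the tally against the total number of observations
assert sum(f for _, _, f in classes) == len(grades)
```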
1. Histogram - a bar graph that displays the classes on the horizontal axis and the
(relative) frequencies (percentage) of the classes on the vertical axis; the vertical lines of
the bars are erected at the class boundaries and the height of each bar corresponds to the
class (relative) frequency (percentage)
CB           f
49.5-54.5   10
54.5-59.5    3
59.5-64.5    8
64.5-69.5   13
69.5-74.5   17
74.5-79.5   19
79.5-84.5   22
84.5-89.5   13
89.5-94.5    4
94.5-99.5    1
[Figure: histogram of the class frequencies; horizontal axis shows the class boundaries from 49.5 to 99.5, vertical axis shows frequency from 0 to 25]
CM   RF
62   .07
67   .12
72   .15
77   .17
82   .20
87   .12
92   .04
97   .01
[Figure: relative frequency plotted against the class mark; horizontal axis runs from 47 to 102]
3. Ogives - graphs of the cumulative frequency distribution
a. < ogive - the <CF is plotted against the UCB
b. > ogive - the >CF is plotted against the LCB
UCB    <CF
54.5    10
59.5    13
64.5    21
69.5    34
74.5    51
79.5    70
84.5    92
89.5   105
94.5   109
99.5   110
[Figure: less-than ogive; horizontal axis shows the class boundaries from 49.5 to 99.5, vertical axis shows cumulative frequency from 0 to 120]
LCB    >CF
59.5    97
64.5    89
69.5    76
74.5    59
79.5    40
84.5    18
89.5     5
94.5     1
[Figure: greater-than ogive; horizontal axis shows the class boundaries from 49.5 to 99.5, vertical axis shows cumulative frequency]
The stem-and-leaf display is an alternative method for describing a set of data. It presents
a histogram-like picture of the data, while allowing the experimenter to retain the actual
observed values of each data point. Hence, the stem-and-leaf display is partly tabular and
partly graphical in nature.
In creating a stem-and-leaf display, we divide each observation into two parts, the stem
and the leaf. For example, we could divide the data value 234 as follows:
Stem Leaf
2 | 34
Alternatively, we could choose the point of division between the units and tens, whereby
Stem Leaf
23 | 4
The choice of the stem and leaf coding depends on the nature of the data set.
Example: Typing speeds (net words per minute) for 20 secretarial applicants
68 72 91 47
52 75 63 55
65 35 84 45
58 61 69 22
46 55 66 71
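For the typing speeds, a natural coding takes the tens digit as the stem and the units digit as the leaf. The following Python sketch builds the display under that assumption; the helper names are illustrative.

```python
from collections import defaultdict

# Typing speeds (net words per minute) of the 20 applicants
speeds = [68, 72, 91, 47, 52, 75, 63, 55, 65, 35,
          84, 45, 58, 61, 69, 22, 46, 55, 66, 71]

stems = defaultdict(list)
for x in sorted(speeds):
    stems[x // 10].append(x % 10)   # stem = tens digit, leaf = units digit

print("Stem | Leaf")
for stem in range(min(stems), max(stems) + 1):
    leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem:>4} | {leaves}")
```

Sorting the data first puts the leaves of each stem in increasing order, so the display also serves as an ordered array.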
Exercises on FDT and SALD:
1. Americans are becoming increasingly concerned with the incidence of crime, and
voluminous data is being collected to document the magnitude of the problem.
The following table displays data on number of rapes per 100,000 residents for
the 50 states and the District of Columbia.
Florence Nightingale is known as the founder of the nursing profession. However, she
also saved many lives by using statistical analysis. When she encountered an unsanitary
condition or an undersupplied hospital, she improved the conditions and then used
statistical data to document the improvement. Thus, she was able to convince others of
the need for medical reform, particularly in the area of sanitation. She developed original
graphs to demonstrate that, during the Crimean War, more soldiers died from unsanitary
conditions than were killed in combat.
2. The paper “The Acid Rain Controversy: The Limits of Confidence” (Amer.
Statistician (1983): 385-394) presented data on average sulfur dioxide emission
rates for industrial and utility boilers in 47 states (data from Alaska, Hawaii, and
Idaho were not given).
.3 .9 1.5 2.5
.6 .6 1.4 2.7
.4 1.5 1.9 2.9
.5 1.5 1.0 2.1
.2 1.3 1.7 2.9
.7 1.2 1.8 3.8
.2 1.2 1.7 3.6
.7 1.0 1.8 3.4
.7 1.4 1.4 3.7
.5 1.0 2.3 4.2
.1 1.7 2.7 4.5
.6 1.5 2.2
3. The NSCB presented the following figures on the total deposits (in millions of
pesos) in the government banks in the different provinces in the Philippines in
2001.
Dataset 1
17.8 31.3 45.7
18.4 32.5 46.2
19.7 32.5 46.8
20.7 32.9 46.9
23.6 33.7 47
24.4 34 49.5
24.7 34.3 49.7
25.8 34.3 49.9
25.9 37.8 51.5
27.3 37.9 52.4
27.9 39.4 53.6
29 40.4 53.7
29.5 40.9 62.2
29.8 42.2 64
29.8 42.6 72.9
30 43.3 77.6
31 44.1 88.1
CI            CI            CI
17.0-27.4     7.4-17.8      17.8-28.2
27.5-37.9     17.9-28.3     28.3-38.7
:             :             :
80.0-90.4     :             80.8-91.2
$k = \sqrt{n} = \sqrt{51} = 7.141428429$
c’ = Range/k = 9.843968991
c = 9.8
1 78 84 97
2 07 36 44 47 58 59 73 79 90 95 98 98
3 00 10 13 25 25 29 37 40 43 43 78 79 94
4 04 09 22 26 33 41 57 62 68 69 70 95 97 99
5 15 24 36 37
6 22 40
7 29 76
8 81
Dataset 2
0.1 0.7 1.5 2.5
0.2 0.9 1.5 2.7
0.2 1 1.5 2.7
0.3 1 1.7 2.9
0.4 1 1.7 2.9
0.5 1.2 1.7 3.4
0.5 1.2 1.8 3.6
0.6 1.3 1.8 3.7
0.6 1.4 1.9 3.8
0.6 1.4 2.1 4.2
0.7 1.4 2.2 4.5
0.7 1.5 2.3
CI         CI
-.5-.1     .1-.7
.2-.8      .8-1.4
:          :
:          4.3-4.9
$k = \sqrt{n} = \sqrt{47} = 6.855655$
c’ = Range/k = .641806
c = .6
0 1 2 2 3 4 5 5 6 6 6 7 7 7 9
1 0 0 0 2 2 3 4 4 4 5 5 5 5 7 7 7 8 8 9
2 1 2 3 5 7 7 9
3 4 6 7 8
4 2 5
Dataset 3
74 329 839 1932
117 332 850 2071
128 347 906 2174
142 358 933 2174
151 368 1084 2430
154 403 1094 2438
157 434 1104 2530
167 437 1114 2544
170 440 1125 2732
198 448 1131 2775
214 450 1169 2809
215 471 1195 2826
218 520 1280 2846
235 528 1390 2999
241 569 1431 3345
245 604 1551 3502
268 622 1632 4515
285 645 1643 6632
304 654 1771
320 680 1880
CI          CI
-825-74     74-973
75-974      974-1873
:           :
:           6374-7273
$k = \sqrt{n} = \sqrt{78} = 8.831761$
c’ = Range/k = 742.5473
c = 743
0 074 117 128 142 151 154 157 167 170 198 214 215 218 235 241 245 268 285 304 320 329 332 347 358 36
1 084 094 104 114 125 131 169 195 280 390 431 551 632 643 771 880 932
2 071 174 174 430 438 530 544 732 775 809 826 846 999
3 345 502
4 515
5
6 632
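The values of k, c’ and c shown for Datasets 1 to 3 can be reproduced with a short Python sketch. It assumes, as those worked values suggest, that k is the square root of n and that c’ is rounded to the number of decimal places in the data; the dictionary layout is illustrative.

```python
import math

# (n, minimum, maximum, decimal places in the data) for Datasets 1 to 3
datasets = {
    "Dataset 1": (51, 17.8, 88.1, 1),
    "Dataset 2": (47, 0.1, 4.5, 1),
    "Dataset 3": (78, 74, 6632, 0),
}

results = {}
for name, (n, smallest, largest, decimals) in datasets.items():
    k = math.sqrt(n)                   # approximate number of classes
    c_prime = (largest - smallest) / k # range divided by k
    c = round(c_prime, decimals)       # class size
    results[name] = (k, c_prime, c)
    print(name, round(k, 6), round(c_prime, 6), c)
```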
CHAPTER 3
Measures of Central Tendency
and
Measures of Location
Definition. A measure of central tendency is any single value that is used to identify the
“center” of the data or the typical value. It is often referred to as the average.
Suppose that X is the variable of interest, and that n measurements are taken. The
notation X1, X2, . . . ,Xn will be used to represent the n observations.
Let the Greek letter $\Sigma$ (sigma) indicate the “summation of”; thus, we can write the sum of
the n observations as
$$\sum_{i=1}^{n} X_i = X_1 + X_2 + \cdots + X_n$$
The numbers 1 and n are called the lower and the upper limits of summation,
respectively.
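Example:
i    1  2  3  4
Xi   2  4  6  8
Yi   1  2  1  2
Calculate:
1. $\sum_{i=2}^{3} X_i$
2. $\sum_{i=2}^{3} (X_i + Y_i)$
3. $\sum_{i=1}^{4} X_i Y_i$
4. $\left(\sum_{i=1}^{4} X_i\right)\left(\sum_{i=1}^{4} Y_i\right)$
5. $\sum_{i=1}^{4} \frac{X_i}{Y_i}$
6. $\frac{\sum_{i=1}^{4} X_i}{\sum_{i=1}^{4} Y_i}$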
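The six sums can be checked with a few lines of Python. This is only a sketch for verifying hand computation; note that Python lists are indexed from 0, so the terms i = 2 and i = 3 correspond to the slice [1:3].

```python
X = [2, 4, 6, 8]
Y = [1, 2, 1, 2]

s1 = sum(X[1:3])                                  # sum of X_i for i = 2..3
s2 = sum(x + y for x, y in zip(X[1:3], Y[1:3]))   # sum of (X_i + Y_i) for i = 2..3
s3 = sum(x * y for x, y in zip(X, Y))             # sum of products
s4 = sum(X) * sum(Y)                              # product of sums
s5 = sum(x / y for x, y in zip(X, Y))             # sum of ratios
s6 = sum(X) / sum(Y)                              # ratio of sums

print(s1, s2, s3, s4, s5, s6)
```

Items 3 and 4, and items 5 and 6, give different results, which shows that the sum of products is not the product of sums and the sum of ratios is not the ratio of sums.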
Some Results on Summation
1. The summation of the sum (or difference) of variables is the sum (or difference) of
their summations. That is, $\sum_{i=1}^{n} (X_i \pm Y_i) = \sum_{i=1}^{n} X_i \pm \sum_{i=1}^{n} Y_i$
2. If c is a constant, then $\sum_{i=1}^{n} cX_i = c \sum_{i=1}^{n} X_i$
3. If c is a constant, then $\sum_{i=1}^{n} c = nc$
The population mean for a population with N elements, denoted by the Greek letter μ
(mu), is computed as $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$
The sample mean $\bar{X}$ (X bar) of n observations is computed as $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
The sample mean (a statistic) is an estimate of the unknown population mean (a
parameter).
Examples:
1. The number of employees at 5 different drug stores are 10, 12, 6, 8, and 4. Find the
mean number of employees for the 5 stores.
2. Scores in the Statistics 101 first exam for a sample of 10 students are as follows: 60,
55, 30, 90, 88, 79, 45, 66, 93, and 80. Find the mean.
3. Refer to the example on the final grades of 110 Statistics 101 students. The sample
mean is 74.10909091
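Examples 1 and 2 can be verified with Python's standard statistics module. This is a minimal sketch; the variable names are illustrative.

```python
from statistics import mean

employees = [10, 12, 6, 8, 4]                        # Example 1
scores = [60, 55, 30, 90, 88, 79, 45, 66, 93, 80]    # Example 2

mean_employees = mean(employees)   # 40 / 5
mean_scores = mean(scores)         # 686 / 10

print(mean_employees, mean_scores)
```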
Definition. The weighted mean is a modification of the usual mean that assigns weights
(or measures of relative importance) to the observations to be averaged. If each
observation Xi is assigned a weight Wi, the weighted mean is given by
$$\bar{X} = \frac{\sum_{i=1}^{n} W_i X_i}{\sum_{i=1}^{n} W_i}$$
Examples:
1. Suppose a teacher assigns the following weights to the various course requirements:
Assignment 15%
Project 25%
Midterm Exam 20%
Final Exam 40%
The maximum score a student may obtain for each component is 100. Jeffry obtains
marks of 83 for assignments, 72 for the project, 41 for the midterm exam, and 47 for the
final exam. Find his mean mark (or final grade) for the course.
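The weighted mean for Jeffry's marks can be sketched in Python as follows. The dictionaries simply restate the weights and marks from the example; their names are illustrative.

```python
# Weights (in percent) and Jeffry's marks for each course requirement
weights = {"Assignment": 15, "Project": 25, "Midterm Exam": 20, "Final Exam": 40}
marks = {"Assignment": 83, "Project": 72, "Midterm Exam": 41, "Final Exam": 47}

# Weighted mean = sum of (weight x mark) divided by the sum of the weights
weighted_mean = sum(weights[k] * marks[k] for k in weights) / sum(weights.values())
print(weighted_mean)
```

Because the midterm and final exams carry 60% of the weight, the final grade lies much closer to those two marks than the unweighted average of the four marks would.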
Characteristics of the Mean
1. It is the most familiar measure used, and it employs all available information.
2. It is affected by the value of every observation. In particular it is strongly influenced by
extreme values.
3. Since the mean is a calculated number, it may not be an actual number in the data set.
4. Suppose the true average fuel efficiency for all 600,000 cars of a certain type under
specified conditions is µ = 27.5 mi/gal. A sample of n = 5 cars might yield
efficiencies 27.3, 26.2, 28.4, 27.9, 26.5, for which we obtain $\bar{x}$ = 27.26. However, a
second sample might give $\bar{x}$ = 28.52, a third $\bar{x}$ = 26.85, and so on. The value of $\bar{x}$ varies
from sample to sample whereas there is just one value for µ. Later on we shall see how
the value of $\bar{x}$ from a particular sample can be used to draw various conclusions about
the value of µ.
5. It possesses two mathematical properties that will prove to be important in subsequent
analyses.
i) The sum of the deviations of the values from the mean is zero.
ii) The sum of the squared deviations is minimum when the deviations are taken
from the mean. Hint: show that
$$\sum_{i=1}^{n} (X_i - c)^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 + n(\bar{X} - c)^2$$
6. a. If a constant c is added (subtracted) to all observations, the mean of the new
observations will increase (decrease) by the same amount c.
b. If all observations are multiplied or divided by a constant, the new observations
will have a mean that is the same constant multiple of the original mean.
Example: Given 5 temperature readings measured in Fahrenheit: 98, 100, 107, 90, 92. If
the mean temperature is $\bar{X}_F = 97.4$, find the mean temperature in centigrade if
$C = \frac{5}{9}(F - 32)$
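Property 6 implies that the mean in centigrade can be obtained by converting the Fahrenheit mean directly. The following Python sketch checks this by computing the centigrade mean both ways; the names are illustrative.

```python
from statistics import mean

fahrenheit = [98, 100, 107, 90, 92]

# Way 1: convert every reading, then average
celsius = [(5 / 9) * (f - 32) for f in fahrenheit]
mean_c_direct = mean(celsius)

# Way 2: average first, then convert the mean (property 6)
mean_f = mean(fahrenheit)
mean_c_from_mean = (5 / 9) * (mean_f - 32)

print(mean_f, mean_c_direct, mean_c_from_mean)
```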
Approximating the Mean from a Frequency Distribution
- possible only when the class mark can be assumed to be representative of all the values
in that class. If the assumption holds, the following equation may be used to approximate
the mean from a frequency distribution.
$$\bar{X} = \frac{\sum_{i=1}^{k} f_i CM_i}{\sum_{i=1}^{k} f_i} = \frac{\sum_{i=1}^{k} f_i CM_i}{n}$$
where fi = the frequency of the ith class
CMi = the class mark of the ith class
k = total number of classes
n = total number of observations
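Example (final grades of 110 Statistics 101 students):
$$\bar{x} = \frac{\sum_{i=1}^{k} f_i CM_i}{\sum_{i=1}^{k} f_i} = \frac{8145}{110} = 74.045455$$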
Remarks:
1. The formula for approximating the mean cannot be used if a frequency distribution has
open-ended intervals, unless there are reasonably accurate estimates of the class marks
for the open intervals.
2. The mean of a frequency distribution is simply a weighted mean of the class marks,
where the fi’s are the weights.
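The approximation can be sketched in Python using the frequency distribution of the 110 final grades (classes 50-54 through 95-99). The class marks are taken as the midpoints of the class limits; the variable names are illustrative.

```python
# Frequencies and class marks for the classes 50-54, 55-59, ..., 95-99
freq = [10, 3, 8, 13, 17, 19, 22, 13, 4, 1]
class_marks = [52, 57, 62, 67, 72, 77, 82, 87, 92, 97]

n = sum(freq)                                            # total number of observations
total = sum(f * cm for f, cm in zip(freq, class_marks))  # sum of f_i * CM_i
grouped_mean = total / n

print(n, total, grouped_mean)
```

The result is close to, but not equal to, the mean computed from the raw data (74.10909091), because every value in a class is replaced by its class mark.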
3.3 THE MEDIAN
- the positional middle of the arrayed data
- in an array, one-half of the values precede the median and one-half follow it
The first step in calculating the median, denoted as Md, is to arrange the data in an array.
Let X(i) be the ith observation in the array, i = 1, 2, . . . , n.
If n is odd, the median position equals (n+1)/2, and the value of the (n+1)/2 th
observation in the array is taken as the median, i.e.,
$$Md = X_{\left(\frac{n+1}{2}\right)}$$
If n is even, the mean of the two middle values in the array is the median, i.e.,
$$Md = \frac{1}{2}\left[ X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)} \right]$$
Examples:
1. Given the following heights (in inches): 71, 72, 75, 75, and 67. Find the median
height.
2. Given the following scores: 1, 7, 3, 3, 6, 5, 4, 3, find the median of the scores.
3. Refer to the example on the grades of 110 Statistics 101 students. Find the median.
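The two cases (n odd, n even) can be sketched as a small Python function and applied to Examples 1 and 2. The function name is illustrative; Python's built-in statistics.median would give the same results.

```python
def median(values):
    """Positional middle of the arrayed data."""
    arr = sorted(values)          # step 1: arrange the data in an array
    n = len(arr)
    mid = n // 2
    if n % 2 == 1:
        return arr[mid]           # X((n+1)/2) in 1-based notation
    return (arr[mid - 1] + arr[mid]) / 2   # mean of X(n/2) and X(n/2 + 1)

heights = [71, 72, 75, 75, 67]           # Example 1, n = 5 (odd)
scores = [1, 7, 3, 3, 6, 5, 4, 3]        # Example 2, n = 8 (even)

print(median(heights), median(scores))
```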
Characteristics of the Median:
1. The median is a positional measure.
2. The median is affected by the position of each item in the series but not by the value of
each item. This means that extreme values affect the median less than the arithmetic
mean.
Approximating the Median from the Frequency Distribution
$$Md = LCB_{Md} + c\left(\frac{\frac{n}{2} - {<}CF_{Md-1}}{f_{Md}}\right)$$
where LCBMd = the lower class boundary of the median class
c = class size of the median class
n = the total number of observations in the distribution
<CFMd-1 = less than cumulative frequency of the class preceding the median class
fMd = frequency of the median class
Example:
Refer to the example on the final grades of 110 Statistics 101 students.
$$Md = LCB_{Md} + c\left(\frac{\frac{n}{2} - {<}CF_{Md-1}}{f_{Md}}\right) = 74.5 + 5\left(\frac{\frac{110}{2} - 51}{19}\right) = 75.552632$$
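The same computation can be sketched in Python. The loop locates the median class as the first class whose less-than cumulative frequency reaches n/2, then applies the formula; the variable names are illustrative.

```python
# Lower class boundaries and frequencies for the 110 final grades
lcb = [49.5 + 5 * i for i in range(10)]          # 49.5, 54.5, ..., 94.5
freq = [10, 3, 8, 13, 17, 19, 22, 13, 4, 1]
c = 5                                            # class size
n = sum(freq)

cum_before = 0                                   # <CF of the preceding class
for lower, f in zip(lcb, freq):
    if cum_before + f >= n / 2:                  # this is the median class
        md = lower + c * (n / 2 - cum_before) / f
        break
    cum_before += f

print(lower, cum_before, f, md)
```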
The mode is determined by counting the frequency of each value and finding the value
with the highest frequency of occurrence.
Examples:
1. 2, 5, 2, 3, 5, 2, 1, 4, 2, 2, 2, 1, 2, 2, 2, 3, 2, 2, 2, 2
2. 2, 5, 5, 2, 2, 5, 1, 3, 5, 4, 2, 5, 5, 2, 2, 5, 5, 2, 2, 1
3. 1, 2, 3, 3, 2, 1, 2, 3, 1, 4, 4, 5, 5, 1, 2, 3, 4, 5, 4, 5
4. Refer to the example on the final grades of 110 Statistics 101 students. Find the mode.
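Examples 1 to 3 can be checked with statistics.multimode (available in Python 3.8 and later), which returns every value that attains the highest frequency. This is a minimal sketch; the variable names are illustrative.

```python
from statistics import multimode

data1 = [2, 5, 2, 3, 5, 2, 1, 4, 2, 2, 2, 1, 2, 2, 2, 3, 2, 2, 2, 2]
data2 = [2, 5, 5, 2, 2, 5, 1, 3, 5, 4, 2, 5, 5, 2, 2, 5, 5, 2, 2, 1]
data3 = [1, 2, 3, 3, 2, 1, 2, 3, 1, 4, 4, 5, 5, 1, 2, 3, 4, 5, 4, 5]

modes1 = multimode(data1)   # one mode
modes2 = multimode(data2)   # two values tie for the highest frequency
modes3 = multimode(data3)   # every value occurs equally often

print(modes1, modes2, modes3)
```

In the third data set every value has the same frequency, so the function returns all of them; such a data set is usually described as having no mode.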
Approximating the Mode from the Frequency Distribution
Step 1: Locate the class with the highest frequency. This is the modal class
Step 2: Approximate the mode using the following formula:
$$Mo = LCB_{Mo} + c\left(\frac{f_{Mo} - f_{Mo-1}}{2f_{Mo} - f_{Mo-1} - f_{Mo+1}}\right)$$
where LCBMo = lower class boundary of the modal class
c = class size of the modal class
fMo = frequency of the modal class
fMo-1 = frequency of the class preceding the modal class
fMo+1 = frequency of the class following the modal class
Example:
Class Freq
50 – 54 10
55 – 59 3
60 – 64 8
65 – 69 13
70 – 74 17
75 – 79 19
80 – 84 22
85 – 89 13
90 – 94 4
95 – 99 1
$$Mo = LCB_{Mo} + c\left(\frac{f_{Mo} - f_{Mo-1}}{2f_{Mo} - f_{Mo-1} - f_{Mo+1}}\right) = 79.5 + 5\left(\frac{22 - 19}{2(22) - 19 - 13}\right) = 80.75$$
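Steps 1 and 2 can be sketched in Python as follows. The sketch assumes the modal class is neither the first nor the last class, so that both neighbouring frequencies exist; the variable names are illustrative.

```python
# Lower class boundaries and frequencies for the classes 50-54, ..., 95-99
lcb = [49.5 + 5 * i for i in range(10)]
freq = [10, 3, 8, 13, 17, 19, 22, 13, 4, 1]
c = 5

m = freq.index(max(freq))          # Step 1: position of the modal class
f_mo, f_prev, f_next = freq[m], freq[m - 1], freq[m + 1]

# Step 2: apply the approximation formula
mo = lcb[m] + c * (f_mo - f_prev) / (2 * f_mo - f_prev - f_next)
print(lcb[m], mo)
```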
3.5 MEASURES OF LOCATION
Definition. Quartiles are values that divide the array into 4 equal parts. Thus,
Q1, read as first quartile, is the value below which 25% of the values fall.
Q2, read as second quartile, is the value below which 50% of the values fall.
Q3, read as third quartile, is the value below which 75% of the values fall.
Lowest 25% Second lowest 25% Third lowest 25% Highest 25%
Q1 Q2 Q3
Definition. Deciles are values that divide the array into 10 equal parts. Thus,
D1, read as first decile, is the value below which 10% of the values fall.
D2, read as second decile, is the value below which 20% of the values fall.
•
•
•
D9, read as ninth decile, is the value below which 90% of the values fall.
10% 10% 10% 10% 10% 10% 10% 10% 10% 10%
D1 D2 D3 D4 D5 D6 D7 D8 D9
Definition. Percentiles are values that divide the array into 100 equal parts. Thus,
P1, read as first percentile, is the value below which 1% of the values fall.
P2, read as second percentile, is the value below which 2% of the values fall.
•
•
•
P99, read as ninety-ninth percentile, is the value below which 99% of the values fall.
1% 1% 1% 1% 1% 1% 1% 1% 1% 1%
P1 P2 P3 P4 P5 P6 P7 P8 P9
To compute the ith percentile:
$P_i$ = the value of the $\left(\frac{i(n+1)}{100}\right)$th observation in the array
Examples: Determine P69, D3, Q1 and Q3 from the data on stat 101 final grades
50 57 63 69 72 74 77 80 82 84 87
50 59 65 69 72 75 77 80 82 84 87
50 59 66 69 72 75 77 80 82 85 88
50 60 66 69 72 75 77 81 83 85 89
50 60 68 70 73 75 78 81 83 86 89
50 60 68 71 73 75 79 81 84 86 91
51 62 68 71 73 76 79 81 84 87 92
52 62 68 71 73 76 79 82 84 87 94
53 62 68 71 74 76 79 82 84 87 94
53 62 69 72 74 76 79 82 84 87 96
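The position formula can be sketched in Python for the final grades. Deciles and quartiles are obtained as percentiles (D3 = P30, Q1 = P25, Q3 = P75). When the position i(n+1)/100 is not a whole number, this sketch interpolates linearly between the two neighbouring observations; that treatment is an assumption, since the formula above does not say how a fractional position is handled.

```python
grades = [int(x) for x in """
50 57 63 69 72 74 77 80 82 84 87
50 59 65 69 72 75 77 80 82 84 87
50 59 66 69 72 75 77 80 82 85 88
50 60 66 69 72 75 77 81 83 85 89
50 60 68 70 73 75 78 81 83 86 89
50 60 68 71 73 75 79 81 84 86 91
51 62 68 71 73 76 79 81 84 87 92
52 62 68 71 73 76 79 82 84 87 94
53 62 68 71 74 76 79 82 84 87 94
53 62 69 72 74 76 79 82 84 87 96
""".split()]


def percentile(data, i):
    """Value at position i(n+1)/100 of the arrayed data (positions start at 1)."""
    arr = sorted(data)
    n = len(arr)
    pos = i * (n + 1) / 100
    k = int(pos)                # whole part of the position
    frac = pos - k              # fractional part of the position
    if k < 1:
        return arr[0]
    if k >= n:
        return arr[-1]
    # Assumption: linear interpolation between X(k) and X(k+1)
    return arr[k - 1] + frac * (arr[k] - arr[k - 1])


p69 = percentile(grades, 69)
d3 = percentile(grades, 30)
q1 = percentile(grades, 25)
q3 = percentile(grades, 75)
print(p69, d3, q1, q3)
```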
CHAPTER 4
Measures of Dispersion
and
Measures of Skewness
Measures of absolute dispersion are expressed in the units of the original observations.
They cannot be used to compare variations of two data sets when the averages of these
data sets differ a lot in value or when the observations differ in units of measurement.
• The Range
Definition. The range of a set of measurements is the difference between the largest and
the smallest values.
Examples:
1. The IQ’s of 5 members of a certain family are 108, 112, 127, 116, and 113. Find the
range.
2. Refer to the example on the final grade of 110 Statistics 101 students. The range is 96
– 50 = 46.
Approximating the range from the frequency distribution table, we get 99 – 50 = 49.
• The Standard Deviation and the Variance
Definition. For a population of size N, the population variance is
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
and the population standard deviation is
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
Definition. For a sample of size n, the sample variance is
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$$
and the sample standard deviation is
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$$
Remarks:
1. The standard deviation is the most frequently used measure of dispersion and is
interpreted as the average distance of the data values from their mean.
2. The variance is not a measure of absolute dispersion. It is not expressed in the same
units as the original observations.
Examples:
1. The following scores were given by 6 judges for a gymnast’s performance in the vault:
7, 5, 9, 7, 8, and 6. Find the standard deviation.
3. Refer to the example on the final grades of 110 Statistics 101 students. With
$\sum X_i = 8152$ and $\sum X_i^2 = 617936$, the sample standard deviation is s ≈ 11.2514
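Example 1 can be checked in Python in two ways: with statistics.stdev, which uses the n - 1 divisor of the sample definition, and with the shortcut form $S^2 = [n\sum X_i^2 - (\sum X_i)^2] / [n(n-1)]$. This is a sketch with illustrative names.

```python
import math
from statistics import stdev

scores = [7, 5, 9, 7, 8, 6]      # judges' scores for the vault
n = len(scores)

# Definition: square root of the sum of squared deviations over n - 1
s_definition = stdev(scores)

# Shortcut form based on the sum and the sum of squares
sum_x = sum(scores)
sum_x2 = sum(x * x for x in scores)
s_shortcut = math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

print(s_definition, s_shortcut)
```

Both routes give the same value, the square root of 2, since the squared deviations from the mean of 7 add up to 10 and n - 1 = 5.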
A recent survey showed that customers of the US Postal Service were interested in more
consistency in the time it takes to make a delivery. Under the old conditions, a local
letter might take only one day to deliver, or it might take several. “Just tell me how many
days ahead I need to mail the birthday card to Mom so it gets there on her birthday, not
early, not late,” was a common complaint. The level of consistency is measured by the
standard deviation of the delivery times. A smaller standard deviation indicates more
consistency.
Computational formula:
$$S^2 = \frac{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}{n(n-1)}$$
Example: for the final grades of 110 Statistics 101 students, $\sum_{i=1}^{110} X_i^2 = 617936$ and
$\sum_{i=1}^{110} X_i = 8152$
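For data grouped in a frequency distribution, the standard deviation is approximated by
$$S = \sqrt{\frac{\sum_{i=1}^{k} f_i (CM_i - \bar{X})^2}{n-1}} = \sqrt{\frac{n\sum_{i=1}^{k} f_i CM_i^2 - \left(\sum_{i=1}^{k} f_i CM_i\right)^2}{n(n-1)}}$$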
Example:
Class Freq CMi fiCMi fiCMi2
50 – 54 10 52 520 27040
55 – 59 3 57 171 9747
60 – 64 8 62 496 30752
65 – 69 13 67 871 58357
70 – 74 17 72 1224 88128
75 – 79 19 77 1463 112651
80 – 84 22 82 1804 147928
85 – 89 13 87 1131 98397
90 – 94 4 92 368 33856
95 – 99 1 97 97 9409
Total 110 8145 616265
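The column totals in the table can be reproduced, and the grouped standard deviation evaluated, with the following Python sketch. It applies the shortcut form with class marks in place of the raw values; the variable names are illustrative.

```python
import math

freq = [10, 3, 8, 13, 17, 19, 22, 13, 4, 1]
class_marks = [52, 57, 62, 67, 72, 77, 82, 87, 92, 97]

n = sum(freq)
sum_fcm = sum(f * cm for f, cm in zip(freq, class_marks))        # total of the fiCMi column
sum_fcm2 = sum(f * cm ** 2 for f, cm in zip(freq, class_marks))  # total of the fiCMi2 column

s_grouped = math.sqrt((n * sum_fcm2 - sum_fcm ** 2) / (n * (n - 1)))
print(n, sum_fcm, sum_fcm2, round(s_grouped, 4))
```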
Substituting the totals from the table,
$$S = \sqrt{\frac{110(616265) - (8145)^2}{110(109)}} \approx 10.99$$
Measures of relative dispersion are unitless and are used when one wishes to compare
the scatter of one distribution with another distribution.
Examples.
1. The foreign exchange rate is an indicator of the stability of the peso and is also an
indicator of economic performance. In 1992 Bangko Sentral ng Pilipinas put the peso on
a floating rate basis. Market forces and not government policy have determined the level
of the peso since. Government intervenes through the BSP, only when there are
speculative elements in the market. Given below are the means and standard deviations of
the quarterly P-$ exchange rate for the periods 1989 to 1991 and 1992 to 1994. Which of
the two periods is more stable?
Mean s.d.
1989-1991 22.4 1.84
1992-1994 26.4 1.15
2. Two of the quality criteria in processing butter cookies are the weight and color
development in the final stages of oven browning. Individual pieces of cookies are
scanned by a spectrophotometer calibrated to reflect yellow-brown light. The readout is
expressed in per cent of a standard yellow-brown reference plate and a value of 41 is
considered optimal (golden-yellow). The cookies were also weighed in grams at this
stage. The means and standard deviations of 30 sample cookies are presented below.
Which of the two quality criteria is more varied?
Mean s.d.
Color 41.1 10
Weight 17.7 3.2
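A common unitless measure for comparisons like these is the coefficient of variation, CV = (standard deviation / mean) x 100%. The formula is not stated in this section, so treating it as the intended measure is an assumption. A minimal Python sketch for the two examples:

```python
def cv(sd, mean):
    """Coefficient of variation in percent (assumed measure of relative dispersion)."""
    return sd / mean * 100

# Example 1: quarterly peso-dollar exchange rate
cv_1989_1991 = cv(1.84, 22.4)
cv_1992_1994 = cv(1.15, 26.4)

# Example 2: butter cookies
cv_color = cv(10, 41.1)
cv_weight = cv(3.2, 17.7)

print(round(cv_1989_1991, 2), round(cv_1992_1994, 2))
print(round(cv_color, 2), round(cv_weight, 2))
```

Under this measure the smaller CV indicates the more stable period, and the larger CV indicates the more varied quality criterion.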
Definition. The standard score measures how many standard deviations an observation
is above or below the mean. It is computed as $Z = \frac{X - \mu}{\sigma}$ and the sample counterpart is
$Z = \frac{X - \bar{X}}{S}$
Remarks:
1. The standard score is not a measure of relative dispersion per se but is somewhat
related.
2. It is useful for comparing two values from different series, especially when these two
series differ with respect to the mean or standard deviation or both, or are expressed in
different units.
Examples:
1. Robert got a grade of 75% in Stat 101 and a grade of 90% in Econ 11. The mean grade
in Stat 101 is 70% and the standard deviation is 10%, whereas in Econ 11, the mean
grade is 80% and the standard deviation is 20%. Relative to the other students, where did
he perform better?
2. In problem (1), if the mean grade in Stat 101 is 65%, in which subject did Robert
perform better?
3. Different typing skills are required for secretaries depending on whether one is
working in a law office, an accounting firm, or for a mathematical research group at a
major university. In order to evaluate candidates for these positions, an agency
administers 3 distinct standardized typing samples. A time penalty has been incorporated
into the scoring of each sample based on the number of typing errors. The mean and
standard deviation for each test, together with the scores achieved by Nancy, an
applicant, are given in the following table. Where do you think should Nancy be placed?
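Examples 1 and 2 can be worked with a small Python helper for the standard score. The function name is illustrative.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Example 1: Robert's grades relative to each class
z_stat = z_score(75, 70, 10)
z_econ = z_score(90, 80, 20)

# Example 2: same grades, but the Stat 101 mean is 65%
z_stat_2 = z_score(75, 65, 10)

print(z_stat, z_econ, z_stat_2)
```

In Example 1 the two standard scores are equal, so Robert's relative standing is the same in both subjects even though his Econ 11 grade is higher; in Example 2 his Stat 101 score is the larger one.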
Stockbrokers have a problem when they are considering two investments where the mean
rate of return is the same. They usually calculate the standard deviation of the rates of
return to assess the risk associated with the two investments. The investment with the
larger standard deviation is considered to have the greater risk.
In the 1941 Major League Baseball season, Ted Williams batted .406, and nobody has hit over
.400 since. The highest batting average in recent times was Tony Gwynn's .394 in
1994. It is interesting to note that the mean batting average for all players has remained at
about .260 for 100 years. The standard deviation of that average, however, has declined from .049
to .031. This indicates that there is less dispersion in the batting averages today and helps
explain why there have not been any .400 hitters in recent times.
4.3 MEASURES OF SKEWNESS
1. $Sk = \frac{\bar{X} - Mo}{S}$
2. $Sk = \frac{3(\bar{X} - Md)}{S}$
Remarks:
1. Since the mode is frequently only an approximation, formula 2 is preferred.
2. Interpretation of the measure of skewness:
Sk > 0: positively skewed since $\bar{x}$ > Md > Mo
Sk < 0: negatively skewed since $\bar{x}$ < Md < Mo
Sk = 0: symmetric since $\bar{x}$ = Md = Mo
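Formula 2 can be sketched in Python. As an illustration it is applied to the ten Statistics 101 exam scores used earlier in the chapter on the mean; choosing that data set here is only for demonstration.

```python
from statistics import mean, median, stdev

scores = [60, 55, 30, 90, 88, 79, 45, 66, 93, 80]

x_bar = mean(scores)
md = median(scores)
s = stdev(scores)                 # sample standard deviation (n - 1 divisor)

sk = 3 * (x_bar - md) / s         # formula 2
print(x_bar, md, round(s, 4), round(sk, 4))
```

The mean is below the median here, so Sk is negative and the scores are negatively skewed, pulled down by the low score of 30.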
4.4 THE BOXPLOT
Definition. The boxplot is a graph that is very useful for displaying the following
features of the data:
• location
• spread
• symmetry
• extremes
• outliers
Remarks:
1. The height of the rectangle is usually arbitrary and has no specific meaning. If several
boxplots appear together, however, the height is sometimes made proportional to the
different sample sizes.
2. If the outlying observation is less than Q1 - 3 IQR or greater than Q3 + 3 IQR, it is
identified with a circle at its actual location. Such an observation is called a far outlier.
Examples:
9 8 3 0 1
9 9 6 6 2 2 1 1 0 0 0 1 0 4 5 8
2 0 1 2 2 2 3 4 4 5 8
0 3
Set A: Q1 = 15, Q3 = 24, Md = 22, IQR = 9, FL = 1.5, FU = 37.5
[Figure: boxplot; axis from 0 to 30]
Set B: Q1 = 10, Q3 = 16, Md = 12, IQR = 6, FL = 1, FU = 25
[Figure: boxplot; axis from 50 to 100]
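The fence values in the examples are consistent with FL = Q1 - 1.5 IQR and FU = Q3 + 1.5 IQR; that reading is inferred from the numbers shown rather than stated in the remarks. The following Python sketch computes these fences together with the 3 IQR limits used for far outliers; the function name is illustrative.

```python
def fences(q1, q3):
    """Return ((FL, FU), (far lower limit, far upper limit)) for a boxplot."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # assumed definition of FL and FU
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)       # beyond these, a far outlier
    return inner, outer

inner_a, outer_a = fences(15, 24)   # Set A
inner_b, outer_b = fences(10, 16)   # Set B

print(inner_a, outer_a)
print(inner_b, outer_b)
```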
CHAPTER 5
Probability
Definition of Terms
1. Random experiment - any process of generating a set of data or observations that
can be repeated under basically the same conditions and that leads to well-defined
outcomes
2. Sample space - set of all possible outcomes of an experiment, usually denoted by S
3. Sample point - an element of the sample space, an outcome
4. Event - any subset of the sample space, usually denoted by capital letters
5. Null space/Empty space - a subset of the sample space that contains no elements,
denoted by the symbol $\emptyset$
6. Simple event - an event which contains only one element of the sample space
7. Compound event - an event that can be expressed as the union of simple events,
thus containing more than one sample point
8. Mutually exclusive events - two events A and B are mutually exclusive if
$A \cap B = \emptyset$; that is, A and B have no elements in common
Remarks:
• An event is said to have occurred if the outcome of the experiment is one of the
sample points in the event.
• The empty space can be viewed as an event that will never happen. It is called the
impossible event.
• The sample space S, as an event, always occurs, and is referred to as the certain or
sure event.
5.2 THE PROBABILITY CONCEPT AND SOME PROPERTIES
Defn The probability of an event A, denoted by P(A), is the sum of the probabilities
of mutually exclusive outcomes that constitute the event. It must satisfy the
following properties:
• 0 ≤ P(A) ≤ 1 for any event A
• P(S) = 1 where S is the sample space
• $P(\emptyset) = 0$
The French naturalist Count Buffon (1707-1788) tossed a coin 4040 times.
Result: 2048 heads, or proportion 2048/4040 = .5069 for heads.
Around 1900, the English statistician Karl Pearson heroically tossed a coin 24,000
times. Result: 12,012 heads, a proportion of .5005.
While imprisoned by the Germans during World War II, the South African
statistician John Kerrich tossed a coin 10,000 times. Result: 5067 heads,
proportion of heads .5067.
The late astronomer Carl Sagan believed that the probability of a major asteroid
hitting the Earth soon is high enough to be of concern. “The probability that the
Earth will be hit by a civilization-threatening small world in the next century is a
little less than one in a thousand.” To arrive at that probability, Sagan obviously
could not use the long-run frequency definition of probability. He would have to
use his own knowledge of astronomy, combined with past asteroid behavior.
Examples:
Rules of Counting
Theorem If an operation can be performed in n1 ways, and for each of these a second
operation can be performed in n2 ways, then the two operations can be
performed in n1n2 ways.
Example How many sample points are there in the sample space when a pair of
balanced dice is thrown once?
Without considering strategy in a game of chess, there are 400 ways of playing the first
round of moves.
Examples:
1. How many even three-digit numbers can be formed from the digits 1, 2, 5, 6, and
9 if each digit can be used only once?
2. How many ways can a 10-question true-false examination be answered?
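For small problems, the multiplication rule can be checked by listing the outcomes. The following Python sketch enumerates the dice sample space and the even three-digit numbers, and applies the rule directly to the true-false examination; the variable names are illustrative.

```python
from itertools import permutations, product

# Pair of dice: 6 ways for the first die, 6 for the second
dice = list(product(range(1, 7), repeat=2))

# Even three-digit numbers from 1, 2, 5, 6, 9 with no digit repeated:
# list all ordered selections of 3 digits and keep those ending in an even digit
even_numbers = [p for p in permutations([1, 2, 5, 6, 9], 3) if p[-1] % 2 == 0]

# 10-question true-false examination: 2 ways for each of the 10 questions
answer_sheets = 2 ** 10

print(len(dice), len(even_numbers), answer_sheets)
```

The count for the even numbers agrees with the rule applied by hand: 2 choices for the last digit, then 4 for the first and 3 for the middle digit.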
Theorems on Probabilities of Events
Thm1 $P(A \cap B^c) = P(A) - P(A \cap B)$
$P(B \cap A^c) = P(B) - P(A \cap B)$
$P[(A \cup B)^c] = P(A^c \cap B^c)$
Examples:
1. The probability that a student passes Statistics is 2/3, and the probability that he
passes English is 4/9. If the probability of passing at least one of the two courses
is 4/5, what is the probability that he will pass both courses? fail both courses?
2. What is the probability of getting a total of 7 or 11 when a pair of dice is tossed?
3. In the toss of a fair coin 4 times, what is the probability of no head in the toss? At
least one head?
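The three examples can be worked in Python with exact fractions. Example 1 uses the additive rule $P(A \cup B) = P(A) + P(B) - P(A \cap B)$, a standard result that is not restated in the theorems above, together with the complement rule. Examples 2 and 3 enumerate equally likely outcomes. The variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Example 1: passing Statistics (S) and English (E)
p_s, p_e, p_union = Fraction(2, 3), Fraction(4, 9), Fraction(4, 5)
p_both = p_s + p_e - p_union       # additive rule solved for the intersection
p_fail_both = 1 - p_union          # complement of passing at least one course

# Example 2: total of 7 or 11 with a pair of dice
rolls = list(product(range(1, 7), repeat=2))
p_7_or_11 = Fraction(sum(a + b in (7, 11) for a, b in rolls), len(rolls))

# Example 3: a fair coin tossed 4 times
tosses = list(product("HT", repeat=4))
p_no_head = Fraction(sum("H" not in t for t in tosses), len(tosses))
p_at_least_one_head = 1 - p_no_head

print(p_both, p_fail_both, p_7_or_11, p_no_head, p_at_least_one_head)
```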
Exercises: pp. 95-97 of Walpole nos. 1-20
1. Find the errors in each of the following statements:
a. The probability that it will rain tomorrow is 0.40 and the probability that it
will not rain tomorrow is 0.52.
b. The probabilities that a printer will make 0, 1, 2, 3, or 4 or more mistakes
in printing a document are, respectively, 0.19, 0.34, -0.25, 0.43, and 0.29.
c. The probabilities that an automobile salesperson will sell 0, 1, 2, or 3 cars
on any given day in February are, respectively, 0.19, 0.38, 0.29, and 0.15.
d. On a single draw from a deck of playing cards the probability of selecting
a heart is 1/4, the probability of selecting a black card is 1/2, and the
probability of selecting both a heart and a black card is 1/8.
2. An experiment involves tossing a pair of dice. Find the probability of event
a. A = sum is greater than 8
b. C = a number greater than 4 comes up on one die.
c. $A \cap C$
3. Three men are seeking public office. Candidates A and B are given about the
same chance of winning, but candidate C is given twice the chance of either A or
B. What is the probability that C wins? A does not win?
4. A box contains 500 envelopes of which 75 contain $100 in cash, 150 contain $25,
and 275 contain $10. An envelope may be purchased for $25. Find the
probability that the first envelope purchased contains less than $100.
5. A 5-sided die with sides numbered 1, 2, 3, 4, and 5 is constructed so that the 1 and
5 occur twice as often as the 2 and 4, which occur three times as often as the 3.
What is the probability that a perfect square occurs when this die is tossed once?
6. If A and B are mutually exclusive events and P(A) = .3 and P(B) = .5, find
a. $P(A \cup B)$
b. $P(A')$
c. $P(A' \cap B)$
7. If A, B, and C are mutually exclusive events and P(A) = .2, P(B) = .3 and P(C) =
.2, find
a. $P(A \cup B \cup C)$
b. $P[A' \cap (B \cup C)]$
c. P(B C’)’
8. If a letter is chosen at random from the English alphabet, find the probability that
the letter
(a) is a vowel
(b) precedes the letter j
(c) follows the letter g.
9. If a permutation (rearrangement of the letters) of the word “white” is selected at
random, find the probability that the permutation
(a) begins with a consonant
(b) ends with a vowel
(c) has the consonants and vowels alternating.
10. If each coded item in a catalog begins with 3 distinct letters followed by 4 distinct
nonzero digits, find the probability of randomly selecting one of these coded
items with the first letter a vowel and the last digit even.
11. A pair of dice is thrown. Find the probability of getting (a) a total of 8; and (b) at
most a total of 5.
12. Two cards are drawn in succession from a deck without replacement. What is the
probability that both cards are greater than 2 and less than 8?
13. If 3 books are picked at random from a shelf containing 5 novels, 3 books of
poems, and a dictionary, what is the probability that (a) the dictionary is selected;
and (b) 2 novels and 1 book of poems are selected?
14. In a poker hand consisting of 5 cards, find the probability of holding (a) 3 aces;
and (b) 4 hearts and 1 club
15. In a game of Yahtzee, where 5 dice are tossed simultaneously, find the probability
of getting four of a kind.
16. In a college graduating class of 100 students, 54 studied mathematics, 69 studied
history, and 35 studied both mathematics and history. If one of these students is
selected at random, find the probability that the student
(a) takes mathematics or history
(b) does not take either of these subjects
(c) takes history but not mathematics.
17. Suppose that in a senior college class of 500 students it is found that 210 smoke,
258 drink alcoholic beverages, 216 eat between meals, 122 smoke and drink
alcoholic beverages, 83 eat between meals and drink alcoholic beverages, 97
smoke and eat between meals, and 52 engage in all three of these bad health
practices. If a member of this senior class is selected at random, find the
probability that the student
(a) smokes but does not drink alcoholic beverages
(b) eats between meals and drinks alcoholic beverages but does not smoke
(c) neither smokes nor eats between meals.
18. The probability that an American industry will locate in Munich is .7, the
probability that it will locate in Brussels is .4, and the probability that it will
locate in either Munich or Brussels or both is .8. What is the probability that the
industry will locate in
(a) both cities
(b) neither city?
19. From past experiences a stockbroker believes that under present economic
conditions a customer will invest in tax-free bonds with a probability of .6, will
invest in mutual funds with a probability of .3, and will invest in both tax-free
bonds and mutual funds with a probability of .15. At this time, find the
probability that a customer will invest in
(a) either tax-free bonds or mutual funds
(b) neither tax-free bonds nor mutual funds.
20. In a certain federal prison it is known that 2/3 of the inmates are under 25 years of
age. It is also known that 3/5 of the inmates are male and that 5/8 of the inmates
are female or 25 years of age or older. What is the probability that a prisoner
selected at random from this prison is female and at least 25 years old?
Defn The probability of an event B occurring when it is known that some event A has
occurred is called a conditional probability. It is defined as
$$ P(B \mid A) = \frac{P(A \cap B)}{P(A)}, \quad \text{if } P(A) > 0 $$
Examples:
1. A random sample of 200 adults is classified below according to sex and the level
of education attained. If a person is picked at random from this group, find the
probability that the person
a. is a male, given that the person has a secondary education.
b. does not have a college degree, given that the person is a female.
Education     Male   Female
Elementary     38      45
Secondary      28      50
College        22      17
2. The probability that a regularly scheduled flight departs on time is .83, the
probability that it arrives on time is .92, and the probability that it departs and
arrives on time is .78. Find the probability that a plane (a) arrives on time given
that it departed on time, and (b) departed on time given that it has arrived on time.
3. Suppose there has been a crime and it is known that the criminal is a person
within a population of 6,000,000. Further, suppose it is known that in this
population only about one person in a million has a DNA type that matches the
DNA found at the crime scene, so let’s assume that there are six people in the
population with this DNA type. Someone in custody has this DNA type. We
know the person’s DNA matches, but what is the probability that he is actually
innocent?
Define A = DNA of randomly chosen person matches DNA at the crime scene
B = person selected is innocent of the crime
A ∩ B = event that the selected person is innocent and the DNA matches
So that
$$ P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{5/6{,}000{,}000}{6/6{,}000{,}000} = \frac{5}{6} $$
and
$$ P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{5/6{,}000{,}000}{5{,}999{,}999/6{,}000{,}000} = \frac{5}{5{,}999{,}999}. $$
If you were the jury, it would be important to realize that without additional
evidence, the probability that this person is innocent is 5/6, even though the DNA
matches. The prosecutor surely would emphasize the other conditional
probability.
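The two conditional probabilities in the DNA example can be checked with exact fractions. This is a minimal sketch that assumes, as the example does, exactly six matching people of whom one is the criminal; the variable names are illustrative and not part of the notes.

```python
from fractions import Fraction

population = 6_000_000           # people who could have committed the crime
matches = 6                      # people whose DNA matches (assumed in the example)
innocent_matches = matches - 1   # every matching person except the criminal

p_a = Fraction(matches, population)                  # P(A): DNA matches
p_b = Fraction(population - 1, population)           # P(B): person is innocent
p_a_and_b = Fraction(innocent_matches, population)   # P(A ∩ B)

p_b_given_a = p_a_and_b / p_a    # P(B | A): innocent, given a match
p_a_given_b = p_a_and_b / p_b    # P(A | B): match, given innocent

print(p_b_given_a)   # 5/6
print(p_a_given_b)   # 5/5999999
```

Swapping the conditioning event changes the answer from 5/6 to roughly one in 1.2 million, which is why the two probabilities must not be confused.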
Defn Two events A and B are said to be independent if any one of the following
conditions is satisfied:
(a) P(A|B) = P(A) if P(B)>0
(b) P(B|A) = P(B) if P(A)>0
(c) P(A ∩ B) = P(A) P(B)
Otherwise, the events are said to be dependent.
Examples:
1. Consider an experiment in which 2 cards are drawn in succession from an
ordinary deck, with replacement. Define
A: the first card is an ace
B: the second card is a spade
Are A and B independent events?
[Tree diagram: the first draw branches into Ace and Ace^C; each branch then splits into Spade and Spade^C for the second draw.]
2. Consider the following events in the toss of a single die where even numbers are
twice as likely to occur as the odd numbers. Define A: Get a number greater than
3 and B: Get a perfect square. Are A and B independent events?
3. Suppose that we have a fuse box containing 20 fuses, of which 5 are defective. If
2 fuses are selected at random and removed from the box in succession without
replacing the first, what is the probability that both are defective?
4. A small town has one fire engine and one ambulance available for emergencies.
The probability that the fire engine is available when needed is .98, and the
probability that the ambulance is available when called is .92. In the event of an
injury resulting from a burning building, find the probability that both the
ambulance and the fire engine will be available.
5. Three cards are drawn in succession, without replacement, from an ordinary deck
of playing cards. Find the probability that the first card is a red ace, the second
card is a ten or jack, and the third card is greater than 3 but less than 7.
6. A coin is biased so that a head is twice as likely to occur as a tail. If the coin is
tossed 3 times, what is the probability of getting 2 tails and 1 head?
7. Assuming birth months (days) are equally likely, what is the probability that the
next two unrelated strangers you meet both share your birth month (day)?
8. Sudden infant death syndrome (SIDS) causes babies to die suddenly (often in
their cribs) with no explanation. Deaths from SIDS have been greatly reduced by
placing babies on their backs, but as yet no cause is known.
When more than one SIDS death occurs in a family, the parents are sometimes
accused. One “expert witness” popular with prosecutors in England told juries
that there is only a 1 in 73 million chance that two children in the same family
could have died naturally. Here’s his calculation: the rate of SIDS in a
nonsmoking middle-class family is 1 in 8500. So the probability of two deaths is
$$ \frac{1}{8500} \times \frac{1}{8500} = \frac{1}{72{,}250{,}000}. $$
Several women were convicted of murder on this basis,
without any direct evidence that they harmed their children.
As the Royal Statistical Society said, this reasoning is nonsense. It assumes that
SIDS deaths in the same family are independent events. The cause of SIDS is
unknown: “There may well be unknown genetic or environmental factors that
predispose families to SIDS, so that a second case in the family becomes much
more likely.” The British government decided to review the cases of 258 parents
convicted of murdering their babies.
9. Many people who come to clinics to be tested for HIV, the virus that causes
AIDS, don’t come back to learn the test results. Clinics now use “rapid HIV
tests” that give a result in a few minutes. The false positive rate for a diagnostic
test is the probability that a person with no disease will have a positive test result.
For the rapid HIV tests, the Food and Drug Administration has established 2% as
the maximum false positive rate. If a clinic uses a test that meets the FDA
standard and tests 50 people who are free of HIV antibodies, what is the
probability that at least one false-positive will occur?
P(at least one positive) = 1 – P(no positives)
= 1 – P(50 negatives)
= 1 – (1 – .02)^50 = .6358
There is approximately a 64% chance that at least one of the 50 people will test
positive for HIV, even though no one has the virus.
Concern about excessive numbers of false positives led the New York City
Department of Health and Mental Hygiene to suspend the use of one particular
rapid HIV test.
10. Only 5% of male high school basketball, baseball, and football players go on to
play at the college level. Of these, only 1.7% enter major league professional
sports. About 40% of the athletes who compete in college and then reach the pros
have a career of more than three years. Define these events: A = {competes in
college}, B = {competes professionally}, C = {pro career longer than 3 years}.
What is the probability that a high school athlete competes in college and then
goes on to have a pro career of more than three years?
We know that P(A) = .05, P(B|A) = .017, P(C|A ∩ B) = .4. The probability we
want is therefore P(A ∩ B ∩ C) = P(A) P(B|A) P(C|A ∩ B)
= .05 × .017 × .4 = .00034
Only about 3 of every 10,000 high school athletes can expect to compete in
college and have a professional career of more than three years. High school
students would be wise to concentrate on studies rather than on unrealistic hopes
of fortune from pro sports.
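The arithmetic in Examples 9 and 10 can be reproduced in a few lines. This is only a sketch: it assumes the 50 test results are independent (as the calculation in Example 9 implicitly does), and the variable names are illustrative.

```python
# Example 9: complement rule, assuming 50 independent tests
false_positive_rate = 0.02
n_tests = 50
p_no_false_positive = (1 - false_positive_rate) ** n_tests
p_at_least_one = 1 - p_no_false_positive

# Example 10: multiplication rule P(A ∩ B ∩ C) = P(A) P(B|A) P(C|A ∩ B)
p_college = 0.05             # P(A)
p_pro_given_college = 0.017  # P(B | A)
p_long_given_pro = 0.40      # P(C | A ∩ B)
p_all_three = p_college * p_pro_given_college * p_long_given_pro

print(round(p_at_least_one, 4))  # 0.6358
print(round(p_all_three, 5))     # 0.00034
```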
Exercises: pp. 105-108 of Walpole nos. 1-18
1. If R is the event that a convict committed armed robbery and D is the event that
the convict pushed dope, state in words what probabilities are expressed by
a. P(R|D)
b. P(D’|R)
c. P(R’|D’)
2. A class in advanced physics is comprised of 10 juniors, 30 seniors, and 10
graduate students. The final grades showed that 3 of the juniors, 10 seniors, and 5
graduate students received an A for the course. If a student is chosen at random
from this class and is found to have earned an A, what is the probability that he or
she is a senior?
3. Consider the event B of getting a perfect square when a die is tossed. The die is
constructed so that the even numbers are twice as likely to occur as the odd
numbers. Suppose it is known that the toss of the die resulted in A = a number
greater than 3. Find P(B|A).
4. In the senior year of a high school graduating class of 100 students, 42 studied
mathematics, 68 studied psychology, 54 studied history, 22 studied both
mathematics and history, 25 studied both mathematics and psychology, 7 studied
history but neither mathematics nor psychology, 10 studied all three subjects, and
8 did not take any of the three. If a student is selected at random, find the
probability that a person
(a) enrolled in psychology takes all three subjects
(b) not taking psychology is taking both history and mathematics.
5. A pair of dice is thrown. If it is known that one die shows a 4, what is the
probability that
(a) the other die shows a 5
(b) the total of both dice is greater than 7.
6. A card is drawn from an ordinary deck and we are told that it is red. What is the
probability that the card is greater than 2 but less than 9?
7. The probability that an automobile being filled with gasoline will also need an oil
change is .25, the probability that it needs a new oil filter is .4, and the probability
that both the oil and filter need changing is .14.
(a) If the oil had to be changed, what is the probability that a new oil filter is
needed?
(b) If a new oil filter is needed, what is the probability that the oil has to be
changed?
8. The probability that a married man watches a certain television show is .4 and the
probability that a married woman watches the show is .5. The probability that a
man watches the show, given that his wife does, is .7. Find the probability that
(a) a married couple watches the show
(b) a wife watches the show given that her husband does
(c) at least one person of a married couple will watch the show.
9. The probability that a vehicle entering the Luray Caverns has Canadian license
plates is .12, the probability that it is a camper is .28, and the probability that it is
a camper with Canadian license plates is .09. What is the probability that
(a) a camper entering the Luray Caverns has Canadian license plates?
(b) a vehicle with Canadian license plates entering the Luray Caverns is a
camper?
(c) a vehicle entering the Luray Caverns does not have Canadian license plates
or is not a camper?
10. The probability that the lady of the house is home when the Avon representative
calls is .6. Given that the lady of the house is home, the probability that she
makes a purchase is .4. Find the probability that the lady of the house is home
and makes a purchase when the Avon representative calls.
11. The probability that a doctor correctly diagnoses a particular illness is .7. Given
that the doctor makes an incorrect diagnosis, the probability that the patient enters
a lawsuit is .9. What is the probability that the doctor makes an incorrect
diagnosis and the patient sues?
12. One bag contains 4 white balls and 3 black balls, and a second bag contains 3
white balls and 5 black balls. One ball is drawn at random from the first bag and
placed unseen in the second bag. What is the probability that a ball now drawn
from the second bag is black? (Hint: Let B1, B2, and W1 represent, respectively,
the drawing of a black ball from bag 1, a black ball from bag 2, and a white ball
from bag 1. We are interested in B1 ∩ B2 and W1 ∩ B2.)
13. A real estate agent has 8 master keys to open several new homes. Only 1 master
key will open any given house. If 40% of these homes are usually left unlocked,
what is the probability that the real estate agent can get into a specific home if the
agent selects 3 master keys at random before leaving the office? (hint: Let A =
the house is open and B = the correct key is one of the 3 selected before leaving
the office. One event is A’ ∩ B.)
14. A town has 2 fire engines operating independently. The probability that a specific
fire engine is available when needed is .96. What is the probability that
(a) neither is available when needed
(b) a fire engine is available when needed?
15. If the probability that Tom will be alive in 20 years is .7 and the probability that
Nancy will be alive in 20 years is .9, what is the probability that neither will be
alive in 20 years?
16. The probability that a person visiting his dentist will have an x-ray is .6; the
probability that a person who has an x-ray will also have a cavity filled is .3; and
the probability that the person who has had an x-ray and a cavity filled will also
have a tooth extracted is .1. What is the probability that a person visiting his
dentist will have an x-ray, a cavity filled, and a tooth extracted?
17. Find the probability of randomly selecting 4 good quarts of milk in succession
from a cooler containing 20 quarts of which 5 are spoiled.
18. From a box containing 6 black balls and 4 green balls, 3 balls are drawn in
succession, each ball being replaced in the box before the next draw is made.
What is the probability that all 3 are the same color? Each color is represented?
CHAPTER 6
Probability Distributions
Remark We shall use an uppercase letter, say X, to denote a random variable and
its corresponding lowercase letter, x in this case, for one of its values.
Examples:
1. (Experiment No. 1) An experiment consists of tossing a coin 3 times and
observing the result. The possible outcomes and the values of the random
variables X and Y, where X is the number of heads and Y is the number of heads
minus the number of tails are
Sample Points x y
HHH 3 3
HHT 2 1
HTH 2 1
HTT 1 -1
THH 2 1
THT 1 -1
TTH 1 -1
TTT 0 -3
6.2 DISCRETE PROBABILITY DISTRIBUTIONS
Defn A random variable defined over a discrete sample space is called a discrete
random variable.
Defn A table or formula listing all possible values that a discrete random variable can
take on, along with the associated probabilities, is called a discrete probability
distribution.
Remark The probabilities associated with all possible values of a discrete random
variable must sum to 1.
Examples:
1. For Experiment No. 1, the discrete probability distributions of the random
variables X and Y are
x 0 1 2 3
P(X=x) 1/8 3/8 3/8 1/8
y -3 -1 1 3
P(Y=y) 1/8 3/8 3/8 1/8
2. Construct the discrete probability distribution for the random variable M defined
in Experiment No. 2.
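The distributions of X and Y in Experiment No. 1 can be built by enumerating the sample space. The sketch below assumes a fair coin, so that the 8 sample points are equally likely; the helper `distribution` is a hypothetical name introduced only for this illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 8 sample points of three tosses, e.g. "HHT"
outcomes = ["".join(t) for t in product("HT", repeat=3)]

def distribution(value_of):
    """Map each value of a random variable to its probability."""
    counts = Counter(value_of(o) for o in outcomes)
    return {v: Fraction(c, len(outcomes)) for v, c in sorted(counts.items())}

dist_x = distribution(lambda o: o.count("H"))                 # number of heads
dist_y = distribution(lambda o: o.count("H") - o.count("T"))  # heads minus tails

for value, prob in dist_x.items():
    print(value, prob)
```

The probabilities in each dictionary sum to 1, as the Remark above requires.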
6.3 EXPECTED VALUES
x x1 x2 ... xn
P(X=x) f(x1) f(x2) ... f(xn)
$$ \mu_X = E(X) = \sum_{i=1}^{n} x_i f(x_i) $$
Examples:
1. Find the mean of the random variables X and Y of Experiment No. 1.
x 0 1 2 3
P(X=x) 1/8 3/8 3/8 1/8
y -3 -1 1 3
P(Y=y) 1/8 3/8 3/8 1/8
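As a check on the hand computation, the expected value is the sum of each value times its probability. This is a small sketch using exact fractions; the function name `mean` is illustrative.

```python
from fractions import Fraction

eighth = Fraction(1, 8)
dist_x = {0: eighth, 1: 3 * eighth, 2: 3 * eighth, 3: eighth}
dist_y = {-3: eighth, -1: 3 * eighth, 1: 3 * eighth, 3: eighth}

def mean(dist):
    """E(X) = sum of x * f(x) over all values x."""
    return sum(x * p for x, p in dist.items())

mean_x = mean(dist_x)
mean_y = mean(dist_y)
print(mean_x, mean_y)   # 3/2 0
```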
Thm Let X be a discrete random variable with probability distribution
x x1 x2 ... xn
P(X=x) f(x1) f(x2) ... f(xn)
$$ E[g(X)] = \sum_{i=1}^{n} g(x_i) f(x_i) $$
$$ \sigma_X^2 = V(X) = E[(X - \mu)^2] $$
x x1 x2 ... xn
P(X=x) f(x1) f(x2) ... f(xn)
The variance of X is
$$ \sigma_X^2 = V(X) = E[(X - \mu)^2] = \sum_{i=1}^{n} (x_i - \mu)^2 f(x_i) $$
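The variance formula is the expected-value theorem applied to g(X) = (X − µ)². A minimal sketch for the random variables X and Y of Experiment No. 1 follows; the helper names `expected` and `variance` are assumptions made for this illustration.

```python
from fractions import Fraction

eighth = Fraction(1, 8)
dist_x = {0: eighth, 1: 3 * eighth, 2: 3 * eighth, 3: eighth}
dist_y = {-3: eighth, -1: 3 * eighth, 1: 3 * eighth, 3: eighth}

def expected(dist, g=lambda v: v):
    """E[g(X)] = sum of g(x) * f(x) over all values x."""
    return sum(g(v) * p for v, p in dist.items())

def variance(dist):
    """V(X) = E[(X - mu)^2], with mu = E(X)."""
    mu = expected(dist)
    return expected(dist, lambda v: (v - mu) ** 2)

var_x = variance(dist_x)
var_y = variance(dist_y)
print(var_x, var_y)   # 3/4 3
```

Since Y = 2X − 3 in this experiment, the result V(Y) = 4 V(X) is what one would expect.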
Binomial Distribution
Examples:
1. Find the probability of obtaining exactly three 2’s if an ordinary die is tossed 5
times.
2. In a certain city district the need for money to buy drugs is given as the reason for
75% of all thefts. What is the probability that exactly 2 of the next 4 theft cases
reported in this district resulted from the need for money to buy drugs?
3. The probability that a patient recovers from a rare blood disease is .4. If 15
people are known to have contracted this disease, what is the probability that (a) 5
survive; (b) 3 to 8 survive; and (c) at least 10 survive?
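The notes do not state the binomial formula at this point, so the sketch below assumes the standard binomial probability b(x; n, p) = C(n, x) p^x (1 − p)^(n − x) for x successes in n independent trials. The function name `binom_pmf` is illustrative.

```python
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Example 1: exactly three 2's in 5 tosses of a fair die
p_ex1 = binom_pmf(3, 5, 1 / 6)

# Example 2: exactly 2 of 4 thefts are drug related
p_ex2 = binom_pmf(2, 4, 0.75)

# Example 3: 15 patients, recovery probability .4
p_5 = binom_pmf(5, 15, 0.4)
p_3_to_8 = sum(binom_pmf(x, 15, 0.4) for x in range(3, 9))
p_at_least_10 = sum(binom_pmf(x, 15, 0.4) for x in range(10, 16))

print(round(p_ex1, 4), round(p_ex2, 4))
print(round(p_5, 4), round(p_3_to_8, 4), round(p_at_least_10, 4))
```

Summing the probabilities over a range of x values handles "3 to 8" and "at least 10" without needing a cumulative table.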
Exercises: pp. 165-166 of Walpole nos. 4, 6-10, 12, 13
4. A baseball player’s batting average is .250. What is the probability that he gets
exactly 1 hit in his next 5 times at bat?
6. A multiple-choice quiz has 15 questions, each with 4 possible answers of which
only 1 is the correct answer. What is the probability that sheer guesswork yields
a. exactly 10 correct answers
b. at least 1 correct answer
c. 5 to 10 correct answers.
7. The probability that a patient recovers from a delicate heart operation is .9. What
is the probability that exactly 5 of the next 7 patients having this operation
survive?
8. A study conducted at George Washington University and the National Institute of
Health examined national attitudes about tranquilizers. The study revealed that
approximately 70% believe “tranquilizers don’t really cure anything, they just
cover up the real trouble.” According to this study, what is the probability that at
least 3 of the next 5 people selected at random will be of the opinion that
tranquilizers do cure the problem rather than just cover it up?
9. A survey of the residents in a United States city showed that 20% preferred a
white telephone over any other color available. What is the probability that more
than one-half of the next 20 telephones installed in this city will be white?
10. One-fourth of the female freshmen entering a Virginia college are out-of-state
students. If the students are assigned at random to the dormitories, 3 to a room,
what is the probability that in one room at most 2 of the 3 roommates are out-of-
state students?
12. Suppose that airplane engines operate independently in flight and fail with
probability q = .2. Assuming that a plane makes a safe flight if at least one-half of
its engines run, determine whether a 4-engine plane or a 2-engine plane has the
higher probability for a successful flight.
13. Repeat Exercise 12 for q =.5 and q = 1/3.
Near the end of World War II, the Germans developed rocket bombs, which were fired at
the city of London. The Allied military command did not know whether these bombs
were fired at random or whether they had some type of aiming device. To investigate,
the city of London was divided into 576 square regions and the number of hits per region
was counted and compared with the expected number of hits under a special discrete
probability distribution. Because the actual number of hits was close to the expected
number of hits, the military command concluded that the bombs were falling at random.
The Germans had not developed a bomb with an aiming device.
6.4 Continuous Probability Distributions
Defn If a sample space contains an infinite number of possibilities equal to the number
of points on a line segment, it is called a continuous sample space.
Defn A random variable defined over a continuous sample space is called a continuous
random variable.
Defn The function with values f(x) is called a probability density function for the
continuous random variable X, if
• the total area under its curve and above the horizontal axis is equal to
1; and
• the area under the curve between any two ordinates x = a and x = b
gives the probability that X lies between a and b.
Remarks:
1. A continuous random variable has a probability of zero of assuming exactly any
of its values, that is, if X is a continuous random variable, then P(X=x) = 0 for all
real numbers x.
2. The probability density function cannot be represented in tabular form.
Example A continuous random variable X that can assume values between 0 and 2
has a density function given by
$$ f(x) = \begin{cases} .5, & \text{for } 0 \le x \le 2 \\ 0, & \text{otherwise} \end{cases} $$
Properties of the Mean and Variance
Let X and Y be random variables (discrete or continuous) and let a and b be constants.
1. E(aX ± b) = aE(X) ± b
• E(aX) = aE(X).
• E(b) = b.
2. E(aX ± bY) = aE(X) ± bE(Y)
3. E(XY) = E(X)E(Y) if X and Y are independent.
4. E[X − E(X)] = 0.
5. Var(aX ± b) = a²Var(X).
• Var(aX) = a²Var(X).
• Var(b) = 0.
6. If X and Y are independent, then Var(aX ± bY) = a²Var(X) + b²Var(Y).
Example A used car dealer finds that in any day, the probability of selling no car is
0.4, one car is 0.2, two cars is 0.15, 3 cars is 0.10, 4 cars is 0.08, five cars
is 0.06 and six cars is 0.01. Let g(X) = 500 + 1500X represent the
salesman’s daily earnings, where X is the number of cars sold. Find the
salesman’s expected daily earnings.
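The expected earnings in this example can be obtained in two ways: directly as E[g(X)], or through property 1, E(aX + b) = aE(X) + b. The sketch below does both with exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

# Probability distribution of X, the number of cars sold in a day
dist = {
    0: Fraction("0.40"), 1: Fraction("0.20"), 2: Fraction("0.15"),
    3: Fraction("0.10"), 4: Fraction("0.08"), 5: Fraction("0.06"),
    6: Fraction("0.01"),
}

def g(x):
    """Daily earnings when x cars are sold."""
    return 500 + 1500 * x

# Direct route: E[g(X)] = sum of g(x) * f(x)
expected_direct = sum(g(x) * p for x, p in dist.items())

# Linearity route: E(500 + 1500 X) = 500 + 1500 E(X)
mean_x = sum(x * p for x, p in dist.items())
expected_linear = 500 + 1500 * mean_x

print(mean_x, expected_direct, expected_linear)   # 37/25 2720 2720
```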
THE NORMAL DISTRIBUTION
$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} $$
for −∞ < x < ∞ and for constants µ and σ, where −∞ < µ < ∞, σ > 0, e ≈ 2.71828,
and π ≈ 3.14159.
The graph of the normal (or Gaussian) distribution is called the normal curve.
Example of two normal curves with µ1 ≠ µ2 and σ1 = σ2
Properties:
1. The curve is bell-shaped and symmetric about a vertical axis through the mean µ.
2. The normal curve approaches the horizontal axis asymptotically as we proceed in
either direction away from the mean.
3. The total area under the curve and above the horizontal axis is equal to 1.
Defn The distribution of a normal random variable with mean zero and standard
deviation equal to 1 is called a standard normal distribution.
Hence, whenever X is between the values x1 and x2, the random variable Z will fall
between the corresponding values
$$ z_1 = \frac{x_1 - \mu}{\sigma} \quad \text{and} \quad z_2 = \frac{x_2 - \mu}{\sigma} $$
Examples:
1. Given a normal distribution with µ= 300 and σ = 50, find the probability that X
assumes a value greater than 362.
2. Given a normal distribution with µ= 50 and σ = 10, find the probability that X
assumes a value between 45 and 62.
3. Given a normal distribution with µ= 40 and σ = 6, find the value of x that has (a)
5% of the area above it and (b) 38% of the area below it.
4. A certain type of storage battery lasts on the average 3.0 years, with a standard
deviation of .5 year. Assuming that the battery lives are normally distributed, find
the probability that a given battery will last less than 2.3 years.
5. An electrical firm manufactures light bulbs that have a length of life that is
normally distributed with mean equal to 800 hours and a standard deviation of 40
hours. Find the probability that a bulb burns between 778 and 834 hours.
6. On an examination the average grade was 74 and the standard deviation was 7. If
the grades are curved to follow a normal distribution, find D6.
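Table look-ups for problems like these can be cross-checked numerically. The sketch below standardizes with z = (x − µ)/σ and uses the identity P(Z > z) = ½ erfc(z/√2) from Python's standard `math` module; the helper names are hypothetical and not part of the notes.

```python
from math import erfc, sqrt

def upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * erfc(z / sqrt(2))

def prob_greater(x, mu, sigma):
    """P(X > x) for X normal with mean mu and standard deviation sigma."""
    return upper_tail((x - mu) / sigma)

def prob_between(a, b, mu, sigma):
    """P(a < X < b), computed as P(X > a) - P(X > b)."""
    return prob_greater(a, mu, sigma) - prob_greater(b, mu, sigma)

p1 = prob_greater(362, 300, 50)      # Example 1: z = 1.24
p2 = prob_between(45, 62, 50, 10)    # Example 2: z from -0.5 to 1.2
p4 = 1 - prob_greater(2.3, 3.0, 0.5) # Example 4: z = -1.4, lower tail

print(round(p1, 4), round(p2, 4), round(p4, 4))
```

The results should agree with the P(Z > z) table to four decimal places, up to rounding in the last digit.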
P(Z > z) where Z ~ N(0, 1)
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
-3.9 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
-3.8 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
-3.7 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
-3.6 0.9998 0.9998 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999 0.9999
-3.5 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998
-3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
-3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
-3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
-3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
-3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
-2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
-2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
-2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
-2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
-2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
-2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
-2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
-2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
-2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
-2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
-1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
-1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
-1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
-1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
-1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
-1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
-1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
-1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
-1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
-1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
-0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
-0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
-0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
-0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
-0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
-0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
-0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
-0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
-0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
-0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641
0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859
0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121
0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776
0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451
0.7 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148
0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867
0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611
1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379
1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823
1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681
1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559
1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455
1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367
1.8 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294
1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233
2.0 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183
2.1 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143
2.2 0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110
2.3 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084
2.4 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064
2.5 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048
2.6 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036
2.7 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026
2.8 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019
2.9 0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014
3.0 0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010
3.1 0.0010 0.0009 0.0009 0.0009 0.0008 0.0008 0.0008 0.0008 0.0007 0.0007
3.2 0.0007 0.0007 0.0006 0.0006 0.0006 0.0006 0.0006 0.0005 0.0005 0.0005
3.3 0.0005 0.0005 0.0005 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.0003
3.4 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0002
3.5 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002
3.6 0.0002 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
3.7 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
3.8 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
3.9 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Exercises: pp. 197-199 of Walpole nos. 1-16
1. Given a normal distribution with µ = 40 and σ = 6, find
a. the area below 32
b. the area above 27
c. the area between 42 and 51
d. the x value that has 45% of the area below it
e. the x value that has 13% of the area above it
2. Given a normal distribution with µ = 200 and σ² = 100, find
a. the area below 214
b. the area above 179
c. the area between 188 and 206
d. the x value that has 80% of the area below it
e. two x values containing the middle 75% of the area
3. Given the normally distributed random variable X with mean 18 and standard
deviation 2.5, find
(a) P(X < 15)
(b) P(17 < X < 21)
(c) the value of k such that P( X < k) = .2578
(d) the value of k such that P(X > k) = .1539.
4. A soft-drink machine is regulated so that it discharges an average of 200 ml. per
cup. If the amount of drink is normally distributed with standard deviation equal
to 15 ml.
(a) what fraction of cups will contain more than 224 ml.?
(b) what is the probability that a cup contains between 191 and 209 ml.?
(c) how many cups will likely overflow if 230-ml. cups are used in the next 1000
drinks?
(d) below what value do we get the smallest 25% of the drinks?
5. The finished inside diameter of a piston ring is normally distributed with a mean
of 10 cm. and a standard deviation of .03 cm.
(a) What proportion of rings will have inside diameters exceeding 10.075 cm.?
(b) What is the probability that a piston ring will have an inside diameter between
9.97 and 10.03 cm.?
(c) Below what value of inside diameter will 15% of piston rings fall?
6. A lawyer commutes daily from his suburban home to his midtown office. On the
average the trip one way takes 24 minutes, with a standard deviation of 3.8
minutes. Assume the distribution of trip times to be normally distributed.
(a) What is the probability that the trip will take at least ½ hour?
(b) If the office opens at 9 AM and he leaves his house at 8:45 AM, what
percentage of the time is he late for work?
(c) If he leaves the house at 8:35 AM and coffee is served at the office from 8:50
AM until 9 AM, what is the probability that he misses coffee?
(d) Find the length of time above which we find the slowest 15% of the trips
7. If a set of grades on a statistics examination are approximately normally
distributed with a mean of 74 and a standard deviation of 7.9, find
(a) the lowest passing grade if the lowest 10% of the students are given F’s
(b) the highest B if the top 5% of the students are given A’s
(c) the lowest B if the top 10% of the students are given A’s and the next 25% are
given B’s.
8. In a mathematics examination the average grade was 82 and the standard
deviation was 5. All students with grades from 88 to 94 received a grade of B. If
the grades are approximately normally distributed and 8 students received a B
grade, how many students took the examination?
9. The heights of 1000 students are normally distributed with a mean of 174.5 cm.
and a standard deviation of 6.9 cm. How many of these students would you
expect to have heights (a) less than 160.0 cm; (b) between 171.5 and 182.0 cm;
(c) equal to 175.0 cm; and (d) greater than or equal to 188.0 cm?
10. A company pays its employees an average wage of $7.25 an hour with a standard
deviation of 60 cents. If the wages are approximately normally distributed
(a) what percentage of the workers receive wages between $6.75 and $7.69 an
hour?
(b) the highest 5% of the employee hourly wages are greater than what amount?
11. The weights of a large number of miniature poodles are approximately normally
distributed with a mean of 8 kg. and a standard deviation of .9 kg. Find the
fraction of these poodles with weights
(a) over 9.5 kg.
(b) at most 8.6 kg.
(c) between 7.3 and 9.1 kg.
12. The tensile strength of a certain metal component is normally distributed with a
mean of 10,000 kg/cm² and a standard deviation of 100 kg/cm².
(a) What proportion of these components exceeds 10,150 kg/cm² in tensile
strength?
(b) If specifications require that all components have tensile strength between
9800 and 10,200 kg/cm², what proportion of pieces would you expect to scrap?
13. If a set of observations is normally distributed, what percentage of the
observations differs from the mean by
(a) more than 1.3?
(b) less than .52?
14. The IQs of 600 applicants to a certain college are approximately normally
distributed with a mean of 115 and a standard deviation of 12. If the college
requires an IQ of at least 95, how many of these students will be rejected on this
basis regardless of their other qualifications?
15. The average rainfall in Roanoke, Virginia for the month of March is 9.22 cm.
Assuming a normal distribution with a standard deviation of 2.83 cm, find the
probability that next March Roanoke receives (a) less than 1.84 cm of rain; (b)
more than 5 cm but not over 7 cm of rain; and (c) more than 13.8 cm of rain.
16. The average life of a certain type of small motor is 10 years with a standard
deviation of 2 years. The manufacturer replaces free all motors that fail while
under guarantee. If he is willing to replace only 3% of the motors that fail, how
long a guarantee should he offer? Assume that the lives of the motors follow
normal distribution.
CHAPTER 7
Sampling Distributions
Consider three observations making up the population values 0, 1, and 2 with parameters
$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = 1$ and $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{2}{3}$.
Suppose we list all possible samples of size 2, with replacement, and for each possible
sample compute the value of the sample mean, $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$:
$\mu_{\bar{X}}$ = average of the $\bar{X}$'s = 1 = µ
Thm1 If all possible random samples of size n are drawn with replacement from a finite
population of size N with mean µ and standard deviation σ, then the sample mean $\bar{X}$
will have mean $\sum x\,P(\bar{X} = x) = \mu$ and variance $E(\bar{X} - \mu)^2 = \sigma^2 / n$.
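Thm1 can be checked by brute force on the three-value population above. The following is a minimal sketch, assuming each of the 9 ordered samples is equally likely; the variable names are illustrative and not part of the notes.

```python
from itertools import product
from fractions import Fraction

population = [0, 1, 2]
n = 2

# Every ordered sample of size n drawn with replacement (9 equally likely samples).
samples = list(product(population, repeat=n))
means = [Fraction(sum(s), n) for s in samples]

# Population parameters, computed exactly with fractions.
mu = Fraction(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

# Mean and variance of the sampling distribution of the sample mean.
mean_of_means = sum(means) / len(means)
var_of_means = sum((m - mean_of_means) ** 2 for m in means) / len(means)

print(mean_of_means, mu)          # 1 1
print(var_of_means, sigma2 / n)   # 1/3 1/3
```

Exact fractions are used so that the equality with σ²/n is not blurred by floating-point rounding.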
Suppose we list all possible samples of size 2, without replacement, and for each possible
sample compute the value of the sample mean, $\bar{X}$:
$\mu_{\bar{X}}$ = average of the $\bar{X}$'s = 1 = µ
Thm2 If all possible random samples of size n are drawn without replacement from a
finite population of size N with mean µ and standard deviation σ, then the sample
mean $\bar{X}$ will have mean $\sum x\,P(\bar{X} = x) = \mu$ and variance
$E(\bar{X} - \mu)^2 = \frac{\sigma^2}{n} \cdot \frac{N - n}{N - 1}$.
The factor $\frac{N - n}{N - 1}$ is called the finite population correction factor. For large N relative
to the sample size n, this factor will be close to 1 and the variance is approximately equal
to σ²/n.
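The effect of the finite population correction factor can be seen on the same population 0, 1, 2. This is a minimal sketch, assuming the 6 ordered samples drawn without replacement are equally likely; names such as `fpc` are illustrative.

```python
from itertools import permutations
from fractions import Fraction

population = [0, 1, 2]
N, n = len(population), 2

# Ordered samples of size n drawn without replacement (6 equally likely samples).
samples = list(permutations(population, n))
means = [Fraction(sum(s), n) for s in samples]

mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N

mean_of_means = sum(means) / len(means)
var_of_means = sum((m - mean_of_means) ** 2 for m in means) / len(means)
fpc = Fraction(N - n, N - 1)   # finite population correction factor

print(mean_of_means)                    # 1
print(var_of_means, sigma2 / n * fpc)   # 1/6 1/6
```

Here N = 3 is small relative to n = 2, so the factor (1/2) halves the variance compared with sampling with replacement.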
Exercise: Compute $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}{n(n - 1)}$ for each possible
sample and determine if $\mu_{S^2}$ = average of the $S^2$'s = $\sigma^2$.
Thm3 (Central Limit Theorem) If $\bar{X}$ is the mean of a random sample of size n taken
from a (large or infinite) population with mean µ and variance σ², then the
sampling distribution of $\bar{X}$ is approximately normally distributed with mean
E($\bar{X}$) = µ and variance Var($\bar{X}$) = σ²/n when n is sufficiently large. Hence, the
limiting form of the distribution of
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
as n approaches infinity is the standard normal distribution.
• The normal approximation in the theorem will be good if n ≥ 30 regardless of
the shape of the population.
• If n < 30, the approximation is good only if the population is not too different
from the normal.
• If the distribution of the population is normal then the sampling distribution
will also be exactly normal, no matter how small the size of the sample.
Example An electrical firm manufactures electric light bulbs that have a length of
life which is normally distributed with mean and standard deviation equal
to 500 and 50 hours, respectively. Find the probability that a random
sample of 15 bulbs will have an average life of less than 475 hours.
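One way to work the light bulb example is sketched below. Because the population is stated to be normal, the sample mean is exactly normal with standard deviation σ/√n even for n = 15. The sketch uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 500, 50, 15

# Standard deviation of the sample mean (the standard error).
standard_error = sigma / sqrt(n)

# Standardize 475 hours and look up the area to its left.
z = (475 - mu) / standard_error
prob = NormalDist().cdf(z)

print(round(z, 2), round(prob, 4))
```

The z-value is a little below -1.9, so the probability is small (between 2% and 3%).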
Thm4 (The t-distribution) If $\bar{X}$ and S² are the mean and variance, respectively, of a
random sample of size n taken from a population which is normally distributed
with mean µ and variance σ², then
$T = \frac{\bar{X} - \mu}{S / \sqrt{n}}$
is a random variable having the t-distribution with v = n-1 degrees of freedom.
Notation: T ~ t(v = n-1)
Just like any continuous probability distribution, the probability that a random
sample produces a t-value falling between any two specified values is equal to the
area under the curve of the t-distribution between any two ordinates
corresponding to the specified values.
Examples:
1. Find the following values on the t -table:
(a) t0.025 when v = 14.
(b) t0.99 when v=10.
2. Find k such that P(k < T < 2.807) = 0.945 when T ~ t(23)
3. A manufacturing firm claims that the batteries used in their electronic games will
last an average of 30 hours. To maintain this average, 16 batteries are tested each
month. If the computed t-value falls between -t0.025 and t0.025 , the firm is satisfied
with its claim. What conclusion should the firm draw from a sample that has
mean $\bar{X}$ = 27.5 hours and standard deviation S = 5 hours? Assume the
distribution of battery lives to be approximately normal.
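The battery example reduces to computing one t-value and comparing it with the table value. This is a minimal sketch; the critical value 2.131 for v = 15 is typed in from a standard t-table rather than computed, and is an assumption of the sketch.

```python
from math import sqrt

claimed_mean = 30.0
n, x_bar, s = 16, 27.5, 5.0

# t = (x_bar - mu) / (s / sqrt(n)) with v = n - 1 = 15 degrees of freedom.
t_value = (x_bar - claimed_mean) / (s / sqrt(n))

t_crit = 2.131  # t_{0.025} with v = 15, read from a t-table (assumed value)

# The firm is satisfied when the t-value lies between -t_crit and t_crit.
satisfied = -t_crit < t_value < t_crit
print(t_value, satisfied)   # -2.0 True
```

Since -2.0 lies inside (-2.131, 2.131), the sample gives the firm no reason to abandon its claim under this rule.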
Thm5 If $\bar{X}$ and S² are the mean and variance, respectively, of a random sample of size
n > 30 taken from a population with mean µ and variance σ², then $\frac{\bar{X} - \mu}{S / \sqrt{n}}$ is a
random variable having approximately a standard normal distribution.
P(tv > tα,v) = α
CHAPTER 8
Estimation
Defn Statistical inference refers to methods by which one uses sample information to
make inferences or generalizations about a population.
Example A commonly prescribed drug on the market for relieving nervous tension
is believed to be only 60% effective. Experimental results with a new drug
administered to a random sample of 100 adults who were suffering from
nervous tension showed that 70 received relief.
a. Estimate the population proportion of nervous tension patients who
will receive relief with the experimental drug.
b. Is this sufficient evidence to conclude that the new drug is superior
to the one commonly prescribed?
Point Estimation
Defn An estimator is any statistic whose value is used to estimate an unknown
parameter. A realized value of an estimator is called an estimate.
Remarks:
1. An estimator is said to be unbiased if the average of the estimates it produces
under repeated sampling is equal to the true value of the parameter being
estimated.
Examples:
• Under random sampling, the sample mean is an unbiased estimator of the
population mean, that is, E($\bar{X}$) = μ.
• Under random sampling with replacement, S² is an unbiased estimator of σ²,
but S, on the other hand, is a biased estimator of σ, with the bias becoming
insignificant for large samples.
With replacement                  Without replacement
Samples   S²        S             Samples   S²        S
0,0       0         0
0,1       .5        .707107       0,1       .5        .707107
0,2       2         1.414214      0,2       2         1.414214
1,0       .5        .707107       1,0       .5        .707107
1,1       0         0
1,2       .5        .707107       1,2       .5        .707107
2,0       2         1.414214      2,0       2         1.414214
2,1       .5        .707107       2,1       .5        .707107
2,2       0         0
Mean      .666667   .628539       Mean      1         .942809
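The two "Mean" rows of the table can be reproduced by enumeration. This is a minimal sketch for the population 0, 1, 2 (σ² = 2/3, σ ≈ .8165), treating the left half of the table as sampling with replacement and the right half as sampling without replacement; the helper names are illustrative.

```python
from itertools import product, permutations
from math import sqrt

population = [0, 1, 2]

def sample_variance(sample):
    """S^2 with divisor n - 1."""
    n = len(sample)
    x_bar = sum(sample) / n
    return sum((x - x_bar) ** 2 for x in sample) / (n - 1)

def average_s2_and_s(samples):
    """Average of S^2 and average of S over a list of equally likely samples."""
    s2_values = [sample_variance(s) for s in samples]
    avg_s2 = sum(s2_values) / len(s2_values)
    avg_s = sum(sqrt(v) for v in s2_values) / len(s2_values)
    return avg_s2, avg_s

with_replacement = average_s2_and_s(list(product(population, repeat=2)))
without_replacement = average_s2_and_s(list(permutations(population, 2)))

print(with_replacement)      # approximately (0.666667, 0.628539)
print(without_replacement)   # approximately (1.0, 0.942809)
```

With replacement, the average of S² equals σ² = 2/3 while the average of S (.6285) falls short of σ (.8165), which is the bias the remark describes.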
2. A parameter can have more than one unbiased estimator. We would naturally
choose the unbiased estimator with the smallest variance.
Possible Samples $\bar{X}$ Md
0,1,2 1 1
0,1,3 1.33 1
0,2,1 1 1
0,2,3 1.67 2
0,3,1 1.33 1
0,3,2 1.67 2
1,0,2 1 1
1,0,3 1.33 1
1,2,0 1 1
1,2,3 2 2
1,3,0 1.33 1
1,3,2 2 2
2,0,1 1 1
2,0,3 1.67 2
2,1,0 1 1
2,1,3 2 2
2,3,0 1.67 2
2,3,1 2 2
3,0,1 1.33 1
3,0,2 1.67 2
3,1,0 1.33 1
3,1,2 2 2
3,2,0 1.67 2
3,2,1 2 2
Mean 1.5 1.5
Variance .138889 .25
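The last two rows of the table can be recomputed as follows. This is a minimal sketch, assuming the population is 0, 1, 2, 3 (the values that appear in the listed samples) and that the 24 ordered samples of size 3 without replacement are equally likely.

```python
from itertools import permutations
from statistics import median

population = [0, 1, 2, 3]   # assumed: the values appearing in the table
samples = list(permutations(population, 3))   # 24 ordered samples

def mean_and_variance(values):
    """Mean and variance (divisor = number of values) of a list of estimates."""
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / len(values)

sample_means = [sum(s) / 3 for s in samples]
sample_medians = [median(s) for s in samples]

print(mean_and_variance(sample_means))     # approximately (1.5, 0.138889)
print(mean_and_variance(sample_medians))   # (1.5, 0.25)
```

Both estimators average to the population mean 1.5, but the sample mean has the smaller variance, so it would be preferred under the criterion in Remark 2.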
Interval Estimation
Example The running times (in minutes) of a sample of films produced by Star-
Regal Theater are as follows: 103 94 110 87 98.
A 95% confidence interval for the mean running time of films produced
by Star-Regal Theater is (87.6, 109.2).
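The stated interval is consistent with a t-based interval on 4 degrees of freedom. The sketch below assumes that method (mean ± t × S/√n, as in Section 8.4) and uses the t-table value 2.776, which is typed in rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

times = [103, 94, 110, 87, 98]
n = len(times)

x_bar = mean(times)   # sample mean, 98.4
s = stdev(times)      # sample standard deviation (divisor n - 1)
t_crit = 2.776        # t_{0.025} with v = 4, from a t-table (assumed value)

margin = t_crit * s / sqrt(n)
lower, upper = x_bar - margin, x_bar + margin
print(round(lower, 1), round(upper, 1))   # 87.6 109.2
```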
Remarks:
1. In general, we construct a (1-α)100% confidence interval. The fraction (1-α) is
called the confidence coefficient, and the endpoints a and b are called the lower
and upper confidence limits, respectively.
2. The confidence coefficient is not “the probability that the true value of the
parameter falls in the interval estimate” since once a sample is drawn and a
confidence interval constructed, the resulting interval estimate either encloses the
true value of the parameter or it doesn’t. Rather, the confidence coefficient is “the
probability that the interval estimator encloses the true value of the parameter.”
3. A good confidence interval is one that is as narrow as possible and has a large
confidence coefficient, near 1. The narrower the interval, the more exactly we
have located the parameter; whereas, the larger the confidence coefficient, the
more confidence we have that a particular interval encloses the true value of the
parameter. However, for a fixed sample size, as the confidence coefficient
increases, the length of the interval also increases.
4. Interpretation of (1-α)100% confidence interval: If we take repeated samples of
size n and if for each one of these samples we compute the (1-α)100% confidence
interval then (1-α)100% of the resulting confidence intervals will contain the
unknown value of the parameter.
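Consider five observations making up the population values 0, 1, 2, 3, and 4 with
$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = 2$ and $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = 2$.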
Suppose we list all possible samples of size 2, with replacement
and compute the 90% confidence interval for each possible sample.
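The repeated-sampling interpretation can be illustrated by listing every sample and counting how many intervals cover µ. This is a minimal sketch, assuming the population is 0, 1, 2, 3, 4 (the values consistent with µ = 2 and σ² = 2) and that σ is treated as known, so each interval is x̄ ± z × σ/√n.

```python
from itertools import product
from math import sqrt
from statistics import NormalDist

population = [0, 1, 2, 3, 4]   # assumed population: it has mu = 2 and sigma^2 = 2
mu = sum(population) / len(population)
sigma = sqrt(sum((x - mu) ** 2 for x in population) / len(population))

n = 2
z = NormalDist().inv_cdf(0.95)   # z_{0.05}, about 1.645, for a 90% interval
margin = z * sigma / sqrt(n)

# All 25 ordered samples of size 2 drawn with replacement.
samples = list(product(population, repeat=n))
covered = 0
for s in samples:
    x_bar = sum(s) / n
    if x_bar - margin <= mu <= x_bar + margin:
        covered += 1

print(covered, len(samples), covered / len(samples))   # 23 25 0.92
```

Only the samples (0, 0) and (4, 4) give intervals that miss µ = 2. The coverage is 92% rather than exactly 90% because the sampling distribution of the mean is discrete here and only approximately normal.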
8.2 ESTIMATING PROPORTIONS
In a binomial experiment a point estimator of the proportion p is $\hat{p} = \frac{X}{n}$, where X
represents the number of successes in n trials, with standard error of $\sqrt{\frac{\hat{p}\hat{q}}{n}}$ and margin of
error of $z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$. If the unknown proportion is not expected to be too close to 0 or 1
and n is large, an approximate (1-α)100% confidence interval for p is given by
$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$
Table lookups:
t-table: α/2 and v ⎯→ tα/2,v
z-table: α/2 ⎯→ zα/2
Example In a random sample of 200 students who enrolled in Math 17, 138 passed
on their first take. Construct a 95% confidence interval for the population
proportion of students who passed Math 17 on their first take.
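The Math 17 example can be worked directly from the interval formula for p. This is a minimal sketch; it obtains z from `statistics.NormalDist` instead of the z-table, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

x, n = 138, 200
p_hat = x / n
q_hat = 1 - p_hat

z = NormalDist().inv_cdf(0.975)          # z_{0.025}, about 1.96
margin = z * sqrt(p_hat * q_hat / n)     # margin of error

lower, upper = p_hat - margin, p_hat + margin
print(round(lower, 3), round(upper, 3))  # 0.626 0.754
```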
If $\hat{p}$ will be used to estimate p, then we can be (1-α)100% confident that the error will
not exceed a specified amount, e, when the sample size is $n = \frac{z_{\alpha/2}^2\, p(1 - p)}{e^2}$.
When the value of p is unknown or cannot be approximated, then using p = 0.5 produces
the maximum value of p(1-p) = 0.25. Hence a conservative formula for the sample size is
$n = \frac{z_{\alpha/2}^2}{4e^2}$
Example Use the conservative formula to determine the sample size needed if we
want to be 95% confident that our estimate of p is within 0.05 of the true
value.
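The conservative formula is easy to wrap in a small helper. This is a minimal sketch; `conservative_sample_size` is a hypothetical name, and the result is rounded up because a sample size must be a whole number at least as large as the formula's value.

```python
from math import ceil
from statistics import NormalDist

def conservative_sample_size(confidence, e):
    """n = z_{alpha/2}^2 / (4 e^2), rounded up to the next whole number."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 / (4 * e ** 2))

print(conservative_sample_size(0.95, 0.05))   # 385
```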
The SWS national survey for the fourth quarter of 2013, done on Dec. 11-16, expanded
its Visayas sample to 650 households, from the usual 300, thus reducing the Visayas error
margin to 4 percentage points, from the usual 6 points. This raised the national sample
size to 1,550 households, from the usual 1,200, enhancing the quality of Yolanda-related
items in particular, since the Visayas was the area that suffered the most.
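A conservative estimate for the margin of error is $z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} \approx \frac{1}{\sqrt{n}}$, since
$\hat{p}\hat{q} \le 1/4$ and $z_{\alpha/2} \approx 2$ at 95% confidence.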
8.3 ESTIMATING THE DIFFERENCE OF TWO PROPORTIONS
Given 2 independent random samples of size n1 and n2, a point estimator of the difference
between the two proportions p1 and p2 is given by $\hat{p}_1 - \hat{p}_2 = \frac{X_1}{n_1} - \frac{X_2}{n_2}$, where X1 is the
number of successes in n1 trials (first sample) and X2 is the number of successes in n2 trials
(second sample). An approximate (1-α)100% confidence interval for p1 - p2 is
$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$
Example In a random sample of 200 students, 78 of the 120 females and 60 of the
80 males passed Math 17 on their first take. Construct a 95% confidence
interval for p1- p2, where p1 and p2 are the true proportions of females and
males, respectively, who passed Math 17 on their first take.
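The female/male example can be sketched as follows, taking sample 1 to be the females and sample 2 the males as the example specifies. The z-value comes from `statistics.NormalDist`; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

x1, n1 = 78, 120   # females who passed, females sampled
x2, n2 = 60, 80    # males who passed, males sampled

p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2

# Standard error of the difference of two independent sample proportions.
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = NormalDist().inv_cdf(0.975)

lower, upper = diff - z * se, diff + z * se
print(round(lower, 3), round(upper, 3))   # -0.228 0.028
```

The interval contains 0, so these data alone do not show a difference between the two passing rates at the 95% level.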
Exercises: pp. 273-274 of Walpole nos. 1-13
1. A random sample of 200 voters is selected and 120 are found to support an
annexation suit. Find the 96% confidence interval for the fraction of the voting
population favoring the suit.
2. A random sample of 400 cigarette smokers is selected and 86 are found to have a
preference for brand X. Find the 90% confidence interval for the fraction of the
population of cigarette smokers who prefer brand X.
3. In a random sample of 1000 homes in a certain city, it is found that 628 are heated
by natural gas. Find the 98% confidence interval for the fraction of homes in this
city that are heated by natural gas.
4. A random sample of 75 college students is selected and 16 are found to have cars
on campus. Use a 95% confidence interval to estimate the fraction of students
who have cars on campus.
5. A new rocket-launching system is being considered for deployment of small
short-range launches. The existing system has p = .8 as the probability of a
successful launch. A sample of 40 experimental launches is made with the new
system and 34 are successful. Construct a 95% confidence interval for p.
6. How large a sample is needed in Exercise 1 if we wish to be 96% confident that
our sample proportion will be within .02 of the true fraction of the voting
population?
7. How large a sample is needed in Exercise 3 if we wish to be 98% confident that
our sample proportion will be within .05 of the true proportion of homes in this
city that are heated by natural gas?
8. A study is to be made to estimate the percentage of citizens in a town who favor
having their water fluoridated. How large a sample is needed if one wishes to be
at least 95% confident that our estimate is within 1% of the true percentage?
9. According to Dr. Memory Elvin-Lewis, head of the microbiology department at
Washington University School of Dental Medicine in St. Louis, a couple of cups
of either green or oolong tea each day will provide sufficient fluoride to protect
your teeth from decay. People who do not like tea and who live in unfluoridated
areas should ask their local governments to consider having their water
fluoridated. How large a sample is needed to estimate the percentage of citizens
in a certain town who favor having their water fluoridated if one wishes to be at
least 99% confident that the estimate is within 1% of the true percentage?
10. In a study to estimate the proportion of residents in a certain city and its suburbs
who favor the construction of a nuclear power plant, it is found that 52 of 100
urban residents favor the construction while only 34 of 125 suburban residents are
in favor. Find a 96% confidence interval for the difference between the
proportion of urban and suburban residents who favor construction of the nuclear
plant.
11. A cigarette-manufacturing firm claims that its brand A line of cigarettes outsells
its brand B line by 8%. If it is found that 42 of 200 smokers prefer brand A and
18 of 150 smokers prefer brand B, compute a 94% confidence interval for the
difference between the proportions of sales of the two brands.
12. A geneticist is interested in the proportion of males and females in the population
that have a certain minor blood disorder. In a random sample of 100 males, 24
are found to be afflicted, whereas 13 of the 100 females tested appear to have the
disorder. Compute a 99% confidence interval for the difference between the
proportion of males and females that have this blood disorder.
13. A study is made to determine if a cold climate results in more students being
absent from school during a semester than for a warmer climate. Two groups of
students are selected at random, one group from Vermont and the other group
from Georgia. Of the 300 students from Vermont, 64 were absent at least 1 day
during the semester, and of the 400 students from Georgia, 51 were absent 1 or
more days. Find a 95% confidence interval for the difference between the
fractions of the students who are absent in the two states.
During World War II, Allied military planners needed estimates of the number of tanks
Germany was manufacturing. The information provided by traditional spying methods
was not reliable, but statistical sampling methods proved to be valuable. For example,
espionage and reconnaissance led analysts to estimate 1550 tanks were produced during
June 1941. However, using the serial numbers of captured tanks and statistical analysis,
military planners estimated the number of tanks to be 244. This estimate turned out to be
27 less than the actual number manufactured by the Germans in June 1941. A similar
type of analysis was used to estimate the number of Iraqi tanks destroyed during Desert
Storm.
Maximum number of unrelated numbers the average person can store in short-term
memory: 7 ± 2
Amount of time short-term memory retains this information: 15 to 30 seconds
Thm6 If $\hat{p}$ is the proportion computed from a random sample of size n taken from a
(large or infinite) population with mean p and variance pq, then the sampling
distribution of $\hat{p}$ is approximately normally distributed with mean E($\hat{p}$) = p and
variance Var($\hat{p}$) = pq/n when n is sufficiently large. Hence, the limiting form of
the distribution of $\frac{\hat{p} - p}{\sqrt{pq / n}}$ as n approaches infinity is the standard normal
distribution.
Thm7 If independent random samples of size n1 and n2 are drawn from two large or
infinite populations with means p1 and p2 and variances p1q1 and p2q2,
respectively, then
$\frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}}$
has an approximate standard normal distribution.
8.4 Estimation of µ
a. when σ is known
$\bar{X} \pm z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$ where zα/2 is the z-value leaving an area of α/2 to the right.
b. when σ is unknown, n ≤ 30
$\bar{X} \pm t_{\alpha/2,v}\left(\frac{S}{\sqrt{n}}\right)$ where tα/2,v is the t-value with v = n - 1 degrees of freedom
leaving an area of α/2 to the right.
Remarks:
1. The above formulas hold strictly for random samples from a normal distribution.
However, they provide good approximate (1-α)100% confidence intervals when
the distribution is not normal provided the sample size is large, i.e. n > 30.
2. If σ is unknown and n > 30, use $\bar{X} \pm z_{\alpha/2}\left(\frac{S}{\sqrt{n}}\right)$ where zα/2 is the z-value leaving an
area of α/2 to the right.
Examples:
1. The mean and standard deviation for the quality grade-point averages of a random
sample of 36 college seniors are calculated to be 2.6 and .3, respectively. Find the
95% and 99% confidence intervals for the mean of the entire senior class.
2. The contents of 7 similar containers of sulfuric acid are 9.8, 10.2, 10.4, 9.8, 10.0,
10.2, and 9.6 liters. Find a 95% confidence interval for the mean content of all
such containers, assuming an approximate normal distribution.
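Both examples can be sketched in a few lines. Example 1 has n = 36 > 30 with σ unknown, so the z-based interval with S is used; Example 2 has n = 7, so the t-based interval is used. The t-table value 2.447 is typed in rather than computed, and `z_interval` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def z_interval(x_bar, s, n, confidence):
    """Large-sample interval: x_bar +/- z_{alpha/2} * s / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * s / sqrt(n)
    return x_bar - margin, x_bar + margin

# Example 1: grade-point averages of 36 seniors.
ci_95 = z_interval(2.6, 0.3, 36, 0.95)
ci_99 = z_interval(2.6, 0.3, 36, 0.99)

# Example 2: contents of 7 containers of sulfuric acid.
contents = [9.8, 10.2, 10.4, 9.8, 10.0, 10.2, 9.6]
t_crit = 2.447   # t_{0.025} with v = 6, read from a t-table (assumed value)
margin = t_crit * stdev(contents) / sqrt(len(contents))
ci_acid = (mean(contents) - margin, mean(contents) + margin)

for label, ci in (("95%", ci_95), ("99%", ci_99), ("acid", ci_acid)):
    print(label, round(ci[0], 2), round(ci[1], 2))
```

The 99% interval for Example 1 comes out wider than the 95% interval, in line with the earlier remark that a larger confidence coefficient lengthens the interval for a fixed sample size.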
$n = \left(\frac{z_{\alpha/2}\,\sigma}{e}\right)^2$
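The sample-size formula for estimating a mean to within e can be sketched as a helper. The function name is hypothetical, and the values σ = 40 and e = 15 in the call are illustrative inputs, not a worked answer from the notes; the result is rounded up to a whole number.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(confidence, sigma, e):
    """n = (z_{alpha/2} * sigma / e)^2, rounded up to the next whole number."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / e) ** 2)

print(sample_size_for_mean(0.95, 40, 15))   # 28
```

Halving e roughly quadruples the required n, since e enters the formula squared.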
Exercises: pp. 262-264 of Walpole nos. 3-13
3. An electrical firm manufactures light bulbs that have a length of life that is
normally distributed, with a standard deviation of 40 hours. If a random sample of
30 bulbs has a mean life of 780 hours, find a 96% confidence interval for the
population mean of all bulbs produced by this firm.
4. A soft-drink machine is regulated so that the amount of drink dispensed is
approximately normally distributed with a standard deviation of 1.5 dl. Find a
95% confidence interval for the mean of all drinks dispensed by this machine if a
random sample of 36 drinks had an average content of 22.5 dl.
5. The heights of a random sample of 50 college students showed a mean of 174.5
cm. and a standard deviation of 6.9 cm. Construct a 98% confidence interval for
the mean height of all college students.
6. A random sample of 100 automobile owners shows that an automobile is driven
on the average 23,500 kilometers per year, in the state of Virginia, with a standard
deviation of 3900 kilometers. Construct a 99% confidence interval for the average
distance an automobile is driven annually in Virginia.
7. How large a sample is needed in Exercise 3 if we wish to be 96% confident that
our sample mean will be within 10 hours of the true mean?
8. How large a sample is needed in Exercise 4 if we wish to be 95% confident that
our sample mean will be within .3 ounce of the true mean?
9. An efficiency expert wishes to determine the average time that it takes to drill 3
holes in a certain metal clamp. How large a sample will he need to be 95%
confident that his sample mean will be within 15 sec. of the true mean? Assume
that it is known from previous studies that σ = 40 sec.
10. Regular consumption of presweetened cereals contributes to tooth decay, heart
disease, and other degenerative diseases, according to a study by Dr. M. Albreight
of the National Institute of Health and Dr. D. Solomon, Professor of Nutrition and
Dietetics at the University of London. In a random sample of 20 similar servings
of Alpha-Bits, the mean sugar content was 11.3 grams with a standard deviation
of 2.45 grams. Assuming that the sugar content is normally distributed, construct
a 95% confidence interval for the mean sugar content for single servings of
Alpha-Bits.
11. The contents of 10 similar containers of a commercial soap are 10.2, 9.7, 10.1,
10.3, 10.1, 9.8, 9.9, 10.4, 10.3, and 9.8 liters. Find a 99% confidence interval for
the mean soap content of all such containers, assuming an approximate normal
distribution.
12. A random sample of 8 cigarettes of a certain brand has an average nicotine
content of 3.6 mg. and a standard deviation of .9 mg. Construct a 99% confidence
interval for the true average nicotine content of this particular brand of cigarettes,
assuming an approximate normal distribution.
13. A random sample of 12 female students in a certain dormitory showed an average
weekly expenditure of $8.00 for snack foods, with a standard deviation of $1.75.
Construct a 90% confidence interval for the average amount spent each week on
snack foods by female students living in this dormitory, assuming the
expenditures to be approximately normally distributed.
8.5 Inferences About µ1 - µ2
In comparing two populations with means µ1 and µ2 and standard deviations σ1 and σ2,
respectively, the analysis of the sample data depends on the sampling design used.
Defn Two samples are called independent samples when the measurements in one
sample are not related to the measurements in the other sample.
• Random samples are taken separately from two populations and the same
response variable is recorded for each individual
• One random sample is taken and a variable recorded for each individual, but then
units are categorized as belonging to one population or another
• Participants are randomly assigned to one of two treatment conditions and the
same response variable is recorded for each individual unit
Defn The term paired (or matched/related/dependent) data means that data have been
observed in natural pairs.
• Each person is measured twice. The two measurements of the same characteristic
or trait are made under different situations.
• Similar individuals are paired prior to the experiment. During the experiment,
each member of a pair receives a different treatment. The same response variable
is measured for all individuals.
Example An independent samples design and a matched samples design are under
consideration for a study to obtain an estimate of the weight loss in a
shipment of bananas during transit.
Independent Samples. A random sample of banana bunches is selected from the lot and
weighed before loading. After shipment, an independent random sample of bunches is
selected and weighed during the unloading. The difference in the two sample mean
weights per bunch is used as the estimate of weight loss per bunch.
Matched Samples. A random sample of banana bunches is selected and weighed before
loading. After shipment, the same bunches selected before loading are weighed again,
and the difference in weight for each bunch is noted. The mean of these differences is
used as an estimate of the weight loss per bunch.
Exercise: For each item, identify whether the samples are independent or not.
1. A police department performs an experiment to assess the effects of an obvious
radar trap on the speeds of cars. Ten cars are randomly selected on a highway,
and their speeds are measured just before a radar trap comes into view and just
after they pass the trap.
2. A tire manufacturer is testing 2 new tread designs in terms of stopping distance.
To do this, the company uses two test cars driving side by side at the same speed.
Both cars have automatic braking systems so that both sets of brakes engage
simultaneously on a signal. Then the stopping distances for both cars are
measured after repeating the experiment 10 times.
3. A company which does a large volume of business by mail decides to test whether
there is a difference in mail delivery between those items brought to a post office
as compared to those put in a corner mailbox. A random sample of 100 customers
from the same city was selected and their letters were mailed from the post office.
Another random sample of 100 customers was then selected but their letters were
sent from the corner mailbox.
4. Two formulations of a new skin-softening lotion are to be compared as to their
softening action. A random sample of 40 potential users of the lotion is selected.
Each person in the sample is independently assigned at random one of the two
formulations to be applied to the left arm and the other formulation to be applied
to the right arm. After a lapse of eight hours, each person is asked to rate the
skin-softening effect of each formulation on a 10-point scale.
If we have two populations with means µ1 and µ2 and standard deviations σ1 and σ2,
respectively, a point estimator of the difference µ1 - µ2 is the statistic $\bar{X}_1 - \bar{X}_2$.
Thm8 If independent random samples of size n1 and n2 are drawn from two large or
infinite populations with means µ1 and µ2 and standard deviations σ1 and σ2,
respectively, then the sampling distribution of the difference of means $\bar{X}_1 - \bar{X}_2$
is approximately normally distributed with mean and standard deviation given by
$\mu_{\bar{X}_1 - \bar{X}_2} = \mu_1 - \mu_2$ and $\sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
Hence $z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}}$ is a value of the standard normal variable Z.
Thm9 If independent random samples of size n1, n2 ≤ 30 are drawn from two normal
populations with means µ1 and µ2 and variances $\sigma_1^2 = \sigma_2^2$, respectively, then
$\frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{S_p\sqrt{(1/n_1) + (1/n_2)}}$, where $S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$, has a t-distribution
with v = n1 + n2 − 2 degrees of freedom.
Thm10 If independent random samples of size n1, n2 ≤ 30 are drawn from two normal
populations with means µ1 and µ2 and variances $\sigma_1^2 \neq \sigma_2^2$, respectively, then
$\frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{(S_1^2 / n_1) + (S_2^2 / n_2)}}$ has approximately a t-distribution with
$v = \frac{(S_1^2 / n_1 + S_2^2 / n_2)^2}{\frac{(S_1^2 / n_1)^2}{n_1 - 1} + \frac{(S_2^2 / n_2)^2}{n_2 - 1}}$ degrees
of freedom.
Thm11 If independent random samples of size n1, n2 > 30 are drawn from two populations
with means µ1 and µ2 and standard deviations σ1 and σ2, respectively, then
$\frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_1^2 / n_1 + s_2^2 / n_2}}$
has an approximate standard normal distribution.
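(1-α)100% Confidence Interval for µ1-µ2 for Independent Samples
a. $\sigma_1^2$ and $\sigma_2^2$ are known
$(\bar{X}_1 - \bar{X}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
b. $\sigma_1^2 = \sigma_2^2$ but unknown, n1,n2 ≤ 30
$(\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2,\,n_1 + n_2 - 2}\sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
c. $\sigma_1^2 \neq \sigma_2^2$ but unknown, n1,n2 ≤ 30
$(\bar{X}_1 - \bar{X}_2) \pm t_{\alpha/2,\,v}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$
where $v = \frac{(S_1^2 / n_1 + S_2^2 / n_2)^2}{\frac{(S_1^2 / n_1)^2}{n_1 - 1} + \frac{(S_2^2 / n_2)^2}{n_2 - 1}}$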
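The degrees of freedom v in case (c) are tedious to compute by hand, so a small helper may be useful. This is a minimal sketch; `welch_df` is a hypothetical name, and the inputs in the call are the sample standard deviations and sizes from the rainfall example later in this section.

```python
def welch_df(s1, n1, s2, n2):
    """Degrees of freedom v for the unequal-variance interval, case (c)."""
    a = s1 ** 2 / n1
    b = s2 ** 2 / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

v = welch_df(1.14, 15, 0.66, 10)
print(round(v, 1))   # about 22.7
```

The value is generally not a whole number, so it must be rounded before the t-table can be used; whether to round down or to the nearest integer is a convention the notes do not state here.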
Remarks:
1. These formulas hold strictly for independent samples selected from Normal
populations. However, they provide good approximate (1-α)100% confidence
intervals when the distributions are not Normal provided both n1 and n2 are greater
than 30.
2. If $\sigma_1^2$ and $\sigma_2^2$ are unknown but n1 and n2 are greater than 30, use
$(\bar{X}_1 - \bar{X}_2) \pm z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$
3. Even if the population variances are considerably different, formula (b) will still
provide a good estimate provided that n1=n2 and both populations are normal.
Therefore, in a planned experiment, one should make every effort to equalize the
size of the samples.
Examples:
1. A statistics test was given to a random sample of 50 girls and another random
sample of 75 boys. The mean score of the girls is 80 with a standard deviation of
4 and the mean score of the boys is 86 with a standard deviation of 6. Find a 95%
confidence interval for the difference µB - µG.
2. A course in mathematics is taught to 12 students by the conventional classroom
procedure. A second group of 10 students was given the same course by means of
programmed materials. At the end of the semester the same examination was
given to each group. The 12 students meeting in the classroom made an average
grade of 85 with a standard deviation of 4, while the 10 students using
programmed materials made an average of 81 with a standard deviation of 5.
Find a 90% confidence interval for the difference between the population means,
assuming the populations are approximately normal with equal variances.
3. Records for the past 15 years have shown the average rainfall in a certain region
to be 4.93 cm., with a standard deviation of 1.14 cm. A second region has had an
average rainfall of 2.64 cm., with a standard deviation of .66 cm. during the past
10 years. Find a 95% confidence interval for the difference of the true average
rainfalls in these regions, assuming that the observations come from normal
populations with different variances.
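Examples 1 and 2 can be sketched as follows. Example 1 has both samples larger than 30, so the z-based interval with S1 and S2 is used; Example 2 assumes equal variances with small samples, so the pooled interval (b) is used. The t-table value 1.725 is typed in rather than computed, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Example 1: girls (n = 50) and boys (n = 75), interval for mu_B - mu_G.
n_g, mean_g, s_g = 50, 80, 4
n_b, mean_b, s_b = 75, 86, 6
z = NormalDist().inv_cdf(0.975)
se1 = sqrt(s_b ** 2 / n_b + s_g ** 2 / n_g)
ci1 = (mean_b - mean_g - z * se1, mean_b - mean_g + z * se1)

# Example 2: classroom (n = 12) versus programmed materials (n = 10).
n1, mean1, s1 = 12, 85, 4
n2, mean2, s2 = 10, 81, 5
sp = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))  # pooled S
t_crit = 1.725   # t_{0.05} with v = 20, read from a t-table (assumed value)
margin2 = t_crit * sp * sqrt(1 / n1 + 1 / n2)
ci2 = (mean1 - mean2 - margin2, mean1 - mean2 + margin2)

print(round(ci1[0], 2), round(ci1[1], 2))   # 4.25 7.75
print(round(ci2[0], 2), round(ci2[1], 2))   # 0.69 7.31
```

Both intervals lie entirely above 0, which suggests a real difference between the group means in each example.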
When the population of differences D is normal or does not depart too markedly from
normality, a confidence interval for µD = µx - µy is:
$\bar{d} \pm t_{\alpha/2,\,n-1}\frac{S_d}{\sqrt{n}}$
where $d_i = x_i - y_i$,
$\bar{d} = \frac{\sum_{i=1}^{n} d_i}{n}$, $S_d^2 = \frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1} = \frac{n\sum_{i=1}^{n} d_i^2 - \left(\sum_{i=1}^{n} d_i\right)^2}{n(n - 1)}$,
and n is the number of pairs.
Example: Twenty college freshmen were divided into 10 pairs, each member of the pair
having approximately the same IQ. One of each pair was selected at random and
assigned to a mathematics section using programmed materials only. The other
member of each pair was assigned to a section in which the professor lectured. At
the end of the semester each group was given the same examination and the
following results were recorded.
Pair                    1   2   3   4   5   6   7   8   9  10
Programmed Materials   76  60  85  58  91  75  82  64  79  88
Lectures               81  52  87  70  86  77  90  63  85  83
Find a 98% confidence interval for the true difference in the two learning
procedures. Assume normality.
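The paired example can be sketched by working with the ten differences. The differences are taken here as programmed minus lectures, which is an assumption about direction (reversing it only flips the signs of the limits), and the t-table value 2.821 is typed in rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

programmed = [76, 60, 85, 58, 91, 75, 82, 64, 79, 88]
lectures = [81, 52, 87, 70, 86, 77, 90, 63, 85, 83]

# One difference per pair: d_i = x_i - y_i.
d = [p - l for p, l in zip(programmed, lectures)]
n = len(d)

d_bar = mean(d)    # mean difference, -1.6
s_d = stdev(d)     # standard deviation of the differences (divisor n - 1)
t_crit = 2.821     # t_{0.01} with v = 9, read from a t-table (assumed value)

margin = t_crit * s_d / sqrt(n)
ci = (d_bar - margin, d_bar + margin)
print(round(ci[0], 2), round(ci[1], 2))   # -7.29 4.09
```

The interval contains 0, so at the 98% level these data do not show a difference between the two learning procedures.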
Thm12 If $\bar{d} = \bar{X} - \bar{Y}$ and $S_d^2$ are the mean and variance, respectively, of a random sample
of size n taken from a population of differences which is normally distributed
with mean µD = µx - µy and variance $\sigma_D^2$, then $\frac{\bar{d} - \mu_D}{S_d / \sqrt{n}}$ is a random variable
having the t-distribution with v = n-1 degrees of freedom.
courses. Assume the populations to be approximately normally distributed with
equal variances.
19. A taxi company is trying to decide whether to purchase brand A or brand B tires
for its fleet of taxis. To estimate the difference in the two brands, an experiment
is conducted using 12 of each brand. The tires are run until they wear out. The
results are $\bar{x}_A$ = 36,300 km., $s_A$ = 5000 km., $\bar{x}_B$ = 38,100 km., and $s_B$ = 6100 km.
Construct a 95% confidence interval for µA-µB, assuming the populations to be
approximately normally distributed.
20. The following data represent the running time of a random sample of films
produced by two motion picture companies:
Time (minutes)
Company 1 103 94 110 87 98
Company 2 97 82 123 92 175 88 118
Compute a 90% confidence interval for the difference between the mean running
times of films produced by the two companies. Assume that the running times for
each of the companies are approximately normally distributed with unequal
variances.
University
1 2 3 4 5 6 7 8 9
Variety 1 38 23 35 41 44 29 37 31 38
Variety 2 45 25 31 38 50 33 36 40 43
Find a 95% confidence interval for the mean difference between the yields of the
two varieties assuming the distributions of yields to be approximately normal.
22. Referring to Exercise 19, find a 99% confidence interval for µA-µB if a tire from
each company is assigned at random to the rear wheels of 8 taxis and the
following distances in km., recorded:
23. It is claimed that a new diet will reduce a person’s weight by 4.5 kilograms on the
average in a period of 2 weeks. The weights of a random sample of 7 women who
followed this diet were recorded before and after a 2-week period:
Woman
1 2 3 4 5 6 7
Weight Before 58.5 60.3 61.7 69.0 64.0 62.6 56.7
Weight After 60.0 54.9 58.1 62.1 58.5 59.9 54.4
Compute a 95% confidence interval for the mean difference in the weight.
Assume the distribution of weights to be approximately normal.
CHAPTER 9
Tests of Hypothesis
A two-tailed test of hypothesis is a test where the alternative hypothesis does not
specify a directional difference for the parameter of interest.
Examples:
a. Is there a general preference for Coke or Pepsi?
Ho: p = .5 vs. Ha: p ≠ .5
b. Is the proportion favoring death penalty the same for teenagers as it is for
adults? Ho: pT - pA = 0 vs. Ha: pT - pA ≠ 0
7. The Type I error is the error made by rejecting the null hypothesis when it is
true. The probability of a Type I error is denoted by α.
The Type II error is the error made by accepting (not rejecting) the null
hypothesis when it is false. The probability of a Type II error is denoted by β.
                    Null Hypothesis
Decision       True                False
Reject Ho      Type I error        Correct decision
Accept Ho      Correct decision    Type II error
Possible Samples      z-test       Decision
of Size n=2           Ho: µ=2      α=.10
0 0 -2 Reject Ho
0 1 -1.5 Accept Ho
0 2 -1 Accept Ho
0 3 -0.5 Accept Ho
0 4 0 Accept Ho
1 0 -1.5 Accept Ho
1 1 -1 Accept Ho
1 2 -0.5 Accept Ho
1 3 0 Accept Ho
1 4 0.5 Accept Ho
2 0 -1 Accept Ho
2 1 -0.5 Accept Ho
2 2 0 Accept Ho
2 3 0.5 Accept Ho
2 4 1 Accept Ho
3 0 -0.5 Accept Ho
3 1 0 Accept Ho
3 2 0.5 Accept Ho
3 3 1 Accept Ho
3 4 1.5 Accept Ho
4 0 0 Accept Ho
4 1 0.5 Accept Ho
4 2 1 Accept Ho
4 3 1.5 Accept Ho
4 4 2 Reject Ho
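The table above can be reproduced by listing every ordered sample of size 2 and applying the z-test to each one. The following is a minimal sketch, not the notes' own computation. It assumes, as the z values in the table imply, that the population is {0, 1, 2, 3, 4} (mean 2, variance 2) sampled with replacement, and that the two-tailed critical value for α = .10 is 1.645.

```python
from itertools import product
from math import sqrt

population = [0, 1, 2, 3, 4]   # assumed population, inferred from the table
mu0 = 2                        # hypothesized mean under Ho
sigma = sqrt(2)                # population standard deviation (variance = 2)
n = 2                          # sample size
z_crit = 1.645                 # two-tailed critical value for alpha = .10

decisions = []
for sample in product(population, repeat=n):
    xbar = sum(sample) / n
    z = (xbar - mu0) / (sigma / sqrt(n))
    decision = "Reject Ho" if abs(z) > z_crit else "Accept Ho"
    decisions.append((sample, z, decision))
    print(sample, round(z, 2), decision)

# Ho is true here, so every rejection is a Type I error.
rejections = sum(1 for _, _, d in decisions if d == "Reject Ho")
actual_alpha = rejections / len(decisions)
print("Proportion of samples rejecting a true Ho:", actual_alpha)
```

Under these assumptions only the samples (0, 0) and (4, 4) lead to rejection, so the proportion of Type I errors among the 25 equally likely samples is 2/25 = .08, which does not exceed the nominal α = .10.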
The Type I error and Type II error are related. For a fixed sample size n, a
decrease in the probability of one will result in an increase in the probability of
the other. However, increasing the sample size will result in the reduction of both
probabilities.
Choices     Common Consequences of
of α        Type I Error      Type II Error
.01         Very serious      Not too serious
Example If you are on a jury in the American judicial system, you must presume
that the defendant is innocent unless there is enough evidence to conclude that he
or she is guilty. Therefore the two hypotheses are
Ho: The defendant is innocent
Ha: The defendant is guilty
The prosecution collects evidence in the hope that the jurors will be convinced
that such evidence would be extremely unlikely if the assumption of innocence
were true. Consistent with our thinking in hypothesis testing, in many cases we
would not accept the hypothesis that the defendant is innocent. We would simply
conclude that the evidence was not strong enough to rule out the possibility of
innocence. In fact, in the United States the two conclusions juries are instructed
to choose from are “guilty” and “not guilty.” A jury would never conclude, “the
defendant is innocent.”
For trials in general, here are the possible errors and the consequences that
accompany those errors:
Type I error: A “guilty” verdict for a person who is really innocent.
Consequence: An innocent person is falsely convicted. The guilty party remains
free.
Type II error: A “not guilty” verdict for a person who committed a crime.
Consequence: A criminal is not punished.
In the American court system, a false conviction is generally viewed as the more
serious error. Not only is an innocent person punished but also a guilty one
remains free. Courtroom rules and rules affecting pretrial investigations tend to
reflect society’s concern about incorrectly punishing an innocent person.
Example Imagine that you are tested to determine if you have a disease. The lab
technician or physician who evaluates your results must make a choice between
two hypotheses:
Ho: You do not have the disease.
Ha: You have the disease.
Unfortunately, many laboratory tests for diseases are not 100% accurate. There is
a chance the result is wrong. Consider the two possible errors and their
consequences:
Type I error: You are told you have the disease, but you actually don’t. The test
result was a false positive. Consequence: You will be unnecessarily concerned
about your health and you may receive unnecessary treatment.
Type II error: You are told you do not have the disease, but you actually do. The
test result was a false negative. Consequence: You do not receive treatment for a
disease that you have. If this is contagious, you may infect others.
Which error is more serious? In most medical situations, the second possible
error is more serious but this could depend on the disease and the follow-up
actions that are taken. For instance, in a screening test for cancer, a false negative
could lead to a fatal delay in treatment. Initial test results that are “positive” for
cancer are usually followed up with a retest so a false positive may be discovered
quickly.
Consider the problem of testing the hypothesis that the proportion of successes in a
binomial experiment equals some specified value.
If the unknown proportion is not expected to be too close to 0 or 1 and n is large, a large
sample approximation is given by:
Example A commonly prescribed drug on the market for relieving nervous tension
is believed to be only 60% effective. Experimental results with a new drug
administered to a random sample of 100 adults who were suffering from
nervous tension showed that 70 received relief. Is this sufficient evidence
to conclude that the new drug is superior to the one commonly prescribed?
Use a 0.05 level of significance.
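The drug example can be worked numerically. This is a rough sketch that assumes the usual large-sample statistic z = (x - n·p0) / sqrt(n·p0·(1 - p0)), with Ho: p = 0.6 against the right-tailed alternative Ha: p > 0.6. The function names `normal_cdf` and `one_proportion_z` are illustrative only, and the critical value 1.645 is the standard normal table value for α = .05.

```python
from math import sqrt, erf

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def one_proportion_z(x, n, p0):
    """Large-sample z statistic for Ho: p = p0, given x successes in n trials."""
    return (x - n * p0) / sqrt(n * p0 * (1 - p0))

z = one_proportion_z(x=70, n=100, p0=0.60)
p_value = 1 - normal_cdf(z)    # right-tailed test, Ha: p > 0.60
z_crit = 1.645                 # right-tailed critical value for alpha = .05

print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print("Reject Ho" if z > z_crit else "Do not reject Ho")
```

With these inputs z is about 2.04, which exceeds 1.645, so under the stated assumptions the data support the claim that the new drug is superior.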
The testing procedure involves selection of independent samples of size n1 and n2 from
two binomial populations. The sample proportions $\hat{p}_1$ and $\hat{p}_2$ are computed and the test is
as follows:
Example In a survey of 200 students, 78 of the 120 females in the sample passed
Math 17 on their first take while this figure is 60 among the 80 males. Will
you agree that the proportion of males who passed Math 17 on their first
take is higher than the proportion of females who passed the same course
on their first take? Test at α = 0.05.
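The Math 17 example can be checked with a short computation. The sketch below assumes the usual pooled large-sample statistic z = ($\hat{p}_1$ - $\hat{p}_2$) / sqrt($\hat{p}\hat{q}$(1/n1 + 1/n2)), where $\hat{p}$ is the pooled proportion of successes. Males are taken as population 1, so the alternative is Ha: p1 > p2. The function name `two_proportion_z` is hypothetical.

```python
from math import sqrt

def two_proportion_z(x1, n1, x2, n2):
    """Pooled large-sample z statistic for Ho: p1 = p2."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / se

# Population 1 = males (60 of 80 passed); population 2 = females (78 of 120 passed)
z = two_proportion_z(60, 80, 78, 120)
z_crit = 1.645                 # right-tailed critical value for alpha = .05

print(f"z = {z:.3f}")
print("Reject Ho" if z > z_crit else "Do not reject Ho")
```

Here z is roughly 1.50, which falls short of 1.645, so at α = 0.05 the sample does not give sufficient evidence that the proportion of males passing is higher.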
Exercises: p. 331 of Walpole nos. 1-12
1. A manufacturer of cigarettes claims that 20% of the cigarette smokers prefer
brand X. To test this claim a random sample of 20 cigarette smokers is selected
and asked what brand they prefer. If 6 of the 20 smokers prefer brand X, what
conclusion do we draw? Use a .05 level of significance.
2. Suppose that in the past 40% of all adults favored capital punishment. Do we
have reason to believe that the proportion of adults favoring capital punishment
today has increased if, in a random sample of 15 adults, 8 favor capital
punishment? Use a .05 level of significance.
3. A coin is tossed 20 times resulting in 5 heads. Is this sufficient evidence to reject
the hypothesis at the .03 level of significance that the coin is balanced in favor of
the alternative that heads occur less than 50% of the time?
4. It is believed that at least 60% of the residents in a certain area favor an
annexation suit by a neighboring city. What conclusion would you draw if only
110 in a sample of 200 voters favor the suit? Use a .04 level of significance.
5. The gas company claims that two-thirds of the houses in a certain city are heated
by natural gas. Do we have reason to doubt this claim if, in a random sample of
1000 houses in this city, it is found that 618 are heated by natural gas? Use a .02
level of significance.
6. At a certain college it is estimated that fewer than 25% of the students have cars
on campus. Does this seem to be a valid estimate if, in a random sample of 90
college students, 28 are found to have cars? Use a .05 level of significance.
7. In a study to estimate the proportion of residents in a certain city and its suburbs
who favor the construction of a nuclear power plant, it is found that 63 of 100
urban residents favor the construction while only 59 of 125 suburban residents are
in favor. Is there a significant difference between the proportion of urban and
suburban residents who favor construction of the nuclear plant? Use a .04 level of
significance.
8. A cigarette-manufacturing firm distributes two brands of cigarettes. If it is found
that 56 of 200 smokers prefer brand A and 29 of 150 smokers prefer brand B, can
we conclude at the .06 level of significance that brand A outsells brand B?
9. A geneticist is interested in the proportion of males and females in the population
that have a certain minor blood disorder. In a random sample of 100 males, 31
are found to be afflicted, whereas 24 of the 100 females tested appear to have the
disorder. Can we conclude at the .01 level of significance that the proportion of
men in the population afflicted with this blood disorder is significantly higher
than the proportion of women afflicted?
10. A study is made to determine if a cold climate results in more absenteeism
from school during a semester than a warmer climate. Two groups of students are
selected at random, one group from Maine and the other from Alabama. Of the
300 students from Maine, 72 were absent at least 1 day during the semester, and
of the 400 students from Alabama, 70 were absent 1 or more days. Can we
conclude that a colder climate results in a greater number of students being absent
from school at least 1 day during the semester? Use a .05 level of significance.
11. A vote is to be taken among the residents of a town and the surrounding country
to determine whether a civic center will be constructed. The proposed
construction site is within the town limits and for this reason many voters in the
country feel that the proposal will pass because of the large proportion of town
voters who favor the construction. If 120 of 200 town voters favor the proposal
and 240 of 500 country residents favor it, test the hypothesis that the percentage
of town voters favoring the construction of a civic center will not exceed the
percentage of country voters by more than 3%. Use a .025 level of significance.
12. With reference to Exercise 8, test the hypothesis at the .06 level of significance
that brand A outsells brand B by at least 10%.
For the same data set, as α increases the size of the critical region also increases.
Consequently, if Ho is rejected at α-level of significance then Ho will also be
rejected at a higher level of significance using the same data. For example, if Ho
is rejected at α = 0.05 then testing at α = 0.1 will also lead to the rejection of Ho.
However, Ho will not necessarily be rejected at α = 0.01.
An alternative way to report the results of the test is to compute the p-value. The
p-value is the smallest value of α for which Ho will be rejected based on sample
information. Reporting the p-value will allow the reader of the published research
to evaluate the extent to which the data disagree with Ho. In particular, it enables
each reader to choose their personal value of α.
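The relationship between the p-value and α can be illustrated with a small sketch. The observed statistic `z_observed` = 2.04 is a hypothetical value for a right-tailed z-test, chosen only for illustration. The rule applied is "reject Ho whenever the p-value is at most α".

```python
from math import sqrt, erf

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

z_observed = 2.04              # hypothetical observed z statistic (right-tailed test)
p_value = 1 - normal_cdf(z_observed)

# The same data are judged against several levels of significance.
decisions = {alpha: p_value <= alpha for alpha in (0.01, 0.05, 0.10)}

print(f"p-value = {p_value:.4f}")
for alpha, reject in decisions.items():
    print(f"alpha = {alpha:.2f}:", "Reject Ho" if reject else "Do not reject Ho")
```

The p-value here is about .02, so Ho is rejected at α = .05 and at α = .10 but not at α = .01, in line with the remark that rejection at one level carries over to every higher level but not necessarily to a lower one.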
In 1788, James Madison, John Jay, and Alexander Hamilton anonymously published a
series of essays entitled The Federalist. These Federalist papers were an attempt to
convince the people of New York that they should ratify the Constitution. In the course
of history, the authorship of these papers became known, but 12 remained contested.
Through the use of statistical analysis, and particularly of the frequency of use of
various words, we can now conclude that Madison is the likely author of the 12 papers.
In fact, the statistical evidence that he is the author is overwhelming.
• The above tests are exact α-level tests for samples from a normal distribution.
However, the first test provides a good approximate α-level test when the distribution
is not normal provided that the sample size is large enough, that is, n>30. See
Theorems 3 and 4 of Chapter 7.
• If σ² is unknown and n>30, use the z-test but replace σ by s, that is,
$Z = \dfrac{\bar{X} - \mu_o}{s / \sqrt{n}}$
• Treat Ho: µ ≥ µo vs. Ha: µ < µo as Ho: µ = µo vs. Ha: µ < µo, and treat
Ho: µ ≤ µo vs. Ha: µ > µo as Ho: µ = µo vs. Ha: µ > µo.
Examples:
1. A manufacturer of sports equipment has developed a new synthetic fishing line
that he claims has a mean breaking strength of 8 kilograms with a standard
deviation of .5 kilogram. Test the hypothesis that µ = 8 kilograms if a random
sample of 50 lines is tested and found to have a mean breaking strength of 7.8
kilograms. Use a .01 level of significance.
2. A random sample of 100 recorded deaths during the past year showed an average
life span of 71.8 years, with a standard deviation of 8.9 years. Does this seem to
indicate that the average life span today is greater than 70 years? Use a .05 level
of significance.
3. The average length of time for students to register at a certain college has been 50
minutes with a standard deviation of 10 minutes. A new registration procedure
using modern computing machines is being tried. If a random sample of 12
students had an average registration time of 42 minutes with a standard deviation
of 11.9 minutes under the new system, test the hypothesis that the population
mean is now less than 50 minutes, using a level of significance of .10, .05 and .01.
Assume the population of times to be normal.
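The test statistics for the three examples above all have the form (x̄ - µo) / (sd / sqrt(n)), so they can be sketched with a single helper. The function name `standardized_mean` is hypothetical. The critical values are standard normal and t table values, and treating Example 3 as a t-test with the sample standard deviation and 11 degrees of freedom is an assumption of this sketch.

```python
from math import sqrt

def standardized_mean(xbar, mu0, sd, n):
    """(xbar - mu0) / (sd / sqrt(n)); read as z or t depending on the case."""
    return (xbar - mu0) / (sd / sqrt(n))

# Example 1: sigma known, two-tailed z-test at alpha = .01 (critical value 2.576)
z1 = standardized_mean(7.8, 8, 0.5, 50)
reject1 = abs(z1) > 2.576

# Example 2: sigma unknown but n > 30, right-tailed z-test at alpha = .05 (1.645)
z2 = standardized_mean(71.8, 70, 8.9, 100)
reject2 = z2 > 1.645

# Example 3: small sample from a normal population, left-tailed t-test, 11 d.f.
t3 = standardized_mean(42, 50, 11.9, 12)
t_table = {0.10: 1.363, 0.05: 1.796, 0.01: 2.718}   # t critical values, 11 d.f.
reject3 = {alpha: t3 < -crit for alpha, crit in t_table.items()}

print(f"Example 1: z = {z1:.3f}, reject Ho: {reject1}")
print(f"Example 2: z = {z2:.3f}, reject Ho: {reject2}")
print(f"Example 3: t = {t3:.3f}, reject Ho by alpha: {reject3}")
```

Under these assumptions Examples 1 and 2 both lead to rejection of Ho, while Example 3 leads to rejection at the .10 and .05 levels but not at the .01 level.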
Exercises: pp. 315-316 of Walpole nos.1-8
1. An electrical firm manufactures light bulbs that have a length of life that is
approximately normally distributed with a mean of 800 hours and a standard
deviation of 40 hours. Test the hypothesis that µ = 800 hours against the
alternative µ ≠ 800 hours if a random sample of 30 bulbs has an average life of
788 hours. Use a .04 level of significance.
2. In a research report by Richard H. Weindruch of the UCLA Medical School, it is
claimed that mice with an average lifespan of 32 months will live to be about 40
months old when 40% of the calories in their food are replaced by vitamins and
minerals. Is there any reason to believe that µ < 40 if 64 mice that are placed on
this diet have an average life of 38 months with a standard deviation of 5.8
months? Use a .025 level of significance.
3. The average height of females in the freshman class of a certain college has been
162.5 cm. with a standard deviation of 6.9 cm. Is there reason to believe that
there has been a change in the average height if a random sample of 50 females in
the present freshman class has an average height of 165.2 cm.? Use a .02 level of
significance.
4. It is claimed that an automobile is driven on the average less than 25,000
kilometers per year. To test this claim, a random sample of 100 automobile
owners is asked to keep a record of the kilometers they travel. Would you agree
with this claim if the random sample showed an average of 23,500 kilometers and
a standard deviation of 3,900 kilometers? Use a 0.01 level of significance.
5. Test the hypothesis that the average content of containers of a particular lubricant
is 10 liters if the contents of a random sample of 10 containers are 10.2, 9.7, 10.1,
10.3, 10.1, 9.8, 9.9, 10.4, 10.3, and 9.8 liters. Use a .01 level of significance and
assume that the distribution of contents is normal.
6. According to Dietary Goals for the United States (1977), high sodium intake may
be related to ulcers, stomach cancer, and migraine headaches. The human
requirement for salt is only 230 milligrams per day, which is surpassed in most
single servings of ready-to-eat cereals. A random sample of 20 similar servings of
Special K had mean sodium content of 244 milligrams of sodium and a standard
deviation of 24.5 milligrams. Is there sufficient evidence to believe that the
average sodium content for single servings of Special K exceeds the human
requirement for salt at α = .05? Assume normality.
7. A random sample of 8 cigarettes of a certain brand has an average nicotine
content of 4.2 mg. and a standard deviation of 1.4 mg. Is this in line with the
manufacturer’s claim that the average nicotine content does not exceed 3.5 mg.?
Use a .01 level of significance and assume the distribution of nicotine contents to
be normal.
8. Last year the employees of a city sanitation department donated an average of
$8.00 to the volunteer rescue squad. Test the hypothesis at the .01 level of
significance that the average contribution this year is still $8.00 if a random
sample of 12 employees showed an average donation of $8.90 with a standard
deviation of $1.75. Assume the donations are approximately normally
distributed.
9.5 TESTING THE DIFFERENCE BETWEEN TWO POP’N MEANS
• Based on 2 independent samples
Ho Test Statistic Ha Critical region
a. $\sigma_1^2$ and $\sigma_2^2$ known
Remark The remarks made on confidence interval estimation for the difference
between means relative to the use of a given statistic apply to the tests
described here. See Theorem 5 of Chapter 7.
If $\sigma_1^2$ and $\sigma_2^2$ are unknown but n1, n2 > 30, use the z-test described in (a)
but replace the population standard deviations by the sample standard
deviations, that is,
$z = \dfrac{(\bar{X}_1 - \bar{X}_2) - d_o}{\sqrt{(S_1^2 / n_1) + (S_2^2 / n_2)}}$
If $\sigma_1^2$ and $\sigma_2^2$ are unknown and n1 = n2 ≤ 30, use (b).
Testing the Difference Between two Population Means Based on two Related Samples
Example: A taxi company is trying to decide whether the use of radial tires instead of
regular belted tires improves fuel economy. Twelve cars were equipped with
radial tires and driven over a prescribed test course. Without changing drivers,
the same cars were then equipped with regular belted tires and driven once again
over the test course. The gasoline consumption, in kilometers per liter, was
recorded as follows:
Exercises: pp. 317-318 of Walpole nos. 10-19
10. A random sample of size n1 = 25, taken from a normal population with standard
deviation σ1 = 5.2, has a mean $\bar{x}_1$ = 81. A second random sample of size n2 = 36,
taken from a different normal population with a standard deviation σ2 = 3.4, has a
mean $\bar{x}_2$ = 76. Test the hypothesis at the .06 level of significance that µ1 = µ2.
11. A manufacturer claims that the average tensile strength of thread A exceeds the
average tensile strength of thread B by at least 12 kg. To test this claim, 50 pieces
of each type of thread are tested under similar conditions. Type A thread had an
average tensile strength of 86.7 kg. with a standard deviation of 6.28 kg., while
type B thread had an average tensile strength of 77.8 kg. with a standard deviation
of 5.61 kg. Test the manufacturer’s claim using a .05 level of significance.
12. A study was made to estimate the difference in salaries of college professors in
the private and state colleges of Virginia. A random sample of 100 professors in
private colleges showed an average 9-month salary of $26,000 with a standard
deviation of $1300. A random sample of 200 professors in state colleges showed
an average salary of $26,900 with a standard deviation of $1400. Test the
hypothesis that the average salary for professors teaching in state colleges does
not exceed the average salary for professors teaching in private colleges by more
than $500. Use a .02 level of significance.
13. Given two random samples of size n1 = 11 and n2 = 14, from two independent
normal populations, with $\bar{x}_1$ = 75, $\bar{x}_2$ = 60, s1 = 6.1, and s2 = 5.3, test the
hypothesis at the .05 level of significance that µ1 = µ2. Assume the population
variances are equal.
14. A study is made to see if increasing the substrate concentration has an appreciable
effect on the velocity of a chemical reaction. With the substrate concentration of
1.5 moles per liter, the reaction was run 15 times with an average velocity of 7.5
micromoles per 30 min. and a standard deviation of 1.5. With a substrate
concentration of 2.0 moles per liter, 12 runs were made yielding an average
velocity of 8.8 micromoles per 30 min. and a standard deviation of 1.2. Would
you say that the increase in substrate concentration increases the mean velocity by
more than .5 micromole per 30 min.? Use a .01 level of significance and assume
the populations to be approximately normally distributed with equal variances.
15. A study was made to determine if the subject matter in a physics course is better
understood when a lab constitutes part of the course. Students were allowed to
choose between a 3-unit course without lab and a 4-unit course with lab. In the
section with lab, a sample of 11 students had an average grade of 85 with a
standard deviation of 4.7, and in the section without lab, a sample of 17 students
had an average grade of 79 with a standard deviation of 6.1. Would you say that
the laboratory course increases the average grade by more than 5 points? Use a
0.01 level of significance and assume the populations to be approximately
normally distributed with equal variances.
16. A large automobile manufacturing company is trying to decide whether to
purchase brand A or brand B tires for its new models. To help arrive at a
decision, an experiment is conducted using 12 of each brand. The tires are run
until they wear out. The results are $\bar{x}_A$ = 37,900 km., sA = 5100 km., $\bar{x}_B$ = 39,800
km., and sB = 5900 km. Test the hypothesis at the .05 level of significance that
there is no difference in the two brands of tires. Assume the populations to be
approximately normally distributed.
17. The following data represent the running time of films produced by two motion
picture companies:
Time (minutes)
Company 1 103 94 110 87 98
Company 2 97 82 123 92 175 88 118
Test the hypothesis that the average running time of films produced by company 2
exceeds the average running time of films produced by company 1 by 10 minutes
against the one-sided alternative that the difference is more than 10 minutes. Use
a 0.1 level of significance and assume the distributions of times to be
approximately normal with unequal variances.
                               Company 2      Company 1
Mean                           110.7142857    98.4
Variance                       1035.904762    76.3
Observations                   7              5
Hypothesized Mean Difference   10
Df                             7
t Stat                         0.181131997
P(T<=t) one-tail               0.430698476
t Critical one-tail            1.414923928
P(T<=t) two-tail               0.861396953
t Critical two-tail            1.894578604
18. In Exercise 21 on p. 265, test the hypothesis, at the .05 level of significance, that
the average yields of the two varieties of wheat are equal.
University
1 2 3 4 5 6 7 8 9
Variety 1 38 23 35 41 44 29 37 31 38
Variety 2 45 25 31 38 50 33 36 40 43
19. In Exercise 22 on p. 265, test the hypothesis, at the .01 level of significance, that
µ1 ≥ µ2.
Taxi Brand A Brand B
1 34,400 36,700
2 45,500 46,800
3 36,700 37,700
4 32,000 31,100
5 48,400 47,800
6 32,800 36,400
7 38,100 38,900
8 30,100 31,500
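The Excel output listed with Exercise 17 can be reproduced from the raw running times. The sketch below assumes the unequal-variance (Welch) form of the two-sample t statistic with the Welch-Satterthwaite approximation for the degrees of freedom, which is consistent with the figures in that output. The variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

company1 = [103, 94, 110, 87, 98]
company2 = [97, 82, 123, 92, 175, 88, 118]
d0 = 10                        # hypothesized difference mu2 - mu1

m1, v1, n1 = mean(company1), variance(company1), len(company1)
m2, v2, n2 = mean(company2), variance(company2), len(company2)

# Estimated variances of the two sample means
se_sq1, se_sq2 = v1 / n1, v2 / n2
t_stat = (m2 - m1 - d0) / sqrt(se_sq1 + se_sq2)

# Welch-Satterthwaite approximation to the degrees of freedom
df = (se_sq1 + se_sq2) ** 2 / (se_sq1 ** 2 / (n1 - 1) + se_sq2 ** 2 / (n2 - 1))

print(f"means: {m2:.4f} and {m1:.4f}; variances: {v2:.4f} and {v1:.4f}")
print(f"t = {t_stat:.6f}, approximate d.f. = {df:.2f}")
```

The computed t statistic is about 0.181 and the approximate degrees of freedom round to 7, matching the output. Since 0.181 is well below the one-tail critical value 1.4149 shown there, Ho would not be rejected at the 0.1 level.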
Before the election of 1936, a contest between Democratic incumbent Franklin Roosevelt
and Republican Alf Landon, the magazine Literary Digest had been extremely successful
in predicting the results in the US presidential elections. But 1936 turned out to be its
downfall, when it predicted a victory for Landon. To add insult to injury, young pollster
George Gallup, who had just founded the American Institute of Public Opinion in 1935,
correctly predicted Roosevelt as the winner of the election. He did this before they even
conducted their poll! And Gallup surveyed only 50,000 people, while the Literary Digest
sent questionnaires to 10 million people.
The Literary Digest made two classic mistakes. First, the lists of people to whom they
mailed the 10 million questionnaires were taken from magazine subscribers, car owners,
telephone directories, and lists of registered voters. In 1936, those who owned telephones
or cars, or subscribed to magazines, were more likely to be wealthy individuals who were
not happy with the Democratic incumbent. This is a clear case of under-coverage.
Despite what accounts of this famous story conclude, the bias produced by the more
affluent list was not likely to have been as severe as the second problem. The main
problem was volunteer response. They received 2.3 million responses, a response rate of
only 23%. Those who felt strongly about the outcome of the election were more likely to
respond and that included a majority of those who wanted a change, the Landon
supporters. Those who were happy with the incumbent were less likely to bother to
respond.
Gallup, on the other hand, knew the value of random sampling. He was not only able to
predict the election but he also predicted what the results of the Literary Digest poll
would be to within 1%. How did he do this? He just chose 3,000 people at random from
the same lists the Digest was going to use, and mailed them all a postcard asking them
how they planned to vote.
9.6. TEST FOR INDEPENDENCE
The test for independence is used to determine whether two variables are related or not.
For example, we might test whether a person’s music preference is related to his
intelligence quotient. We then take a random sample and for each subject determine their
music preference and classify their IQ’s into different categories (high, medium, low).
The observed frequencies are presented in what is known as a contingency table shown
below:
Music IQ
Preference High Medium Low Total
Classical 40 26 17 83
Pop 47 59 25 131
Rock 83 104 79 266
Total 170 189 121 480
A contingency table containing r rows and c columns is referred to as an rxc table. The
row and column totals are called marginal frequencies. Note that in a test for
independence, these marginal frequencies are not fixed in advance but depend instead on
the way the sample distributed itself across the various cells in the table.
Procedure:
1. State the null and alternative hypothesis.
Ho: The two variables are independent
Ha: The two variables are not independent.
2. Choose the level of significance.
3. Compute the test statistic, given by
$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$
where Oij = observed number of cases in the ith row of the jth column
Eij = expected number of cases under Ho
    = (ith row total)(jth column total) / grand total
4. Decision Rule: Reject Ho if $\chi^2 > \chi^2_{\alpha,(r-1)(c-1)}$.
Remarks:
1. The test is valid if at least 80% of the cells have expected frequencies of at least 5
and no cell has an expected frequency ≤ 1.
2. If many expected frequencies are very small, researchers commonly combine
categories of variables to obtain a table having larger cell frequencies. Generally,
one should not pool categories unless there is a natural way to combine them.
3. For a 2x2 contingency table, a correction called Yates’ correction for continuity is
applied: $\chi^2(\text{corrected}) = \sum_{i=1}^{2} \sum_{j=1}^{2} \dfrac{(|O_{ij} - E_{ij}| - 0.5)^2}{E_{ij}}$
Music IQ
Preference High Medium Low Total
Classical 40 (29.4) 26 (32.7) 17 (20.9) 83
Pop 47 (46.4) 59 (51.6) 25 (33.0) 131
Rock 83 (94.2) 104 (104.7) 79 (67.1) 266
Total 170 189 121 480
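The expected frequencies shown in parentheses in the table above, and the resulting test statistic, can be computed directly from the observed counts. This is a minimal sketch of the procedure. The critical value 9.488 is the chi-square table value for α = .05 with 4 degrees of freedom, and the choice of α = .05 is an assumption since the notes do not state a level for this example.

```python
observed = [
    [40, 26, 17],     # Classical
    [47, 59, 25],     # Pop
    [83, 104, 79],    # Rock
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# E_ij = (ith row total)(jth column total) / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)
chi_crit = 9.488               # chi-square critical value, alpha = .05, 4 d.f.

for row in expected:
    print([round(e, 1) for e in row])
print(f"chi-square = {chi_square:.2f} with {df} d.f.")
print("Reject Ho" if chi_square > chi_crit else "Do not reject Ho")
```

The statistic comes out at roughly 12.4 on 4 degrees of freedom, which exceeds 9.488, so under the assumed α the hypothesis of independence between music preference and IQ category would be rejected.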
$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$
Music IQ
Preference High Medium Low
Classical 40/83 = .48 26/83 = .31 17/83 = .20
Pop 47/131 = .36 59/131 = .45 25/131 = .19
Rock 83/266 = .31 104/266 = .39 79/266 = .30
Music IQ
Preference High Medium Low
Classical 40/170 = .24 26/189 = .14 17/121 = .14
Pop 47/170 = .28 59/189 = .31 25/121 = .21
Rock 83/170 = .49 104/189 = .55 79/121 = .65
$P(\chi^2_v > \chi^2_\alpha) = \alpha$
X. Linear Regression and Correlation
Student    X     Y
1          39    65
2          43    78
3          21    52
4          64    82
5          57    92
6          47    89
7          28    73
8          75    98
9          34    56
10         52    75
[Scatter plot of Y against X]
Thus, we assume that for any given value of X the observed value of Y varies in a
random manner and possesses a probability distribution with mean E(Y | X).
Assumptions for the Probabilistic Model: For any given value of X, Y possesses a
normal distribution with a mean value $E(Y \mid X) = \beta_0 + \beta_1 X$ and with a variance of $\sigma^2$.
Furthermore, any one value of Y is independent of every other value.
Least squares criterion: Choose as the “best-fitting” line the line that minimizes the
sum of squares for error $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$.
The method for finding the numerical values of $\hat\beta_0$ and $\hat\beta_1$ that minimize SSE uses
differential calculus and is beyond the scope of this course.
$\hat\beta_1 = \dfrac{S_{xy}}{S_{xx}}$ where $S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$
For now, let us use the following EXCEL output to find the least squares prediction line
for the calculus grade-achievement test score data and predict a student’s calculus grade
if the student scored X = 50 on the achievement test.
               Coefficients    Standard Error    t Stat         P-value
Intercept      40.78415521     8.506861379       4.794265875    0.00136551
X Variable 1   0.765561843     0.174984967       4.375014926    0.002364532
• The best-fitting straight line relating the calculus grade to the achievement test
score is $\hat{Y} = \hat\beta_0 + \hat\beta_1 X$ or $\hat{Y}$ = 40.78415521 + .765561843X.
• .765561843 is the estimated change in Y for a 1-unit change in X.
• The Y intercept will not be interpreted since X = 0 is not part of the range of X.
• If a student scores X = 50 on the achievement test, his or her predicted calculus
grade would be $\hat{Y} = \hat\beta_0 + \hat\beta_1 X$ = 40.78415521 + .765561843(50) = 79.06225.
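The coefficients in the EXCEL output can also be obtained directly from the ten (X, Y) pairs using the least squares formulas. The following sketch uses the student data from the start of this chapter. The names `b0`, `b1` and `predict` are illustrative stand-ins for the intercept estimate, the slope estimate and the prediction line.

```python
x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]   # achievement test scores
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]   # calculus grades
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = s_xy / s_xx               # slope estimate
b0 = y_bar - b1 * x_bar        # intercept estimate

def predict(x_new):
    """Predicted calculus grade for a given achievement test score."""
    return b0 + b1 * x_new

print(f"intercept = {b0:.8f}, slope = {b1:.9f}")
print(f"predicted grade at X = 50: {predict(50):.5f}")
```

The prediction is only meaningful for scores within the observed range of X (21 to 75 in this sample), which is the same reason the intercept is not interpreted.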
10.3 Inferences
The third parameter in our linear probabilistic model is $\sigma^2$ and its estimator is
$\hat\sigma^2 = MSE = \dfrac{SSE}{n-2}$
where MSE stands for mean squared error.
In the following EXCEL output, MSE = 75.75323363 can be found in the second row,
fourth column while SSE = 606.025869 can be found in the same row, third column.
ANOVA
              df    SS             MS             F             Significance F
Regression    1     1449.974131    1449.974131    19.1407556    0.002364532
Residual      8     606.025869     75.75323363
Total         9     2056
Does X contribute information for the prediction of Y; i.e., do the data provide sufficient
evidence to indicate that Y increases (or decreases) linearly as X increases over the region
of observation? We would wish to test Ho: $\beta_1 = 0$ vs. Ha: $\beta_1 \neq 0$. The test statistic is
$t = \dfrac{\hat\beta_1}{\sqrt{MSE / S_{xx}}}$ where $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$.
               Coefficients    Standard Error    t Stat         P-value        Lower 95%      Upper 95%
Intercept      40.78415521     8.506861379       4.794265875    0.00136551     21.16729771    60.40101272
X Variable 1   0.765561843     0.174984967       4.375014926    0.002364532    0.362045786    1.169077901
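The slope row of the output above can be checked by hand. The sketch below computes MSE, the standard error of the slope, the t statistic for Ho: β1 = 0, and a 95% interval for the slope. The critical value 2.306 is the t table value for 8 degrees of freedom with .025 in each tail. It is rounded, so the interval limits agree with the EXCEL figures only to about four decimal places.

```python
from math import sqrt

x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]   # achievement test scores
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]   # calculus grades
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = s_xy / s_xx                   # slope estimate
sse = s_yy - s_xy ** 2 / s_xx      # sum of squares for error
mse = sse / (n - 2)                # estimate of sigma squared
se_b1 = sqrt(mse / s_xx)           # standard error of the slope

t_stat = b1 / se_b1                # test statistic for Ho: beta1 = 0
t_crit = 2.306                     # t table value, 8 d.f., .025 in each tail
lower, upper = b1 - t_crit * se_b1, b1 + t_crit * se_b1

print(f"MSE = {mse:.6f}, SE(slope) = {se_b1:.6f}, t = {t_stat:.6f}")
print(f"95% interval for the slope: ({lower:.4f}, {upper:.4f})")
```

Since |t| is about 4.375, well above 2.306, Ho: β1 = 0 is rejected at the .05 level. Equivalently, the 95% interval for the slope does not contain 0.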
10.4 Multiple Regression Models
$r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$ where $S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$
In the calculus grade-achievement test score data, $r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \dfrac{1894}{\sqrt{(2474)(2056)}}$ =
.839786. The EXCEL output follows.
Column 1 Column 2
Column 1 1
Column 2 0.839786 1
The denominators used in calculating $r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$ and $\hat\beta_1 = \dfrac{S_{xy}}{S_{xx}}$ will always be
positive. Since the numerators are identical, r and $\hat\beta_1$ will assume the same sign.
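The correlation coefficient for the calculus grade-achievement test score data, and its agreement in sign with the slope, can be verified with a few lines. This is a minimal sketch. The variable names are illustrative and the data are the ten pairs used throughout this chapter.

```python
from math import sqrt

x = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]   # achievement test scores
y = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]   # calculus grades
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r = s_xy / sqrt(s_xx * s_yy)       # correlation coefficient
b1 = s_xy / s_xx                   # slope estimate, same numerator as r

print(f"r = {r:.6f}, slope = {b1:.6f}")
print("same sign:", (r > 0) == (b1 > 0))
```

Both quantities are positive here because S_xy = 1894 is positive, which illustrates the remark above about r and the slope sharing a sign.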
[Scatter plots illustrating r = 1 and -1 < r < 0]
Computations
x      y      x-x̄    y-ȳ    (x-x̄)(y-ȳ)   xy       (x-x̄)²   (y-ȳ)²   x²       y²
39     65     -7     -11    77           2535     49       121      1521     4225
43     78     -3     2      -6           3354     9        4        1849     6084
21     52     -25    -24    600          1092     625      576      441      2704
64     82     18     6      108          5248     324      36       4096     6724
57     92     11     16     176          5244     121      256      3249     8464
47     89     1      13     13           4183     1        169      2209     7921
28     73     -18    -3     54           2044     324      9        784      5329
75     98     29     22     638          7350     841      484      5625     9604
34     56     -12    -20    240          1904     144      400      1156     3136
52     75     6      -1     -6           3900     36       1        2704     5625
460    760                  1894         36854    2474     2056     23634    59816
n = 10
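$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = 1894$
$S_{xx} = \sum x_i^2 - \dfrac{(\sum x_i)^2}{n} = 23634 - \dfrac{460^2}{10} = 2474$
$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = 2474$
$S_{yy} = \sum y_i^2 - \dfrac{(\sum y_i)^2}{n} = 59816 - \dfrac{760^2}{10} = 2056$
$S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 2056$
$\hat\beta_1 = \dfrac{S_{xy}}{S_{xx}} = \dfrac{1894}{2474} = 0.765561843$
$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = \dfrac{760}{10} - .765561843 \left( \dfrac{460}{10} \right) = 40.78415521$
$SSE = S_{yy} - \dfrac{S_{xy}^2}{S_{xx}} = 2056 - \dfrac{1894^2}{2474} = 606.025869$
$\hat\sigma^2 = s^2 = MSE = \dfrac{SSE}{n-2} = \dfrac{606.025869}{10-2} = 75.75323363$
Standard error of $\hat\beta_1 = \sqrt{\dfrac{s^2}{S_{xx}}} = \sqrt{\dfrac{75.75323363}{2474}} = 0.174984967$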
The statistical use of the term regression dates back to Francis Galton, who studied
heredity in the late 1800s. One of Galton’s interests was whether or not a man’s height as
an adult could be predicted by his parents’ heights. He discovered that it could, but the
relationship was such that very tall parents tended to have children who were shorter than
they were and very short parents tended to have children who were taller than
themselves. He initially described this phenomenon by saying that there was “reversion to
mediocrity” but later changed the terminology to “regression to mediocrity.” Thereafter,
the technique of determining such relationships was called regression.
Regression Diagnostics
^
Let the residual d i = Yi − Yi be regarded as the observed error, as opposed to the unknown
and unobservable error, . If the model is appropriate for the data at hand, then d should
reflect the properties assumed for .
For any given value of X, Y has mean value E (Y | X ) = 0 + 1 X .Or, equivalently, E()
= 0. Visually, the scatter plot of Y on X can give us sufficient information on the
linearity of the relationship. Numerically, one can use the correlation coefficient r or the
coefficient of determination r2.
For any given value of X, Y (or, equivalently, $\varepsilon$) has variance $\sigma^2$. Visually, a residual
plot of d versus X should form a horizontal band. Numerically, one can use Cook and
Weisberg’s method. As an ad hoc alternative, divide the dataset into the group of low X
values and the group of high X values. Reject Ho: $\sigma_1^2 = \sigma_2^2$ if
\[
F = \frac{\text{Variance of } d \text{ coming from low X values}}{\text{Variance of } d \text{ coming from high X values}} < F_{1-\alpha/2,\,n_1-1,\,n_2-1} \quad \text{or} \quad F > F_{\alpha/2,\,n_1-1,\,n_2-1}.
\]
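The ad hoc variance-ratio check can be sketched in Python as follows. The data are hypothetical, the split at the middle of the X-ordered observations is an assumption (the notes do not say where to divide the dataset), and the critical values are read from an F table rather than computed.

```python
from statistics import variance


def split_variance_ratio(x, residuals):
    """Return (F, n_low, n_high), F = var(low-X residuals) / var(high-X residuals).

    Splitting at the middle of the X-ordered data is an assumption.
    """
    ordered = [d for _, d in sorted(zip(x, residuals))]
    half = len(ordered) // 2
    low, high = ordered[:half], ordered[half:]
    return variance(low) / variance(high), len(low), len(high)


def reject_equal_variances(f_stat, f_lower, f_upper):
    """Two-sided rule: reject Ho if F falls below the lower or above the upper value."""
    return f_stat < f_lower or f_stat > f_upper


# Hypothetical X values and residuals whose spread grows with X.
x = [1, 2, 3, 4, 5, 6, 7, 8]
d = [0.10, -0.20, 0.15, -0.10, 1.00, -1.20, 0.90, -0.80]

f_stat, n_low, n_high = split_variance_ratio(x, d)

# F table, alpha = .05 two-sided, (3, 3) degrees of freedom.
f_upper = 15.44
f_lower = 1 / f_upper        # lower critical value when both df are equal
decision = reject_equal_variances(f_stat, f_lower, f_upper)
print(f"F = {f_stat:.4f}, reject equal variances: {decision}")
```

With these made-up residuals the ratio falls below the lower critical value, so the constant-variance assumption would be questioned.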
For any given value of X, Y (or, equivalently, $\varepsilon$) possesses a normal distribution. One
can use a goodness-of-fit test on the residuals. Alternatives are the Shapiro-Wilk statistic
that is available in SAS and the Kolmogorov-Smirnov statistic in SPSS.
Any one value of Y (or, equivalently, $\varepsilon$) is independent of every other value. This
assumption is immediately satisfied if the data were randomly selected from the
population.
As in previous topics, another issue is the detection of outliers and/or influential cases. For
multiple regression, one may also have a problem with collinearity, that is, correlated X’s.
XI. The Analysis of Variance
11.1 Introduction
Ho: $\mu_1 = \mu_2 = \cdots = \mu_k$ vs.
Ha: at least one pair of means differ.
For example, is it true that the best students sit in the front of the classroom, or is that a
false stereotype? In surveys done in two statistics classes at the University of California
at Davis, students reported their GPAs and also answered the question, “Where do you
typically sit in the classroom (front, middle, back)?” In all, 384 students gave valid
responses to both questions, and among these students, 88 said that they typically sit in
the front, 218 said they typically sit in the middle, and 78 said they typically sit in the
back.
The sample means are Front = 3.2029, Middle = 2.9853, Back = 2.9194. The p-value of
the F-test is .001; we can reject the null hypothesis. Further analysis indicates a
significant difference between the mean GPA for the front-row sitters and the mean GPA
for the other students.
Suppose that you want to compare the mean size of health insurance claims submitted by
five groups of policy holders. Ten claims were randomly selected from among the claims
for each group. Do the data contained in the five samples provide sufficient evidence to
indicate a difference in the mean levels of claims among the five health groups? We look
for a single test of Ho: $\mu_1 = \mu_2 = \cdots = \mu_5$ vs. Ha: at least one pair of means differ. We
assume that the observations within each sample population are normally distributed with
a common variance $\sigma^2$.
The analysis of experimental data depends on the design of the experiment, which refers
to the way the data were collected. A very useful and relatively simple design called the
completely randomized design is one in which random samples are independently
selected from each of k populations. This design results in observations that are
classified only according to the population from which they came. For example, in
assessing voter preference concerning the next city/municipality election, we may wish to
select random samples of registered voters in each of k barangays within the
city/municipality.
\[
\text{Total SS} = S_{xx} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2
\]
can be partitioned into sum of squares for treatments/groups (SST, a measure of variation
among sample means) and sum of squares for error (SSE, a measure of variation within
samples). Thus, Total SS = SST + SSE.
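A brief sketch of why the partition holds, writing $\bar{x}_i$ for the mean of sample i and $\bar{x}$ for the grand mean. The derivation is a standard one and is offered here as a supplement, not as part of the original notes.

```latex
\begin{align*}
\sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2
  &= \sum_{i=1}^{k}\sum_{j=1}^{n_i}
     \bigl[(x_{ij} - \bar{x}_i) + (\bar{x}_i - \bar{x})\bigr]^2 \\
  &= \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2}_{SSE}
   + \underbrace{\sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2}_{SST}
   + 2\sum_{i=1}^{k} (\bar{x}_i - \bar{x}) \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)
\end{align*}
```

The last term is zero because the deviations within each sample sum to zero, $\sum_j (x_{ij} - \bar{x}_i) = 0$, which leaves Total SS = SST + SSE.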
For now we will be guided by EXCEL outputs. Total SS = 84153818.88 can be found in
the second column, last row, SSE = 77411264.4 in the same column, second row, SST =
6742554.48 in the same column, first row.
ANOVA
Source of Variation   SS            df   MS           F             P-value       F crit
Between Groups        6742554.48     4   1685638.62   0.979879847   0.428070522   2.578739184
Within Groups         77411264.4    45   1720250.32
Total                 84153818.88   49
The third column gives the degrees of freedom of each sum of squares. For Total SS,
49 = n - 1, for SSE, 45 = n - k, and for SST, 4 = k - 1.
The fourth column is the mean squares column, calculated by dividing each sum of squares
by its degrees of freedom. So the mean square for treatments is
\[
MST = \frac{SST}{k-1} = \frac{6742554.48}{5-1} = 1685638.62
\]
and the mean square error is
\[
MSE = \frac{SSE}{n-k} = \frac{77411264.4}{50-5} = 1720250.32.
\]
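A (1-α)100% CI for a single treatment mean µi is of the form
\[
\bar{x}_i \pm t_{\alpha/2,\,n-k} \sqrt{\frac{MSE}{n_i}}.
\]
For example, a 95% CI for µ4 is
\[
1674.7 \pm 1.96 \sqrt{\frac{1720250.32}{10}} = 1674.7 \pm 812.9276493 = (861.7723507,\ 2487.627649).
\]
Here 1.96 is the large-sample normal value; the tabled $t_{.025,45}$ is about 2.014 and gives a slightly wider interval.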
SUMMARY
Groups Count Sum Average Variance
Column 1 10 19598 1959.8 2751539.289
Column 2 10 19961 1996.1 1915389.878
Column 3 10 11939 1193.9 703792.7667
Column 4 10 16747 1674.7 1137055.789
Column 5 10 22789 2278.9 2093473.878
For example, a 95% CI for µ1 - µ3 is
\[
(1959.8 - 1193.9) \pm 1.96 \sqrt{1720250.32\left(\frac{1}{10} + \frac{1}{10}\right)} = 765.9 \pm 1149.653307 = (-383.753307,\ 1915.553307).
\]
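The two interval computations above can be reproduced with a small Python sketch. The helper names are hypothetical, and the critical value is passed in as an argument so that either 1.96 (as used in the notes) or a tabled t value can be supplied.

```python
import math


def ci_single_mean(mean, mse, n_i, crit):
    """Interval mean +/- crit * sqrt(MSE / n_i) for one treatment mean."""
    half_width = crit * math.sqrt(mse / n_i)
    return mean - half_width, mean + half_width


def ci_mean_difference(mean_i, mean_j, mse, n_i, n_j, crit):
    """Interval (mean_i - mean_j) +/- crit * sqrt(MSE * (1/n_i + 1/n_j))."""
    half_width = crit * math.sqrt(mse * (1 / n_i + 1 / n_j))
    diff = mean_i - mean_j
    return diff - half_width, diff + half_width


MSE = 1720250.32          # from the ANOVA table
CRIT = 1.96               # multiplier used in the notes

ci_mu4 = ci_single_mean(1674.7, MSE, 10, CRIT)
ci_mu1_mu3 = ci_mean_difference(1959.8, 1193.9, MSE, 10, 10, CRIT)
print(ci_mu4)
print(ci_mu1_mu3)
```

Because the interval for µ1 - µ3 contains zero, it is consistent with the F-test's failure to detect a difference among the group means.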
Computing Formulas
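Statistic      Population 1                Population 2                ...   Population k
Sample size    $n_1$                       $n_2$                       ...   $n_k$
Total          $T_1 = \sum_j x_{1j}$       $T_2 = \sum_j x_{2j}$       ...   $T_k = \sum_j x_{kj}$
Sample mean    $\bar{x}_1 = T_1/n_1$       $\bar{x}_2 = T_2/n_2$       ...   $\bar{x}_k = T_k/n_k$
\[
CM = \frac{T^2}{n}
\]
where T = total of all observations = $\sum_i \sum_j x_{ij} = \sum_i T_i$
\[
\text{Total SS} = S_{xx} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2 = \sum_i \sum_j x_{ij}^2 - CM
\]
\[
SST = \sum_i \frac{T_i^2}{n_i} - CM \qquad\qquad SSE = \text{Total SS} - SST
\]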
Consider the problem of assessing the effects of three different package designs on the
number or amount of sales. We might decide to use a completely randomized design and
select 12 supermarkets and display each of the designs in four different markets. Unless
the markets all had similar characteristics, differences in sales for the three package
designs might also reflect differences in the characteristics of the stores. One way to
avoid this problem is to use, say, four stores and display each of the three designs in all
four stores. This way store-to-store variability has been eliminated.
As another example, suppose the CEO of a large construction company employs three
experienced construction engineers to perform the time-consuming cost analyses,
estimates, and bids for the work on large construction projects. It is important to know
whether these three estimators tend to produce estimates at the same mean level or
whether one or another tends to always submit a high (or low) bid on projects. Each of
the three estimators would be required to produce an analysis, an estimate, and a bid
price for the same set of projects. In this way, differences in bids for the same projects
can be compared, thereby eliminating project-to-project variability.
Project
Estimator 1 2 3 4 5
1 3.52 4.71 3.89 5.21 4.14
2 3.39 4.79 3.82 4.93 3.96
3 3.64 4.92 4.19 5.10 4.20
An analysis of variance for a randomized block design partitions the total sum of squares
into three parts: SST (measures the variation among treatment means), SSB (measures
the variation among block means), and SSE (measures the variation of the differences
among the treatment observations within blocks). That is, Total SS = SST + SSB + SSE.
Using the following EXCEL output (see second column), Total SS = 5.09096, SST =
0.13456, SSB = 4.88896, and SSE = 0.06744.
Source of Variation   SS        df   MS        F             P-value       F crit
Rows                  0.13456    2   0.06728   7.981020166   0.012424095   4.458970108
Columns               4.88896    4   1.22224   144.9869514   1.6952E-07    3.837853355
Error                 0.06744    8   0.00843
Total                 5.09096   14
For the degrees of freedom (third column), SST has 2 = k-1 = 3-1 where k is the number
of treatments (estimators), SSB has 4 = b-1 = 5-1 where b is the number of blocks
(projects), SSE has 8 = n-b-k+1 = 15-5-3+1, and Total SS has 14 = n-1 = 15-1. The
fourth column is the mean square column. As in CRD, MS is SS/df so that
\[
MST = \frac{SST}{k-1}, \qquad MSB = \frac{SSB}{b-1}, \qquad MSE = \frac{SSE}{n-b-k+1}.
\]
All three MS are independent estimates of $\sigma^2$.
To test Ho: No differences among the k treatment means, we use
\[
F = \frac{MST}{MSE} = 7.981020166
\]
(see first row) with p-value of 0.012424095. Ho is rejected if $F > F_{k-1,\,n-b-k+1} = 4.458970108$ at α = .05.
To test Ho: No differences among the b block means, we use
\[
F = \frac{MSB}{MSE} = 144.9869514
\]
(second row) with p-value of 1.6952E-07. Ho is rejected if $F > F_{b-1,\,n-b-k+1} = 3.837853355$ at α = .05.
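The sums of squares and F ratios in the EXCEL output can be reproduced from the estimator-by-project table with the Python sketch below. It uses the correction-for-the-mean approach of the computing formulas; the function name and the returned keys are illustrative.

```python
# Bids from the table above: one row per estimator (treatment),
# one column per project (block).
bids = [
    [3.52, 4.71, 3.89, 5.21, 4.14],
    [3.39, 4.79, 3.82, 4.93, 3.96],
    [3.64, 4.92, 4.19, 5.10, 4.20],
]


def randomized_block_anova(table):
    """Sums of squares, mean squares and F ratios for a randomized block design."""
    k = len(table)            # number of treatments (rows)
    b = len(table[0])         # number of blocks (columns)
    n = k * b
    cm = sum(map(sum, table)) ** 2 / n                       # correction for the mean
    total_ss = sum(x * x for row in table for x in row) - cm
    sst = sum(sum(row) ** 2 for row in table) / b - cm        # treatments
    ssb = sum(sum(col) ** 2 for col in zip(*table)) / k - cm  # blocks
    sse = total_ss - sst - ssb
    mst, msb = sst / (k - 1), ssb / (b - 1)
    mse = sse / (n - b - k + 1)
    return {"TotalSS": total_ss, "SST": sst, "SSB": ssb, "SSE": sse,
            "MST": mst, "MSB": msb, "MSE": mse,
            "F_treatments": mst / mse, "F_blocks": msb / mse}


rbd = randomized_block_anova(bids)
for name, value in rbd.items():
    print(f"{name:>12} = {value:.6f}")
```

The p-values and critical values still have to come from an F table or statistical software; the sketch covers only the arithmetic of the partition.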
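A (1-α)100% CI for the difference µi - µj between two treatment means is of the form
\[
(\bar{x}_i - \bar{x}_j) \pm t_{\alpha/2,\,n-b-k+1} \sqrt{MSE\left(\frac{2}{b}\right)}.
\]
For example, a 95% CI for µ1 - µ3 is
\[
(4.294 - 4.41) \pm 2.306 \sqrt{.00843\left(\frac{2}{5}\right)} = -.116 \pm 0.13390694 = (-0.24990694,\ 0.01790694).
\]
One can verify that the CI for µ1 - µ2 is (-.01791, .249907) and the CI for µ2 - µ3 is (-.36591, -.09809).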
                      763        1335         596        3742        1632
                     4365        1262        1448        1833        5078
                     2144         217        1183         375        3010
                     1998        4100        3200        2010         671
                     5412        2948         630         743        2145
                      957        3210         942         867        4063
                     1286         867        1285        1233        1232
                      311        3744         128        1072        1456
                      863        1635         844        3105        2735
                     1499         643        1683        1767         767
$T_i$               19598       19961       11939       16747       22789
$T_i^2$         384081604   398441521   142539721   280462009   519338521
$T_i^2/n_i$      38408160    39844152    14253972    28046201    51933852

$T = \sum_i T_i = 91034$
$CM = T^2/n = 165743783$
$\sum_i \sum_j x_{ij}^2 = 249897602$
Total SS $= \sum_i \sum_j x_{ij}^2 - CM = 84153819$
$\sum_i T_i^2/n_i = 172486337.6$
SST $= \sum_i T_i^2/n_i - CM = 6742554$
SSE = Total SS - SST = 77411264
ANOVA
Source of Variation SS
Between Groups 6742554.48
Within Groups 77411264.4
Total 84153818.88
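The worksheet computation can be checked with a short Python sketch that applies the computing formulas (CM, Total SS, SST, SSE) to the claims data, where each column of the data block is one group of ten claims. The function and variable names are illustrative.

```python
# Claims data as printed: ten rows, one column per group of policy holders.
claims = [
    [763, 1335, 596, 3742, 1632],
    [4365, 1262, 1448, 1833, 5078],
    [2144, 217, 1183, 375, 3010],
    [1998, 4100, 3200, 2010, 671],
    [5412, 2948, 630, 743, 2145],
    [957, 3210, 942, 867, 4063],
    [1286, 867, 1285, 1233, 1232],
    [311, 3744, 128, 1072, 1456],
    [863, 1635, 844, 3105, 2735],
    [1499, 643, 1683, 1767, 767],
]


def one_way_anova(groups):
    """ANOVA quantities for a completely randomized design."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    totals = [sum(g) for g in groups]
    cm = sum(totals) ** 2 / n                                # correction for the mean
    total_ss = sum(x * x for g in groups for x in g) - cm
    sst = sum(t ** 2 / len(g) for t, g in zip(totals, groups)) - cm
    sse = total_ss - sst
    mst, mse = sst / (k - 1), sse / (n - k)
    return {"totals": totals, "CM": cm, "TotalSS": total_ss,
            "SST": sst, "SSE": sse, "MST": mst, "MSE": mse, "F": mst / mse}


groups = [list(column) for column in zip(*claims)]   # transpose rows into groups
crd = one_way_anova(groups)
print(crd["totals"])
print(f"SST = {crd['SST']:.2f}, SSE = {crd['SSE']:.2f}, F = {crd['F']:.6f}")
```

The resulting F ratio is below 1, in line with the large p-value reported in the EXCEL output.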
The chi-square statistic for testing independence is also applicable when testing
Ho: p1 = p2 = …= pk.
Example: In a shop study, a set of data was collected to determine whether or not the
proportion of defectives produced by workers was the same for the day,
evening, or night shift work. The following data were collected:
Shift
Day Evening Night
Defectives 45 55 70
Nondefectives 905 890 870
Use a .025 level of significance to determine if the proportion of defectives is the same
for all three shifts.
Shift
Day Evening Night Total
Defectives 45 (57.0) 55 (56.7) 70 (56.3) 170
Nondefectives 905 (893.0) 890 (888.3) 870 (883.7) 2665
Total 950 945 940 2835
\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = 6.288
\]
v = (r-1)(c-1) = (2-1)(3-1) = 2
$\chi^2_{.025,\,2} = 7.378$
Decision: Since 6.288 < 7.378, do not reject Ho and conclude that the proportion of defectives produced is about the
same for all shifts.
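The test statistic for the shift example can be recomputed with the Python sketch below. It derives each expected count as (row total)(column total)/(grand total) without rounding, so the result (roughly 6.23) differs slightly from the 6.288 obtained with expected counts rounded to one decimal; the decision is unchanged. The critical value 7.378 is taken from the notes, and the function name is illustrative.

```python
# Observed counts: rows are defectives / nondefectives,
# columns are the day, evening and night shifts.
observed = [
    [45, 55, 70],
    [905, 890, 870],
]


def chi_square_statistic(table):
    """Chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total   # expected count
            chi2 += (o - e) ** 2 / e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


chi2, df = chi_square_statistic(observed)
critical = 7.378                      # chi-square table, alpha = .025, 2 df
reject = chi2 > critical
print(f"chi-square = {chi2:.3f}, df = {df}, reject Ho: {reject}")
```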
Goodness-of-Fit Test
To illustrate, consider the following frequency distribution table constructed from the
lives of 40 similar car batteries. The batteries are guaranteed to last 3 years. Let us test
the hypothesis that the frequency distribution may be approximated by a normal
distribution with mean $\bar{x} = 3.41$ and standard deviation $s = .703$.
Class Boundaries Oi
1.45-1.95 2
1.95-2.45 1
2.45-2.95 4
2.95-3.45 15
3.45-3.95 10
3.95-4.45 5
4.45-4.95 3
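If the observed frequencies are close to the corresponding expected frequencies, the
\[
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
\]
value will be small, indicating a good fit. If the observed
frequencies differ considerably from the expected frequencies, the $\chi^2$ value will be large and
the fit is poor.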
We reject Ho: good fit if $\chi^2 > \chi^2_{\alpha,v}$. The expected frequencies should be at least 5. This
restriction may require the combining of adjacent cells resulting in a reduction of the
number of degrees of freedom.
Going back to the example, the expected frequency for each class/cell is obtained from
the normal curve having the same mean and standard deviation as our sample. These
values will be used for µ and σ in computing z values corresponding to the class
boundaries. For the first interval, we solve P(X < 1.95). For the last interval, we solve
P(X > 4.45). For the 4th interval, we solve P(2.95 < X < 3.45) = P(-.65 < Z < .06) = .2661
so that E4 = .2661(40) = 10.6.
Class Boundaries Oi Ei
1.45-1.95 2 0.8
1.95-2.45 1 2.7
2.45-2.95 4 6.9
2.95-3.45 15 10.6
3.45-3.95 10 10.2
3.95-4.45 5 6.0
4.45-4.95 3 2.8
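Combining adjacent classes so that each expected frequency is at least 5 gives:
Class Boundaries Oi Ei
1.45-2.95 7 10.4
2.95-3.45 15 10.6
3.45-3.95 10 10.2
3.95-4.95 8 8.8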
Thus,
\[
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = 3.015
\]
The number of degrees of freedom for this test is 4-3 = 1, since three quantities – the total
frequency, mean and standard deviation – of the observed data were required to find the
expected frequencies. Since .205,1 = 3.841, we have no reason to reject Ho and conclude
that the normal distribution provides a good fit for the distribution of battery lives.
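The battery-life computation can be sketched in Python with the standard library's `statistics.NormalDist`. Expected frequencies computed from unrounded z values differ a little from the tabled ones (which use z rounded to two decimals), so the sketch reports the statistic both ways. The helper names and the pooling of the first three and last two classes follow the worked example and are otherwise illustrative.

```python
from statistics import NormalDist

boundaries = [1.45, 1.95, 2.45, 2.95, 3.45, 3.95, 4.45, 4.95]
observed = [2, 1, 4, 15, 10, 5, 3]
n = sum(observed)                                  # 40 batteries
fitted = NormalDist(mu=3.41, sigma=0.703)          # sample mean and s.d.


def expected_frequencies(dist, bounds, total):
    """Expected counts per class; first and last classes are open-ended."""
    cdf = [0.0] + [dist.cdf(c) for c in bounds[1:-1]] + [1.0]
    return [total * (hi - lo) for lo, hi in zip(cdf, cdf[1:])]


def pool(values):
    """Combine the first three and the last two classes, as in the example."""
    return [sum(values[:3]), values[3], values[4], sum(values[5:])]


def chi_square(obs, exp):
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))


expected = expected_frequencies(fitted, boundaries, n)
chi2_unrounded = chi_square(pool(observed), pool(expected))
chi2_tabled = chi_square([7, 15, 10, 8], [10.4, 10.6, 10.2, 8.8])

critical = 3.841                                   # chi-square table, alpha = .05, 1 df
print([round(e, 1) for e in expected])
print(f"tabled E: {chi2_tabled:.3f}, unrounded E: {chi2_unrounded:.3f}")
```

Either way the statistic stays below 3.841, so the conclusion of a good fit does not depend on the rounding.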
Time Series Analysis
Some Applications:
• To forecast future values of Y. It is assumed that some of the patterns observed in
the past will continue into the future. Thus, if quantifiable information about the
past can be measured then this can be used to forecast what will happen in the
future. Forecasting is an important aid in effective and efficient planning.
• To facilitate comparisons with data for past years. For example, time series data
can be used to answer the question whether or not the recent increase in
unemployment is normal for this time of the year.
• To identify indicators that coincide with or precede a change in direction of a time
series (called a cyclical turning point) and help in anticipating such changes.
TREND describes the long-term sweep of the series and usually modeled by a
smooth curve. There are many types of trends such as linear (a constant
amount of increase/decrease in the trend value from one period to the
next) and exponential (the trend value changes at a constant rate from one
period to the next).
SEASONAL describes the short-term recurring pattern of change in the series and
consists of relatively repetitious cycles of fixed amplitude and duration.
CYCLICAL movements in a time series that, like seasonal variations, are recurrent but
that, unlike seasonal variations, occur in cycles longer than one year. This
pattern exists when the series is influenced by longer-time economic
fluctuations.
IRREGULAR describes the miscellaneous, erratic movements in the series and tends to
have an irregular, saw-toothed pattern.
General Methods:
• Averaging methods wherein past observations are given equal weights in
evaluating the forecast
• Exponential smoothing methods, wherein past observations are given unequal
weights that decay exponentially
Single Moving Average
Step 1 Choose the number of periods T to be used in the computation of the forecast.
The larger the value of T, the greater the smoothing effect. The smaller the value
of T, the more the moving averages follow the pattern of the data.
\[
F_{T+1} = \frac{\sum_{t=1}^{T} Y_t}{T}
\]
\[
F_{T+2} = \frac{\sum_{t=2}^{T+1} Y_t}{T}
\]
\[
\vdots
\]
\[
F_{T+k} = \frac{\sum_{t=k}^{T+k-1} Y_t}{T}
\]
Note that the oldest observation is dropped as each new observation becomes
available.
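A minimal Python sketch of the single moving average follows. It uses the eleven monthly observations that appear in the exponential smoothing tables later in these notes; the function name and the choice of T = 3 and T = 5 are illustrative.

```python
def moving_average_forecasts(series, window):
    """Return [F_{T+1}, F_{T+2}, ...] where T = window.

    Each forecast is the mean of the `window` most recent observations,
    so the oldest value drops out as each new one comes in.
    """
    return [sum(series[start:start + window]) / window
            for start in range(len(series) - window + 1)]


# Monthly series Y_1, ..., Y_11 from the exponential smoothing tables.
y = [200, 135, 195, 197.5, 310, 175, 155, 130, 220, 277, 235]

f3 = moving_average_forecasts(y, 3)    # forecasts for periods 4 through 12
f5 = moving_average_forecasts(y, 5)    # forecasts for periods 6 through 12
print(f"3-period forecast for period 12: {f3[-1]:.1f}")
print(f"5-period forecast for period 12: {f5[-1]:.1f}")
```

The 5-period forecasts respond more slowly to recent swings than the 3-period ones, which illustrates the smoothing effect of a larger T.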
Single Exponential Smoothing
Step 1 Choose the weight α (between 0 and 1) that will give the smallest forecast error.
A large value of α gives very little smoothing in the forecast, whereas a small
value of α gives considerable smoothing.
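\[
MAE = \frac{\sum_{t=1}^{T} |e_t|}{T}
\]
\[
MSE = \frac{\sum_{t=1}^{T} e_t^2}{T}
\]
\[
MAPE = \frac{\sum_{t=1}^{T} \left| \frac{e_t}{Y_t} \right| \times 100\%}{T}
\]
\[
MSPE = \frac{\sum_{t=1}^{T} \left( \frac{e_t}{Y_t} \times 100\% \right)^2}{T}
\]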
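The four error measures can be sketched in Python as below. The sketch assumes the usual definition of the forecast error, $e_t = Y_t - F_t$, which the notes do not state explicitly, and the sample numbers are made up for illustration.

```python
def forecast_errors(actual, forecast):
    """Errors e_t = Y_t - F_t (assumed definition)."""
    return [y - f for y, f in zip(actual, forecast)]


def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)


def mse(errors):
    return sum(e ** 2 for e in errors) / len(errors)


def mape(errors, actual):
    """Mean absolute percentage error, in percent."""
    return sum(abs(e / y) * 100 for e, y in zip(errors, actual)) / len(errors)


def mspe(errors, actual):
    """Mean squared percentage error, in squared percent."""
    return sum((e / y * 100) ** 2 for e, y in zip(errors, actual)) / len(errors)


# Hypothetical actual values and forecasts.
actual = [100, 200, 400]
forecast = [110, 190, 400]
errors = forecast_errors(actual, forecast)

print(mae(errors), mse(errors), mape(errors, actual), mspe(errors, actual))
```

In practice these would be evaluated for several candidate weights α, keeping the one with the smallest error.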
SES with $F_1 = Y_1$
$Y_t$   $F_t$, α = .1   $F_t$, α = .5   $F_t$, α = .9
200 200 200 200
135 200 200 200
195 193.5 167.5 141.5
197.5 193.65 181.25 189.65
310 194.035 189.375 196.715
175 205.6315 249.6875 298.6715
155 202.5684 212.3438 187.3672
130 197.8115 183.6719 158.2367
220 191.0304 156.8359 132.8237
277 193.9273 188.418 211.2824
235 202.2346 232.709 270.4282
DEC 205.5111 233.8545 238.5428
SES with $F_1 = \bar{Y}$
$Y_t$   $F_t$, α = .1   $F_t$, α = .5   $F_t$, α = .9
200 202.6818 202.6818 202.6818
135 202.4136 201.3409 200.2682
195 195.6723 168.1705 141.5268
197.5 195.605 181.5852 189.6527
310 195.7945 189.5426 196.7153
175 207.2151 249.7713 298.6715
155 203.9936 212.3857 187.3672
130 199.0942 183.6928 158.2367
220 192.1848 156.8464 132.8237
277 194.9663 188.4232 211.2824
235 203.1697 232.7116 270.4282
DEC 206.3527 233.8558 238.5428
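The two tables can be reproduced with the Python sketch below. The updating rule $F_{t+1} = \alpha Y_t + (1-\alpha) F_t$ is not written out in this part of the notes; it is inferred from the tabulated values (for example, .1(135) + .9(200) = 193.5) and is the standard single exponential smoothing recursion. The function name is illustrative.

```python
def ses_forecasts(series, alpha, initial):
    """Return [F_1, ..., F_{T+1}] using F_{t+1} = alpha*Y_t + (1 - alpha)*F_t."""
    forecasts = [initial]
    for y in series:
        forecasts.append(alpha * y + (1 - alpha) * forecasts[-1])
    return forecasts


# Observed series Y_1, ..., Y_11 from the tables above.
y = [200, 135, 195, 197.5, 310, 175, 155, 130, 220, 277, 235]
y_bar = sum(y) / len(y)

for alpha in (0.1, 0.5, 0.9):
    from_first = ses_forecasts(y, alpha, initial=y[0])     # F_1 = Y_1
    from_mean = ses_forecasts(y, alpha, initial=y_bar)     # F_1 = mean of Y
    print(f"alpha = {alpha}: DEC forecast {from_first[-1]:.4f} "
          f"(F1 = Y1), {from_mean[-1]:.4f} (F1 = mean)")
```

The choice of starting value matters most for small α, where the initial forecast is discounted slowly; for α = .9 the two December forecasts agree to four decimals.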
[Figure: 3-month MA]
[Figure: 5-month MA]