Module_BasicStatistics
2013 E. C
Axum, Ethiopia
CHAPTER ONE
STATISTICS REFRESHER
1.1. Introduction
1 Business Statistics
Techno Star College
Definition of statistics – Statistics can be defined in two basic senses.
These are:
a. Statistics as statistical data (plural sense)
b. Statistics as a method (singular sense)
a. Statistics defined as data (Plural sense)
In this sense, Prof. Horace Secrist defines Statistics as follows: “Statistics refer to the aggregates of facts affected
to a marked extent by multiplicity of causes, numerically expressed, enumerated or estimated
according to reasonable standards of accuracy, collected in a systematic manner for a pre-
determined purpose and placed in relation to each other.”
This definition makes it clear that Statistics (as numeric data) should possess the following
characteristics.
i. Statistics should be aggregates of facts
Single and isolated figures are not Statistics for the simple reason that such figures are unrelated
and can’t be compared. According to this aspect, to be Statistics, data must be in aggregate
(mass) and also the individual elements within the aggregate should relate to a common
phenomenon so that they can be compared to one another.
ii. Statistics should be affected to a marked extent by multiplicity of
causes
Since Statistics are most commonly used in the social sciences, it is natural that they are affected by a
large variety of factors at the same time. Put differently, Statistics are not caused by
a single factor (force); rather, they are the outcome of a number of factors (forces)
operating together.
iv. They should be enumerated or estimated according to reasonable
standards of accuracy
Numerical statements can either be enumerated, in which case they are supposed to be accurate
and precise or else they can be estimated by some expert observers, in which case 100%
accuracy is unlikely to be attained. In the process of estimation, reasonable standards of accuracy
must, however, be attained.
v. They should be collected in a systematic manner
If data are collected in a haphazard manner, the results obtained are likely to lead to
fallacious conclusions. Therefore, it is essential that Statistics be collected in a systematic
manner so that they conform to reasonable standards of accuracy.
vi. They should be collected for a predetermined purpose.
Statistics collected without any predetermined purpose do not serve any useful purpose.
Therefore, the purpose of collecting Statistics should be defined clearly before they are collected.
That is, figures (Statistics) should be collected with some goal or target in view. Moreover, the
data should be collected in such a manner that they meet the predetermined needs.
vii. They should be placed in relation to each other
For numerical facts to be called Statistics, they should be comparable either period wise or
region wise, or in reference to some other means of comparison.
b. Statistics defined as a method (Singular sense)
The second definition of Statistics refers to the science or the methods of Statistics. It is in
this second sense that we consider Statistics as a subject. In this regard,
Statistics may be defined as:
“Statistics is the science which deals with the methods of collecting, classifying, presenting,
comparing (analyzing) and interpreting numerical data collected to throw some light on any
sphere of enquiry.”
Seligman
“Statistics is the method of judging collective, natural or social phenomenon from the results
obtained from the analysis or enumeration or collection of estimates.” King
“Statistics is the application of the scientific method in the analysis of numerical data for the
purpose of making rational decisions.” Berenson and Levin
“Statistics is the collection, presentation, analysis and interpretation of numerical data.”
Croxton and Cowden
Summing up all the above definitions, one can define Statistics preferably as: Statistics is the
study of the principles and methods used in the collection, presentation, analysis and
interpretation of numerical data in any sphere of enquiry.
1.2. BASIC TERMINOLOGIES IN STATISTICS
As a subject (science), Statistics has its own terminology. Knowing these terms is
fundamental to understanding Statistical methods and concepts.
1. Variable: is a factor or characteristic that can take on different possible values or outcomes.
A variable differs from a constant in that the latter term implies that the values or outcomes
are always the same. Income, height, weight, sex, age, etc. are examples of variables. In an
investigation, data are collected about one or more variables of interest. A variable can be
qualitative or quantitative (numeric).
2. Elementary Unit: is a specific person, business, product, account, and so on, with some
characteristic to be measured or categorized.
3. Population: In Statistics the term population is used to mean the totality of cases (items)
under consideration in a given investigation or research. In other words, the largest collection
of observations on a variable constitutes the population. A population can be finite (limited in
its size) or infinite (unrestricted). In a finite population, observations are countable, at least in
theory. In contrast, an infinite population is indefinitely large: the observations cannot be counted, even
in theory.
4. Sample: Any non-empty subset of a population is called a sample. There are different
possible samples that can be selected from a single population. Nevertheless, the one that
best reflects or represents the behavior of the population is considered to be the most
appropriate one. The critical question is “How to identify and get that best representative
sample?” In fact, the whole aim of the theory of sampling is to answer this question.
5. Parameter: is a measurable characteristic of the population; that is, a numerical result
obtained by measuring the population.
6. Statistic: is a measurable characteristic of the sample. In short, it is a sample result.
7. Survey: A survey or experiment is a means of obtaining the desired data.
8. Statistical Design: is a process that involves defining a decision problem and choosing an approach to
solving it. It is a guide that indicates how an investigation is going to be conducted.
9. Frame: is the listing of all elementary units in the population under study. Strictly speaking,
one cannot present a frame for an infinite population, as the units in an infinite population are
infinite.
1.3. Types of Statistics
Broadly speaking, Statistical methods are classified into two groups or areas based on how data
are used. These areas are:
a. Descriptive Statistics
b. Inferential Statistics
a. Descriptive Statistics
Descriptive Statistics consists of the collection, organization, summarization, and presentation of
numerical data. It is concerned with describing certain characteristics of a set of observed data
(usually a sample) – that is, what it is shaped like, what number the values tend to cluster
(converge) around, how much variation is present in the data, and so forth.
In short, descriptive Statistics describes the nature or characteristics of a data set without drawing
conclusions or making generalizations. The following are some examples of descriptive Statistics.
56% of the students in Accounting & Finance 2nd Year section “A” are male.
Marks of 50 students in a Mathematics for Finance course are found to range from 30 to 85.
b. Inferential Statistics
Inferential Statistics, also called inductive Statistics, is concerned with the process of drawing
conclusions (inferences) about specific characteristics of a population based on information
obtained from samples, performing hypothesis testing, determining relationships among
variables, and making predictions. The whole aim of inferential Statistics is
to give reasonable estimates of unknown population parameters.
The following is an example of inferential Statistics:
The result obtained from the analysis of the income of 1000 randomly selected citizens in
Ethiopia suggests that the average per capita income of a citizen in Ethiopia is 30 Birr.
Note: In the above example we are trying to represent the income of the population of Ethiopia
(about 65 million people) by a sample of 1000 citizens; hence we are making an inference or
generalization.
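The difference between a parameter and a statistic can be illustrated with a small simulation. The sketch below is only an illustration: the population is hypothetical (randomly generated incomes, not real Ethiopian data), and the names `population`, `sample`, `population_mean` and `sample_mean` are chosen here for the example.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch gives the same numbers on every run

# Hypothetical population: incomes (in Birr) of 100,000 people.
population = [random.uniform(10, 50) for _ in range(100_000)]

# Draw a simple random sample of 1000 people without replacement.
sample = random.sample(population, 1000)

# Parameter: computed from the whole population (usually unknown in practice).
population_mean = statistics.mean(population)
# Statistic: computed from the sample and used to estimate the parameter.
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two printed values are close but not identical; the gap between them is the sampling error that inferential Statistics tries to quantify.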
1.4. FUNCTION OF STATISTICS
The main function of Statistics is to collect and present numerical data in a systematic manner so
that it may be analyzed in a scientific way. Statistics basically concentrates on the analysis of a
phenomenon in a scientific manner, without proving it. The analysis of data, which is the core
objective of Statistics, is important because it helps to avoid or replace arbitrary decisions,
dogmatism, rule of thumb and tradition, and it encourages the practice of making decisions based
on analyzed quantitative facts.
The following are the major functions of Statistics:
i. It simplifies mass of data (condensation)
The human mind is not capable of assimilating a huge mass of facts and figures,
as they are complex to understand. Statistical methods reduce this complexity by making
huge data sets easily intelligible and readily understandable. That is, Statistical methods provide the
necessary means to condense a mass of data and present it with the help of simple figures such
as averages, ratios, variations, measures of skewness, coefficients, etc. More attractive and
understandable presentations of data are also made with the help of diagrams and graphs.
iv. Predictions
One of the major reasons that makes Statistical methods so critical in Business is their prediction
function. Prediction is the process of making a scientific guess about the future value of a
variable. Statistical methods make it possible to predict the likely future value of a variable based
on its past trend. Time series and regression analysis are the most commonly used methods
of prediction.
The fact that Statistics is applicable in almost all fields of study is not a guarantee of its
perfection. Of course, there is no perfect science in the world. Statistical methods also have
their own limitations. The following are the major limitations:
aggregate and hence can be considered as data. Alternatively, the semester GPA of a single
student for 4 semesters also forms a Statistical data. In short, Statistical methods are suited only
to those problems or situations where group characteristics are desired to be studied.
Data obtained from a primary source is called primary data. Likewise, data gathered from a
secondary source is known as secondary data.
In most cases, secondary data is obtained from such sources as census and survey reports, books,
official records, reported experimental results, previous research papers, bulletins, magazines,
newspapers, web sites, and other publications. Different organizations and government agencies
publish information (data) in the form of reports, periodicals, journals, etc. In the case of
Ethiopia, the Central Statistics Authority (CSA) is the first to be mentioned in publishing such
relevant information (secondary data).
Advantages and Disadvantages of Primary and Secondary data
The following are major advantages of primary data over that of secondary data.
The primary data gives more reliable, accurate and adequate information, which is
suitable to the objective and purpose of an investigation.
Primary source usually shows data in greater detail.
Primary data is free from errors that may arise from copying of figures from
publications, which is the case in secondary data.
The disadvantages of primary data are:
The process of collecting primary data is time consuming and costly.
Often, primary data gives misleading information due to lack of integrity of
investigators and non-cooperation of respondents in providing answers to certain
delicate questions.
Advantages of Secondary data:
It is readily available and hence convenient and much quicker to obtain than primary
data,
It reduces time, cost and effort as compared to primary data,
Secondary data may be available in subjects (cases) where it is impossible to collect
primary data. Such a case can be regions where there is war.
Some of the disadvantages of Secondary data are:
Data obtained may not be sufficiently accurate,
Data that exactly suit our purpose may not be found,
In the personal enquiry method, a question sheet called a schedule is prepared. The schedule
contains all the questions that would extract a complete report from a respondent. Usually,
schedules are pre-tested so as to remove discrepancies such as ambiguous
and irrelevant questions. This pre-testing process is called a pilot survey. It is worth mentioning
that the schedule is not given directly to the respondent. Rather, it is the interviewer who asks
the questions on the schedule and jots down the interviewee’s (respondent’s) responses.
Depending on the nature of the interview, personal enquiry method is further classified into two
types based on the contact to a person.
Direct Personal Interview: It is a type of personal enquiry where there is face-to-face
contact with the persons from whom the information is to be obtained. In other words, the
investigator contacts each respondent personally, without the interference of a third party,
asks the questions given in the schedule one by one and notes down the respondent’s replies on the
schedule.
Indirect Personal Enquiry (Interview): It is the second type of personal enquiry, where the
investigator contacts third parties, called witnesses, who are capable of supplying the
necessary information. Here, the information is not collected directly from the respondent
but from a third person who knows the respondent well. Such an approach is useful in cases
where the respondent is expected to conceal information about himself or herself.
Interviews can also be classified into two types based on how the questions are designed. These are:
a. Structured interview – is rigid: the order and structure (wording) of the questions are
predetermined and should be the same for every respondent.
b. Unstructured interview – is flexible in terms of the order and structure of the questions.
B. Observation
Observation is careful and systematic seeing and listening while something is going on in its
natural state. In this approach, an investigator stays at the place of survey and notes down the
observations himself or herself. No enquiries are made in the case of observation. Observation is more
experimental and is usually applied in scientific studies. It is time consuming and also costly.
There are two types of observation.
c) Questionnaire Method
A questionnaire is a written list of questions whose answers are to be filled in by the respondents. Under
this method, a list of questions related to the survey is prepared and sent to the various
respondents by post, Web sites, e-mail, etc. However, this method cannot be used if the
respondent is illiterate. It is a method that is often used in statistical investigations.
The following are the major points that we need to take into account while preparing a
questionnaire.
The number of questions should be small. Naturally, respondents are not comfortable with
lengthy questionnaires. Lengthy questionnaires usually bore respondents. Hence, fifteen to
twenty five questions in a questionnaire are optimal. If a lengthy questionnaire is
unavoidable, it should preferably be divided into two or more parts.
The questions should be short, clear, simple and unambiguous. Moreover, the questions must
be arranged in a logical order so that natural and spontaneous reply to each is induced. For
instance, it is not appropriate to ask a person how many packets of cigarette he/she smokes
before asking whether he/she smokes or not.
Questions of sensitive nature should be avoided. Sensitive questions are those questions that
are too personal and pecuniary like “Sources of income”, “Drinking habit”, etc. The logic
here is that respondents do not willingly answer sensitive questions. Such information, if
necessary, may be gathered through interview or through other indirect questions.
Questions should be capable of objective answers. As much as possible, avoid subjective
questions and keep to questions of fact. To this end, multiple answer questions can be used.
Mail questionnaires should be accompanied by a covering letter, which should state the
purpose of the questionnaire, promise confidentiality of responses, etc.
Furthermore, the questions should preferably be designed so that they can easily be answered with yes or no.
Forms of questions – there are two types or forms of questions. These are:
a. Closed ended questions: the possible answers are given.
E.g. How many total credits are you taking this semester?
A. < 15      C. 19 – 24
B. 15 – 18   D. above 24
b. Open ended questions – the possible answers are not given; the respondent gives what
he/she feels is the right answer.
E.g. How many total credits are you taking this semester? _______________
1.6.2. STAGES IN STATISTICAL INVESTIGATION
Recall that according to Croxton and Cowden, Statistics is defined as the collection,
presentation, analysis and interpretation of numerical data. A slight extension of this
definition leads to the five stages of Statistical investigation. That is, in addition to collection,
presentation, analysis and interpretation, a Statistical investigation involves one more stage,
which is organization of data. These five stages constitute a complete Statistical study or survey.
Following are brief explanations about the purpose of each stage.
Stage 1: Data Collection: is the process of gathering information or data about the
variable of interest. Data are inputs for Statistical investigation. Data may be obtained either
from primary source or secondary source.
Stage 2: Organization of Data: includes three major steps.
a. Editing is the process of checking and correcting data for omissions, inconsistencies,
irrelevant answers and wrong computations in the collected data. Data collected with utmost
care simplify the task of editing, and vice versa.
b. Classification is a task of grouping the collected and edited data into different similar
categories based on some criterion. Basically, there are four kinds of classifications
namely, Geographical, Chronological, Qualitative and Quantitative.
c. Tabulation: The objective of tabulation is to put the classified data in the form of table.
Tabulation of data should be done in such a form that it suits the nature and objectives of
the investigation. The importance of proper tabulation is immense because if the
tabulation of data is poor, its analysis will not only be difficult but also defective.
vi. Range (R): is the difference between the largest (L) and the smallest (S) values in a
data set: R = L – S
RULES FOR FORMING A GFD
To construct a GFD the following points should be considered
1) The classes should be clearly defined. That is, each observation should fall into one and only
one class.
2) The number of classes should be neither too large nor too small.
Normally, 5 to 20 classes are recommended.
3) All the classes should be of the same width. An approximate suitable class width can be
obtained as:
\[ cw \approx \frac{\text{Range}}{\text{Number of classes}}, \quad \text{i.e.} \quad cw = \frac{R}{n} = \frac{L - S}{n} \]
Example. Let R/n = 6.8263.
If all the observations are whole numbers, cw = 7
If all the observations are to one decimal place, cw = 6.8
If all the observations are to two decimal places, cw = 6.83, etc.
Note that a suitable number of classes can be obtained by using the formula
n ≈ 1 + 3.322 log N, rounded
up/down to the nearest whole number, where N is the total number of observations.
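The two rules above can be sketched in a few lines of Python. This is a minimal sketch, assuming that log means the base-10 logarithm and that the result is rounded to the nearest whole number; the function names `suggested_classes` and `rounded_width` are made up for this illustration.

```python
import math


def suggested_classes(N):
    """Number of classes n, taken as 1 + 3.322 * log10(N), rounded to a whole number."""
    return round(1 + 3.322 * math.log10(N))


def rounded_width(raw_width, decimals):
    """Round the raw class width R / n to the precision of the observations."""
    return round(raw_width, decimals)


print(suggested_classes(30))   # 30 observations -> 6 classes

# The raw width 6.8263 from the example, at 0, 1 and 2 decimal places.
for decimals in (0, 1, 2):
    print(rounded_width(6.8263, decimals))   # 7.0, then 6.8, then 6.83
```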
Remark: Unequal class intervals create problems in graphing and in computing some statistical
measures.
4) Determine the class limits
i. Determine the lower class limit of the first class (LCL1), then
LCL2 = LCL1 + cw, LCL3 = LCL2 + cw,… LCLi+1 = LCLi + cw
ii. Determine the upper class limit of the first class (UCL1) i.e.
UCL1 = LCL1 + cw – u, where u = the unit of measurement, then
UCL2 = UCL1 + cw, UCL3 = UCL2 + cw, … , UCLi+1 = UCLi + cw
5) Complete the GFD with the respective class frequencies.
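Steps 4 and 5 can be sketched as follows. The starting limit 3, class width 4 and five classes are chosen to reproduce the classes 3 – 6, 7 – 10, … used in this chapter; the list `observations` is hypothetical, and the function names are assumptions made for the illustration.

```python
def class_limits(lcl1, cw, k, u=1):
    """Return k (LCL, UCL) pairs using LCL_{i+1} = LCL_i + cw and UCL_i = LCL_i + cw - u."""
    return [(lcl1 + i * cw, lcl1 + i * cw + cw - u) for i in range(k)]


def class_frequencies(data, limits):
    """Count the observations falling in each class (both limits are inclusive)."""
    return [sum(lo <= x <= hi for x in data) for lo, hi in limits]


limits = class_limits(lcl1=3, cw=4, k=5)     # unit of measurement u = 1
observations = [3, 5, 7, 8, 12, 15, 22]      # hypothetical whole-number data

print(limits)                                   # [(3, 6), (7, 10), (11, 14), (15, 18), (19, 22)]
print(class_frequencies(observations, limits))  # [2, 2, 1, 1, 1]
```

Because the upper limit of each class is one unit below the lower limit of the next, every whole-number observation falls into one and only one class, as rule 1 requires.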
Class (xi)   Frequency (fi)   ‘Less than’ cf (<Cfi)   ‘More than’ cf (>Cfi)
3 – 6        4                4                       30
7 – 10       7                11                      26
11 – 14      10               21                      19
15 – 18      6                27                      9
19 – 22      3                30                      3
This means that, from the ‘less than’ cumulative frequency distribution, there are 4 observations less
than 6.5, 11 observations below 10.5, etc., and from the ‘more than’ cumulative frequency
distribution, 30 observations are above 2.5, 26 above 6.5, etc.
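The two cumulative frequency columns can be checked with a short Python sketch. It uses the class frequencies 4, 7, 10, 6 and 3 of the example; the variable names are chosen for this illustration.

```python
from itertools import accumulate

frequencies = [4, 7, 10, 6, 3]   # classes 3-6, 7-10, 11-14, 15-18, 19-22

# 'Less than' cf: running total from the first class downwards.
less_than_cf = list(accumulate(frequencies))

# 'More than' cf: running total from the last class upwards.
more_than_cf = list(accumulate(reversed(frequencies)))[::-1]

print(less_than_cf)   # [4, 11, 21, 27, 30]
print(more_than_cf)   # [30, 26, 19, 9, 3]
```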
RELATIVE FREQUENCY DISTRIBUTION (RFD)
It enables the researcher to know the proportion or percentage of cases in each class. Relative
frequencies can be obtained by dividing the frequency of each class by the total frequency. It
can be converted in to a percentage frequency by multiplying each relative frequency by 100%.
i.e.
\[ Rf_i = \frac{f_i}{n} \]
Where Rfi – is the relative frequency of the ith class
fi – is the frequency of the ith class
n – is the total number of observations
Note: Pfi = Rfi × 100%
Where Pfi is the percentage frequency of the ith class.
Example: The relative and percentage frequency distributions of the above example are:
xi         fi    Rfi      %freq. (Pfi)
3 – 6      4     4/30     4/30 × 100 = 13.33%
7 – 10     7     7/30     7/30 × 100 = 23.33%
11 – 14    10    10/30    10/30 × 100 = 33.33%
15 – 18    6     6/30     6/30 × 100 = 20%
19 – 22    3     3/30     3/30 × 100 = 10%
Total      30    1        100%
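The same table can be produced with a few lines of Python. This is a sketch using the frequencies of the example; the variable names are illustrative.

```python
classes = ["3 - 6", "7 - 10", "11 - 14", "15 - 18", "19 - 22"]
frequencies = [4, 7, 10, 6, 3]

n = sum(frequencies)                          # total number of observations
relative = [f / n for f in frequencies]       # Rf_i = f_i / n
percentage = [100 * rf for rf in relative]    # Pf_i = Rf_i x 100%

for cls, f, rf, pf in zip(classes, frequencies, relative, percentage):
    print(f"{cls:>8}  {f:>3}  {rf:.4f}  {pf:6.2f}%")
```

The relative frequencies always add up to 1 and the percentage frequencies to 100%, which is a quick check on the arithmetic.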
Presentation is a statistical procedure of arranging and putting data in a form of tables, graphs,
charts and/or diagrams.
1. Histogram: is a graph consisting of a series of adjacent rectangles whose bases are equal
to the class width of the corresponding classes and whose heights are proportional to the
corresponding class frequencies. Here, class boundaries are marked along the horizontal axis
(x – axis) and the class frequencies along the vertical axis (y – axis) according to a suitable
scale. It describes the shape of the data.
Example. Consider the following GFD and construct a histogram.
Class (xi) Frequency (fi)
3–6 4
7 – 10 7
11 – 14 10
15 – 18 6
19 - 22 3
Total 30
Solution:
[Figure: histogram of the above GFD; horizontal axis: class boundaries; vertical axis: class frequency (fi)]
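Before drawing a histogram, the class limits have to be converted to class boundaries. The sketch below assumes the usual convention that half a unit of measurement (u/2) is subtracted from each lower limit and added to each upper limit, which agrees with the boundaries 2.5, 6.5, 10.5, … used in this chapter; the function names are made up for the illustration.

```python
def class_boundaries(lcl, ucl, u=1):
    """Lower and upper class boundaries: LCB = LCL - u/2, UCB = UCL + u/2."""
    return lcl - u / 2, ucl + u / 2


def class_mark(lcl, ucl):
    """Class mark (midpoint) of a class."""
    return (lcl + ucl) / 2


limits = [(3, 6), (7, 10), (11, 14), (15, 18), (19, 22)]
for lcl, ucl in limits:
    lcb, ucb = class_boundaries(lcl, ucl)
    print(f"{lcl:>2} - {ucl:<2}  boundaries {lcb} - {ucb}  class mark {class_mark(lcl, ucl)}")
```

The base of each rectangle then runs from LCB to UCB, so adjacent rectangles touch with no gaps.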
[Figure: graph with class marks (cmi) on the horizontal axis]
[Figure: ‘less than’ ogive; horizontal axis: upper class boundary (UCBi), from 6.5 to 22.5; vertical axis: less than cumulative frequency (<Cfi)]
B) ‘More than’ ogive: here, lower class boundaries are plotted against the ‘more than’
cumulative frequencies of their respective classes and the points are joined by line segments.
Example. Draw a ‘More than’ ogive for the frequency distribution in Example 11
Solution:
[Figure: ‘more than’ ogive; horizontal axis: lower class boundaries (LCBi), from 2.5 to 18.5; vertical axis: more than cumulative frequency (>Cfi)]
4. Line graph
It represents the relationship between time (on the x-axis) and values of variable (on the y-axis).
The values are recorded with respect to the time of occurrence.
Example. Draw a line graph for the following time series.
Solution:
[Figure: a line graph showing the above time series; horizontal axis: Year (1986 to 1991); vertical axis: Values]
Family A B C D E
Number of children 3 2 7 6 4
Solution:
[Figure: vertical line graph showing the number of children in families A, B, C, D and E; horizontal axis (X): family; vertical axis (Y): number of children]
[Figure: bar chart; horizontal axis: year (1980 to 1982)]
B. Multiple Bar Chart:
Here two or more bars are grouped together to represent two or more
interrelated sets of data in each category. The bars of related variables are kept adjacent to each other
for every set of values. These charts can be used if the overall total is not required. Each bar
is shaded or colored separately and a key is given to distinguish them.
Example: The following table shows the production of wheat and maize in hundreds of
quintals.
Year Maize Wheat
1980 40 80
1981 20 60
1982 60 100
Solution:
[Figure: multiple bar chart of wheat and maize production; horizontal axis: Year (1980 to 1982); vertical axis: number of quintals; key: maize, wheat]
C. Subdivided Bar Chart:
It is used to present data by subdividing a single bar in proportion to the component frequencies.
Each portion of the bar is then shaded or colored and a key is given to distinguish them.
Example: The number of quintals of wheat and maize (in millions of quintals) produced by
country x in the indicated years.
[Figure: subdivided bar chart of wheat and maize produced by country X; horizontal axis: Year (1980 to 1982); vertical axis: 0% to 100%; key: wheat, maize]
8. PIE CHART
A pie chart is a circle divided into various sectors with areas proportional to the values of the
components they represent. It shows the components in terms of percentages, not in absolute
magnitude. The angle formed at the center by each sector has to be proportional to the value
represented.
Example: The monthly expenditure of a certain family is given below.
[Figure: pie chart of monthly expenditure with sectors labelled Food, House rent, Clothing and Misc.; sector values shown: 300, 350, 100 and 250]
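The percentage and the central angle of each sector follow directly from the proportionality rule. The sketch below uses the four amounts that appear in the example (300, 350, 100 and 250); which amount belongs to which expenditure category is not assumed here, and the variable names are illustrative.

```python
amounts = [300, 350, 100, 250]   # the four expenditure amounts in the example
total = sum(amounts)

# Share of each component as a percentage of the total.
shares = [100 * a / total for a in amounts]
# Central angle of each sector in degrees (the full circle is 360 degrees).
angles = [360 * a / total for a in amounts]

for a, share, angle in zip(amounts, shares, angles):
    print(f"amount {a:>4}: {share:.0f}% of total, sector angle {angle:.0f} degrees")
```

For these amounts the shares are 30%, 35%, 10% and 25%, and the angles 108, 126, 36 and 90 degrees, which add up to 360.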
9. PICTOGRAPH (PICTOGRAM)
A pictograph is a graph that uses symbols or pictures to represent data.
Example: In comparing the population of a country from 2003 to 2005, we simply draw pictures
of people where each picture may represent 1,000,000 people.
[Figure: pictograph with rows labelled 1990, 1991 and 1992; key: one picture = 1,000,000 people]
i. \[ \bar{x} = \frac{\sum fx}{n} = \frac{2010}{50} = 40.20 \]
ii. \[ \bar{x} = A \pm \frac{\sum fd}{n} = 40 + \frac{10}{50} = 40.20 \]
For continuous series:
Direct method
\[ \bar{x} = \frac{\sum f \, cm_i}{n} \]
Step deviation method
\[ \bar{x} = A \pm \left( \frac{\sum f d'}{n} \right) \times c \]
Age in Year 5 - 15 15 - 25 25 - 35 35 – 45 45 - 55 55 - 65
No. of Persons 8 10 14 20 16 12
Solution:
Classes    f     cmi   f·cmi   d = cmi – A (A = 30)   fd      d' = d/c   fd'
5 - 15     8     10    80      -20                    -160    -2         -16
15 - 25    10    20    200     -10                    -100    -1         -10
25 - 35    14    30    420     0                      0       0          0
35 - 45    20    40    800     10                     200     1          20
45 - 55    16    50    800     20                     320     2          32
55 - 65    12    60    720     30                     360     3          36
Total      80          3020                           620                62
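i. \[ \bar{x} = \frac{\sum f \, cm}{n} = \frac{3020}{80} = 37.75 \]
ii. \[ \bar{x} = A + \frac{\sum fd}{n} = 30 + \frac{620}{80} = 30 + 7.75 = 37.75 \]
iii. \[ \bar{x} = A + \left( \frac{\sum fd'}{n} \right) \times c = 30 + \left( \frac{62}{80} \right) \times 10 = 30 + (0.775) \times 10 = 37.75 \]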
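The three computations can be verified with a short Python sketch. It uses the age distribution of the example with assumed mean A = 30 and class width c = 10; the variable names (`direct`, `short_cut`, `step_dev`) are chosen for this illustration.

```python
classes = [(5, 15), (15, 25), (25, 35), (35, 45), (45, 55), (55, 65)]
f = [8, 10, 14, 20, 16, 12]
A, c = 30, 10                     # assumed mean and class width from the example

cm = [(lo + hi) / 2 for lo, hi in classes]    # class marks 10, 20, ..., 60
n = sum(f)                                    # total frequency

# i. Direct method: sum(f * cm) / n
direct = sum(fi * m for fi, m in zip(f, cm)) / n
# ii. Deviations from the assumed mean: A + sum(f * d) / n, with d = cm - A
short_cut = A + sum(fi * (m - A) for fi, m in zip(f, cm)) / n
# iii. Step deviation method: A + (sum(f * d') / n) * c, with d' = d / c
step_dev = A + sum(fi * (m - A) / c for fi, m in zip(f, cm)) / n * c

print(n, round(direct, 2), round(short_cut, 2), round(step_dev, 2))   # 80 37.75 37.75 37.75
```

All three methods give the same mean; the deviation methods only reduce the size of the numbers handled in hand calculation.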
Combined (Pooled) Arithmetic Mean
For n number of groups,
\[ \bar{x}_c = \frac{N_1 \bar{x}_1 + N_2 \bar{x}_2 + \dots + N_n \bar{x}_n}{N_1 + N_2 + \dots + N_n} = \frac{\sum_{i=1}^{n} N_i \bar{x}_i}{\sum_{i=1}^{n} N_i} \]
\[ \bar{x}_w = \frac{x_1 w_1 + x_2 w_2 + \dots + x_n w_n}{w_1 + w_2 + \dots + w_n} = \frac{\sum wx}{\sum w} \]
Example: An auto ride costs Birr 5 for the first km, Birr 4 for each of the next 3 km and Birr 9 for each
of the subsequent km. Find the average cost per km for 10 km.
Rate (Birr) x   Distance (km) w   xw
5.00            1                 5.00
4.00            3                 12.00
9.00            6                 54.00
Total           10                71.00
\[ \bar{x}_w = \frac{\sum xw}{\sum w} = \frac{71.00}{10} = 7.10 \text{ Birr} \]
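The weighted mean of the auto ride example can be checked as follows. This is a minimal sketch; `weighted_mean` is a helper name introduced here for the illustration.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w * x) / sum(w)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


rates = [5.00, 4.00, 9.00]   # Birr per km
distances = [1, 3, 6]        # km travelled at each rate (the weights)

print(weighted_mean(rates, distances))   # 7.1
```

A simple (unweighted) mean of the three rates would give 6.00 Birr, which is wrong here because the rates apply to different distances.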
For ungrouped data: Mode (x̂) = that value in the data set which occurs most often.
For grouped data: Discrete Series: Mode (x̂) = the value of the variable corresponding to the
maximum frequency.
Continuous Series: The class corresponding to the maximum frequency is called the modal
class. The value of the mode is obtained by the following interpolation formula.
\[ \text{Mode } (\hat{x}) = l + \left[ \frac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)} \right] \times c \]
or
\[ \text{Mode } (\hat{x}) = l + \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) \times c \]
where Δ1 = f1 − f0 and Δ2 = f1 − f2.
x 10 20 30 40 50 60
f 4 9 16 25 22 15
iii.
Classes 0-9 10 - 19 20 - 29 30 - 39 40 - 49 50 – 59 60 - 69 70 - 79
fi 328 350 720 664 598 524 378 244
Solution:
i. Mode = value which occurs most often
Mode = 25
ii. Mode = Value of the variable with maximum frequency
Mode = 40
iii. Modal Class = 19.5 - 29.5
l = 19.5 f0 = 350 f1 = 720 f2 = 664 c = 29.5 – 19.5 = 10
\[ \text{Mode } (\hat{x}) = l + \left[ \frac{f_1 - f_0}{(f_1 - f_0) + (f_1 - f_2)} \right] \times c = 19.5 + \left[ \frac{720 - 350}{(720 - 350) + (720 - 664)} \right] \times 10 = 19.5 + \frac{370}{426} \times 10 = 28.1854 \]
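The interpolation formula for the mode can be sketched in Python as below. The sketch assumes that l is the lower class boundary of the modal class and that the class width c is measured between class boundaries (29.5 − 19.5 = 10 for the class 20 – 29); `grouped_mode` is a helper name introduced for this illustration.

```python
def grouped_mode(limits, freqs, u=1):
    """Mode of a continuous series: l + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * c."""
    i = freqs.index(max(freqs))              # position of the modal class
    l = limits[i][0] - u / 2                 # lower class boundary of the modal class
    c = (limits[i][1] + u / 2) - l           # class width between the boundaries
    f1 = freqs[i]                            # frequency of the modal class
    f0 = freqs[i - 1] if i > 0 else 0        # frequency of the preceding class
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0   # frequency of the following class
    return l + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * c


limits = [(0, 9), (10, 19), (20, 29), (30, 39), (40, 49), (50, 59), (60, 69), (70, 79)]
freqs = [328, 350, 720, 664, 598, 524, 378, 244]

print(round(grouped_mode(limits, freqs), 4))
```

With these data the modal class is 20 – 29 (boundaries 19.5 – 29.5), so the mode lies between 19.5 and 29.5 and closer to the upper end, because the following class (664) is more frequent than the preceding one (350).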
C. MEDIAN
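It is the value which divides a given data set, arranged in order of magnitude, into two equal halves. It is the midpoint of the data.
Computation of Median for Ungrouped and Grouped Data
For ungrouped data:
First, rearrange the values in the order of magnitude. Then apply the following formula.
\[ \text{Median } (\tilde{x}) = \text{value of the } \left( \frac{n+1}{2} \right)^{th} \text{ item} = x_{\frac{n+1}{2}} \quad (\text{where } n \text{ is odd}) \]
\[ \text{Median } (\tilde{x}) = \frac{1}{2} \left[ \text{value of the } \left( \frac{n}{2} \right)^{th} \text{ item} + \text{value of the } \left( \frac{n}{2} + 1 \right)^{th} \text{ item} \right] = \frac{1}{2} \left[ x_{\frac{n}{2}} + x_{\frac{n}{2}+1} \right] \quad (\text{where } n \text{ is even}) \]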
For grouped data:
Discrete Series:
1. Compute the < cfi column.
2. Search for the value of (n + 1)/2; if it is not available, consider the value just greater than it in
the column of < cfi.
Continuous Series:
1. Compute the < cfi column.
2. Search for the value of n/2; if it is not available, consider the value just greater than it in the
column of < cfi.
3. Then the following interpolation formula is used to calculate the median.
\[ \text{Median } (\tilde{x}) = l + \frac{c}{f} \left( \frac{n}{2} - c.f \right) \]
where l - LCB of the median class
c - Class interval of the median class
f - Frequency of the median class
c.f - cumulative frequency just less than n/2 (the < cf of the class preceding the median class)
Example: Find the median of the following data.
i. a) 27, 33, 42, 25, 23, 25, 33, 28, 27, 16, 18, 12
b) 8, 5, 2, 6, 15, 10, 25
ii.
X 50 - 60 60 – 70 70 - 80 80 - 90 90 - 100 100 - 110
fi 20 21 50 40 53 16
Solution:
i. a. Rearranging:
12 16 18 23 25 25 27 27 28 33 33 42
n = 12 … even
\[ \tilde{x} = \frac{1}{2} \left[ x_{\frac{n}{2}} + x_{\frac{n}{2}+1} \right] = \frac{1}{2} [ x_6 + x_7 ] = \frac{1}{2} [ 25 + 27 ] = 26 \]
b. Rearranging: 2 5 6 8 10 15 25
n = 7 … odd
\[ \tilde{x} = x_{\frac{n+1}{2}} = x_4 = 8 \]
ii.
x 50 - 60 60 – 70 70 - 80 80 - 90 90 - 100 100 - 110
fi 20 21 50 40 53 16
<cfi 20 41 91 131 184 200
Median class = class containing the (n/2)th item = 100th item = 80 - 90
l = 79.5, c = 10, f = 40, c.f = 91
\[ \text{Median } (\tilde{x}) = l + \frac{c}{f} \left( \frac{n}{2} - c.f \right) = 79.5 + \frac{10}{40} ( 100 - 91 ) = 79.5 + \frac{9}{4} = 81.75 \]
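The median computations can be sketched in Python as follows. The grouped formula is applied with the values of l, c, f and c.f exactly as stated in the worked example; the function names `median_ungrouped` and `median_grouped` are introduced here only for illustration.

```python
def median_ungrouped(values):
    """Median of raw data: middle item if n is odd, mean of the two middle items if n is even."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                     # the ((n + 1) / 2)th item
    return (xs[mid - 1] + xs[mid]) / 2     # mean of the (n / 2)th and (n / 2 + 1)th items


def median_grouped(l, c, f, n, cf):
    """Median of a continuous series: l + (c / f) * (n / 2 - cf)."""
    return l + (c / f) * (n / 2 - cf)


print(median_ungrouped([27, 33, 42, 25, 23, 25, 33, 28, 27, 16, 18, 12]))   # 26.0
print(median_ungrouped([8, 5, 2, 6, 15, 10, 25]))                           # 8
print(median_grouped(l=79.5, c=10, f=40, n=200, cf=91))                     # 81.75
```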
Athlete A 6 8 12 25 15 16 18 3
Athlete B 6 22 28 30 17 10 14
Solution: Athlete - A
Xmax = 25, Xmin = 3
Hence,
Range = Xmax - Xmin
Range = 25 – 3 = 22
\[ \text{Coefficient of Range} = \frac{x_{max} - x_{min}}{x_{max} + x_{min}} = \frac{25 - 3}{25 + 3} = \frac{22}{28} = 0.786 \]
Athlete - B
Xmax = 30, Xmin = 6
Range = 30 – 6 = 24
\[ \text{Coefficient of Range} = \frac{x_{max} - x_{min}}{x_{max} + x_{min}} = \frac{30 - 6}{30 + 6} = \frac{24}{36} = 0.667 \]
Since the coefficient of range of Athlete B (0.667) is smaller than that of Athlete A (0.786),
we can conclude that Athlete B is more consistent.
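The comparison can be reproduced with a short Python sketch using the two athletes' figures from the example; `data_range` and `coefficient_of_range` are helper names chosen for this illustration.

```python
def data_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)


def coefficient_of_range(values):
    """Coefficient of range = (max - min) / (max + min)."""
    return (max(values) - min(values)) / (max(values) + min(values))


athlete_a = [6, 8, 12, 25, 15, 16, 18, 3]
athlete_b = [6, 22, 28, 30, 17, 10, 14]

print(data_range(athlete_a), round(coefficient_of_range(athlete_a), 3))   # 22 0.786
print(data_range(athlete_b), round(coefficient_of_range(athlete_b), 3))   # 24 0.667
```

Athlete B has the larger range in absolute terms (24 against 22) but the smaller coefficient, which is why the relative measure is used for the comparison.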
Continuous series
Example: Calculate range and its coefficient for the mark obtained by distance students in
statistics.
Mark 11-20 21-30 31-40 41-50 51-60 61-70
Number of Students 15 28 37 10 48 63
Class      Frequency   Midpoint
5 - 12     10          (5 + 12)/2 = 17/2 = 8.5
13 - 20    9           (13 + 20)/2 = 33/2 = 16.5
21 - 28    6           (21 + 28)/2 = 49/2 = 24.5
29 - 36    8           (29 + 36)/2 = 65/2 = 32.5
37 - 44    4           (37 + 44)/2 = 81/2 = 40.5
4        4 – 5 = -1     1
6        6 – 5 = 1      1
8        8 – 5 = 3      9
9        9 – 5 = 4      16
Total                   52
\[ \sigma = \sqrt{\frac{52}{6}} = 2.944 \]
A large standard deviation implies more variability in the scores, and a smaller standard deviation
implies that the scores are close to each other, or more similar.
VARIANCE
The term variance was introduced by R. A. Fisher in 1918. According to him, it is the square of the
standard deviation (σ²). Thus, symbolically, variance may be expressed as:
\[ V = \sigma^2 = \frac{\sum (x - \bar{x})^2}{N} \]
Where, V = variance
In the previous example, the variance is 52/6 = 8.667.
When a frequency distribution is given, the formulas include the frequency of each observation, so that:
σ = √[ Σf(x − x̄)² / N ] and V = Σf(x − x̄)² / N, where N = Σf
Example,
Value (X)    2    5    8    11
Frequency    4    3    10   3
Solution
x̄ = ΣXifi / N = (8 + 15 + 80 + 33)/20 = 136/20 = 6.8
Xi     fi    Xifi    Xi - x̄               (Xi - x̄)²    fi(Xi - x̄)²
2      4     8       2 – 6.8 = -4.8       23.04        92.16
5      3     15      5 – 6.8 = -1.8       3.24         9.72
8      10    80      8 – 6.8 = 1.2        1.44         14.4
11     3     33      11 – 6.8 = 4.2       17.64        52.92
Total  20    136                                       169.2
σ = √[ Σf(x − x̄)² / N ] = √(169.2/20) = √8.46 = 2.91
V = Σf(x − x̄)² / N = 169.2/20 = 8.46
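The frequency-weighted mean, variance and standard deviation in this example can be verified with a few lines of Python. This is a sketch only; the variable names are illustrative, and the divisor is N, as in the formula used in this example.

```python
import math

values = [2, 5, 8, 11]
freqs = [4, 3, 10, 3]

n = sum(freqs)                                        # N = sum of frequencies
mean = sum(x * f for x, f in zip(values, freqs)) / n  # weighted mean
# Sum of f * (x - mean)^2, divided by N
variance = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / n
std_dev = math.sqrt(variance)

print(n, round(mean, 2), round(variance, 2), round(std_dev, 2))
```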
When the distribution is given in terms of class intervals, take the midpoint of each class and calculate the standard deviation and variance using the midpoints.
E.g. consider the following data
Class      Frequency
1 – 3      4
4 – 6      3
7 – 9      5
10 – 12    2
Total      14
Solution
Class      fi    mi    mi²    fi mi    fi mi²    mi - x̄          (mi - x̄)²
1 – 3      4     2     4      8        16        2 – 6 = -4      16
4 – 6      3     5     25     15       75        5 – 6 = -1      1
7 – 9      5     8     64     40       320       8 – 6 = 2       4
10 – 12    2     11    121    22       242       11 – 6 = 5      25
Total      14                 85       653
x̄ = Σfi mi / N = 85/14 = 6.07 ≈ 6
σ = √{ [N Σfi mi² − (Σfi mi)²] / [N(N − 1)] } = √{ [14(653) − (85)²] / [14(13)] } = √[(9,142 − 7,225)/182] = √(1,917/182) = √10.533 = 3.25
V = [N Σfi mi² − (Σfi mi)²] / [N(N − 1)] = 1,917/182 = 10.533
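The grouped-data calculation can be checked in the same way. The sketch below follows the shortcut formula exactly as written in this example, with N(N − 1) in the denominator; the variable names are illustrative.

```python
import math

midpoints = [2, 5, 8, 11]   # midpoints of classes 1-3, 4-6, 7-9, 10-12
freqs = [4, 3, 5, 2]

n = sum(freqs)
sum_fm = sum(f * m for f, m in zip(freqs, midpoints))        # sum of f * m
sum_fm2 = sum(f * m * m for f, m in zip(freqs, midpoints))   # sum of f * m^2

numerator = n * sum_fm2 - sum_fm ** 2
variance = numerator / (n * (n - 1))
std_dev = math.sqrt(variance)

print(sum_fm, sum_fm2, numerator, round(variance, 3), round(std_dev, 2))
```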
CHAPTER – 2
Probability and Probability Distribution
Definition of Probability
It is a numeric value representing the chance or possibility that a particular event will occur.
It is a mathematical means of studying uncertainty and variability.
It is a numerical measure of the chance or likelihood that a particular event will occur.
For example: what is the probability that a project will be completed on time? Another example is the outcome of tossing a coin.
Important Concepts
A probability near one indicates that the event is very likely to occur.
As a probability goes from zero to one, the chance of occurrence increases.
As a probability goes from one to zero, the chance of occurrence decreases.
Example: if we toss a coin once, the probability of getting both a head and a tail is zero, but the probability of getting either a head or a tail is 1 (certain).
P (H n T) = 0 Impossible
P (H u T) = 1 Certain
Probability can be expressed as a decimal, a fraction or a percentage.
Basic Terminology in Probability
1. Event: - is a possible outcome of an experiment, or a collection of one or more outcomes of an experiment. E.g. if the experiment is tossing a coin, the events are head and tail.
2. Outcome: - is a particular result of an experiment. E.g. in tossing a coin, if the head faces up we consider head as the outcome of the experiment.
3. Experiment: - is any process or activity that generates well-defined outcomes, that is, a process that leads to the occurrence of one and only one of all the possible observations.
E.g. tossing a coin, or answering a question where the answer can be correct or incorrect.
Each performance of an experiment is called a trial.
If ‘k’ is the number of trials in the sequence and ‘n’ is the number of outcomes of a single trial, the total number of outcomes is S = n^k
E.g. a single toss of a coin has the possible outcomes (H, T).
If the coin is tossed two times, the number of outcomes will be:
n = 2, k = 2
n^k = 2^2 = 4
(H, H) (H, T) (T, H) (T, T)
4. Sample space: - is the set of all possible outcomes of an experiment.
5. Dependent events: - if the occurrence of one event affects the probability of the occurrence of the other, then the two events are said to be dependent events.
6. Independent events: - when the occurrence of one event does not affect the probability of the other event, then the two events are said to be independent events.
7. Complement of an event: - is the collection of all sample points that are contained in the sample space (universe) but not in the event.
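The counting rule for the number of outcomes in a sequence of trials can be illustrated by listing the sample space explicitly. The sketch below uses Python's `itertools.product`; the function name `sample_space` is illustrative.

```python
from itertools import product


def sample_space(outcomes, k):
    """List every sequence of k trials, each with the given possible outcomes."""
    return list(product(outcomes, repeat=k))


two_tosses = sample_space("HT", 2)
print(two_tosses)        # the four outcomes of tossing a coin twice
print(len(two_tosses))   # equals n^k with n = 2 and k = 2
```

Calling `sample_space("HT", 3)` lists the outcomes of three tosses in the same way.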
Approaches of Probability
There are three different approaches that can be used to study probability theory.
a. Classical approach
b. Relative frequency approach
c. Subjective approach
1. Classical approach
It is probability based on the assumption that the outcomes of an experiment are equally likely, that is, outcomes with the same chance of occurrence.
The probability of an event happening is computed by:
Probability of an event = Number of favourable outcomes / Total number of possible outcomes
2. Relative frequency approach
E.g. When we say that the probability of obtaining a head when we toss a coin is 0.5, we are saying that, if we repeatedly toss the coin an indefinitely large number of times, we will obtain a head in 50% of the repetitions.
In terms of a formula:
Probability of an event happening = Number of times the event occurred in the past / Total number of observations
E.g. 1. If a truck operator experienced 5 accidents among 50 trucks last year, then the probability that a truck will have an accident next year can be estimated as 5/50 = 0.10
E.g. 2. Suppose that 800 out of 50,000 houses covered by fire insurance had a fire. What would the fire insurance company take as:
a. the probability of a fire = 800/50,000 = 0.016
b. the probability of no fire = (50,000 – 800)/50,000 = 49,200/50,000 = 0.984
Any probability that can be estimated by the relative frequency approach is called an objective probability. The term objective means that any two persons should agree on the probability value, as it is determined by objective evidence. They could both repeat the same experiment many times and, at least in the long run, observe practically the same relative frequency.
3. Subjective Probability
When there is little or no past experience on which to base a probability, personal judgment, experience, intuition, expertise or any other subjective evaluation criterion is applied to estimate or assign the probability. Such a probability is a subjective probability.
Subjective probability is personal in the sense that it expresses only a person’s degree of belief. Thus, different people can be expected to assign different probabilities to the same event, even if they use the same techniques and have the same knowledge and information.
It is also called personal probability. Unlike an objective probability, one person’s subjective probability may well differ from another person’s subjective probability of the same event.
E.g. A physician assessing the probability of a patient’s recovery and an expert in the national bank assessing the probability of a currency devaluation are both making a personal judgment based on what they know and feel about the situation. Another group of physicians or experts may arrive at a different probability, even though both employ identical techniques and information.
Both classical and relative frequency probabilities are objective in the sense that no personal judgment is involved.
Whatever the kind of probability involved (subjective or objective), the same set of mathematical rules holds for manipulating and analyzing probabilities.
Rules of Probability
The Addition Rule
A. Addition Rule for Two Events That Are Not Mutually Exclusive
This is applicable when the outcomes of the experiment may not be mutually exclusive. Let A and B be events; then the probability that either A or B will occur is:
P (A U B) = P (A) + P (B) - P (A n B)
Example 1: Suppose that the Ethiopian Tourist Commission selected a sample of 200 tourists who visited the state during the year. The survey revealed that 120 tourists went to Axum and 100 went to Wondo Genet, and the investigator found that 60 tourists visited both Axum and Wondo Genet. What is the probability that a person selected visited either Axum or Wondo Genet?
Solution
Simply adding the two probabilities counts the 60 tourists who visited both places twice and gives a value greater than 1, which cannot be a probability:
P (A) + P (W) = 120/200 + 100/200 = 0.60 + 0.50 = 1.10
Subtracting the intersection gives the correct answer:
P (A u W) = P (A) + P (W) – P (A n W)
= 120/200 + 100/200 − 60/200 = 0.60 + 0.50 − 0.30 = 1.10 − 0.30 = 0.80
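The addition rule for the tourist example can be written as a small function working directly on the counts. This is a sketch; `prob_union` is an illustrative name.

```python
def prob_union(n_a, n_b, n_both, total):
    """P(A or B) = P(A) + P(B) - P(A and B), computed from counts."""
    return (n_a + n_b - n_both) / total


# 120 visited Axum, 100 visited Wondo Genet, 60 visited both, 200 tourists in all
p_axum_or_wondo = prob_union(120, 100, 60, 200)
print(p_axum_or_wondo)
```

Passing `n_both=0` gives the special case for mutually exclusive events.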
Example 2: Consider the following data on students’ performance
Test result     Higher (H)   Average (A)   Low (L)   Total
Qualify (Q)     150          90            60        300
Fail (F)        40           30            30        100
Total           190          120           90        400
Required: find the following probabilities
a. P (A)
b. P (Q)
c. P (H)
d. P (FH)
e. P (Q or H)
f. P (H or F)
g. P (FL)
h. P (F u L)
i. P (H or A)
j. P (Q or L)
k. P (Q or F)
AkU, CBE Department of Accounting Short Note
on Sampling & Sampling Distributions
Solution
a. P (A) = Total number of average / Total number of students = 120/400 = 0.30
b. P (Q) = Total number of qualified / Total number of students = 300/400 = 0.75
c. P (H) = Total number of higher / Total number of students = 190/400 = 0.475
d. P (FH): the symbol FH means both F and H. Notice that the intersection of F and H is the same as the intersection of H and F; hence P (FH) = P (HF). Therefore P (FH) = 40/400 = 0.10
e. P (Q or H): the number of students in the event Q or H is the number of Q’s (300) plus the number of H’s (190) minus the number in the intersection QH (150). The number in the intersection is subtracted because it has been counted twice, once with the Q’s and once with the H’s.
P (Q or H) = (300 + 190 − 150)/400 = (490 − 150)/400 = 340/400 = 0.85
f. P (H or F) = (190 + 100 − 40)/400 = (290 − 40)/400 = 250/400 = 0.625
g. P (FL) = Number in the intersection of F and L / Total number of students = 30/400 = 0.075
h. P (F or L) = (100 + 90 − 30)/400 = (190 − 30)/400 = 160/400 = 0.40
i. P (H or A) = (190 + 120 − 0)/400 = 310/400 = 0.775
j. P (Q or L) = (300 + 90 − 60)/400 = (390 − 60)/400 = 330/400 = 0.825
k. P (Q or F) = (300 + 100 − 0)/400 = 400/400 = 1
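These answers can be cross-checked by counting cells of the table directly rather than applying the addition rule, which is a different route to the same numbers. The sketch below stores the six cell counts in a dictionary; the helper `prob` and its predicate argument are illustrative choices.

```python
# Cell counts keyed by (test result, performance level)
counts = {
    ("Q", "H"): 150, ("Q", "A"): 90, ("Q", "L"): 60,
    ("F", "H"): 40,  ("F", "A"): 30, ("F", "L"): 30,
}
total = sum(counts.values())


def prob(event):
    """Probability of the cells for which event(result, level) is true."""
    return sum(c for key, c in counts.items() if event(*key)) / total


p_q_or_h = prob(lambda result, level: result == "Q" or level == "H")
p_f_and_l = prob(lambda result, level: result == "F" and level == "L")
p_h_or_a = prob(lambda result, level: level in ("H", "A"))
p_q_or_l = prob(lambda result, level: result == "Q" or level == "L")

print(total, p_q_or_h, p_f_and_l, p_h_or_a, p_q_or_l)
```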
B. Addition Rule for Two Mutually Exclusive Events
Two events are said to be mutually exclusive if they have no sample space outcomes in common. In this case the events A and B cannot occur simultaneously, and thus P (A n B) = 0
Let A and B be mutually exclusive events; then the probability that either A or B will occur is
P (A U B) = P (A) + P (B)
P (A or B) = P (A) + P (B)
Example: Consider the following data:
Weight          Event    Number of Packages
Under weight    A        100
Satisfactory    B        3600
Over weight     C        300
Total                    4000
Required
a. What is the probability that a particular package will be either underweight or overweight?
b. What is the probability of the complement, P (A + C)´?
Solution
a. P (A or C) = P (A) + P (C)
= 100/4000 + 300/4000 = 0.025 + 0.075 = 0.10
b. P (A + C)´ = 3600/4000 = 0.90 OR
P (A + C) + P (A + C)´ = 1
0.10 + P (A + C)´ = 1
P (A + C)´ = 1 – 0.10 = 0.90
Example - Consider randomly selecting a card from a standard deck of 52 playing cards and define the events: J, the randomly drawn card is a Jack; Q, the randomly drawn card is a Queen; and K, the randomly drawn card is a King. There are 4 Jacks, 4 Queens and 4 Kings in the deck, so
P(J) = 4/52, P(Q) = 4/52, P(K) = 4/52
Since there is no card that is both a Jack and a Queen, the events J and Q are mutually exclusive and thus P(JnQ) = 0. It follows that the probability that the randomly selected card is either a Jack or a Queen is:
P(JUQ) = P(J) + P(Q) = 4/52 + 4/52 = 8/52 = 2/13
Example. A contractor is bidding for two projects, one with Co. A and one with Co. B. The probability of obtaining the project with Co. A is 0.45, and the probability of obtaining the project with Co. B, given that the project with Co. A was won, is 0.90. What are the contractor’s chances of getting both projects?
Solution:
P (A) = 0.45
P (B/A) = 0.90, and we are looking for P (AnB), which is the probability that both A and B will occur. From the multiplication rule we have
P (AnB) = P (B/A) P (A) = 0.9 x 0.45 = 0.405
Example 1. An electronic device has four independent components C1, C2, C3, C4, each with a reliability of 0.85. The device works only if all four components are functional. What is the probability that the device will work when needed?
P (the device will work) = P (all components will work) = P (C1 n C2 n C3 n C4)
= P (C1) P (C2) P (C3) P (C4)
= 0.85 x 0.85 x 0.85 x 0.85 = 0.522
Example 2. The rate of defects in wine corks is 0.75. Assuming independence, if four bottles are opened (B1, B2, B3, and B4), what is the probability that all four corks are defective?
P (all 4 are defective) = P (B1 n B2 n B3 n B4) = P (B1) P (B2) P (B3) P (B4)
= 0.75 x 0.75 x 0.75 x 0.75 = 0.316
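For independent events the joint probability is simply the product of the individual probabilities, so both examples reduce to one multiplication. A minimal sketch, assuming Python 3.8 or later for `math.prod`; the function name `prob_all` is illustrative.

```python
import math


def prob_all(probabilities):
    """Probability that all of several independent events occur."""
    return math.prod(probabilities)


p_device_works = prob_all([0.85] * 4)     # four independent components
p_all_defective = prob_all([0.75] * 4)    # four independent corks

print(round(p_device_works, 3), round(p_all_defective, 3))
```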
Probability Tree
It is a graphic device that depicts all possible outcomes of sequential experiments or trials, along with the probability of each elementary event or outcome.
Example: A student takes a quiz that consists of three true-or-false questions. If we consider our experiment to be answering the three questions, each question can be answered correctly or incorrectly.
Solution
n = 2, k = 3, S = n^k = 2^3 = 8
Let C denote answering a question correctly and I denote answering a question incorrectly. Then we can depict a tree diagram of the sample space outcomes for the experiment.
1st question    2nd question    3rd question    Outcome
C               C               C               CCC
                                I               CCI
                I               C               CIC
                                I               CII
I               C               C               ICC
                                I               ICI
                I               C               IIC
                                I               III
This diagram portrays the experiment as a three-step process
Step I – answering the 1st question (correctly or incorrectly) (C or I)
Step II – answering the 2nd question (correctly or incorrectly).
Step III – answering the 3rd question (correctly or incorrectly).
The tree diagram has eight different branches, and the eight distinct sample space outcomes are listed at the end of the branches. We see the sample space is
CCC CCI CIC CII
ICC ICI IIC III
Now suppose that the student was totally unprepared for the test and has to blindly guess the answer to each question, that is, the student has a 50-50 chance (0.5 probability) of correctly answering each question. This means that each of the eight sample space outcomes is equally likely to occur, i.e. P (CCC) = P (CCI) = ... = P (III) = 1/8
Here also the sum of the probabilities of the sample space outcomes is one.
In general, the sum of the probabilities of all the sample space outcomes is equal to 1.
Random Variable
A random variable provides a means of assigning numerical values to experimental outcomes.
A random variable is a variable whose values are determined by chance. Or,
A random variable is a numerical description of the outcome of an experiment.
Notation: Random variables are usually denoted by capital letters like X, Y, Z, etc.
Example: Suppose you take a 50-question multiple-choice exam, guessing every answer, and we are interested in the number of correct answers obtained.
a. What is the random variable? The number of correct answers, because it is the quantity of interest and its value is determined by chance.
b. What values can this random variable take? Any whole number 0, 1, 2, 3, ..., 50; therefore 0 ≤ X ≤ 50
Types of Random Variables
Depending upon the numerical values it can assume, a random variable can be classified into two major
divisions.
A) Discrete Random Variable: is a random variable that may assume either a finite number of values or an infinite sequence of values (e.g. 1, 2, 3, …). In general, a discrete random variable takes whole number values, which can be counted or enumerated.
Example: - The number of students who are enrolled in a degree program at JJU
- The number of employees absent on a given day
- The number of defective products produced in a factory in a given month
B) Continuous Random Variable: is a random variable which may take on all values in a certain interval or collection of intervals. It can take any numerical value within a continuous scale or within some range. A continuous random variable, as the name implies, assumes all possible values between any two values.
Example: - The distance between two cities
- The weight of a person
- The rate of return on investment
- The time that a customer must wait to receive his change
PROBABILITY DISTRIBUTION
A probability distribution is a list of all possible outcomes that may result from an experiment and the probability associated with each outcome.
A probability distribution is a correspondence which assigns probabilities to the values of a random variable.
Example: Suppose we are interested in the number of heads showing face up on three tosses of a coin.
Solution
n = 2 outcomes per toss and k = 3 tosses, so the number of possible outcomes of the experiment is S = n^k = 2^3 = 8
Possible outcome    1st toss    2nd toss    3rd toss    Number of heads
1                   T           T           T           0
2                   T           T           H           1
3                   T           H           H           2
4                   T           H           T           1
5                   H           T           T           1
6                   H           H           T           2
7                   H           T           H           2
8                   H           H           H           3
TYPES OF PROBABILITY DISTRIBUTION
A probability distribution can be classified as discrete or continuous according to whether it describes a discrete or a continuous random variable.
A. Discrete Probability Distribution
The probability distribution of a discrete random variable is a list of all the outcomes of an experiment and the probability associated with each outcome. It is a table, graph or formula that gives the probability associated with each possible value that the random variable can assume.
In the construction of the probability distribution for a discrete random variable, the following two conditions must be satisfied.
Properties (Required Conditions) for a Discrete Probability Distribution
1. The sum of the probabilities of all the events in the sample space must equal 1, i.e. ΣP(x) = 1
2. The probability of each event in the sample space must be between 0 and 1 inclusive, i.e. 0 ≤ P(x) ≤ 1
Example: Determine the probability distribution for the number of heads in two tosses of a coin.
Solution
n = 2, k = 2, S = n^k = 2^2 = 4
T T = 0 heads
T H = 1 head
H T = 1 head
H H = 2 heads
The possible numbers of heads are 0, 1, 2
0 heads occurs 1 time, 1 head occurs 2 times, 2 heads occurs 1 time
So P (0) = 1/4 = 0.25, since P (TT) = P (T) P (T) = 0.50 x 0.50 = 0.25
P (1) = 2/4 = 0.50, since P (TH) or P (HT) = P (T) x P (H) + P (H) x P (T)
= 0.50 x 0.50 + 0.50 x 0.50
= 0.25 + 0.25 = 0.50
P (2) = 1/4 = 0.25, since P (HH) = P (H) P (H) = 0.50 x 0.50 = 0.25
Probability distribution
Random variable X    Probability P(X)
0                    1/4 = 0.25
1                    2/4 = 0.50
2                    1/4 = 0.25
Total                1.00
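The distribution of the number of heads can be built by listing every equally likely outcome and counting heads. The sketch below assumes a fair coin, as in the example; `heads_distribution` is an illustrative name, and exact fractions are used so that the two required conditions can be checked without rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def heads_distribution(k):
    """Distribution of the number of heads in k tosses of a fair coin."""
    outcomes = list(product("HT", repeat=k))              # all n^k sequences
    counts = Counter(seq.count("H") for seq in outcomes)  # heads per sequence
    return {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}


dist = heads_distribution(2)
for x, p in dist.items():
    print(x, p, float(p))

# The two required conditions for a discrete probability distribution
conditions_hold = sum(dist.values()) == 1 and all(0 <= p <= 1 for p in dist.values())
print(conditions_hold)
```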
[Figure: bell-shaped normal curve centred on the mean]
The shape and position of the normal distribution curve depends on two parameters, the mean and the
standard deviation. Each normally distributed variable will have its own normal distribution curve.
[Figure: normal curve with μ = 24 years, showing areas of 34.13% on each side of the mean]
Curve “A” is very tall and narrow because it has a smaller standard deviation than curve “B” and curve “C”. A small standard deviation means there is little variation (the values are close together). Curve “C” is short (low) and wide (broad) because it has a larger standard deviation than curves A and B.
Generally, the shape of the curve is determined by the standard deviation. The smaller the standard deviation, the taller and narrower the curve will be; the larger the standard deviation, the flatter and wider the curve will be.
b. Different means but equal standard deviation. All three sections have an equal standard deviation of 3.1 but different means: S1 = 23, S2 = 26, S3 = 28
[Figure: three normal curves, each with δ = 3.1, centred at μ = 23, μ = 26 and μ = 28]
c. Different means and different standard deviations
For S1, μ = 22 and δ = 2.8; for S2, μ = 24 and δ = 2.1; for S3, μ = 27 and δ = 3.1
[Figure: three normal curves with δ = 2.8, δ = 2.1 and δ = 3.1]
At X = 98, Z = (X − μ)/δ = (98 − 100)/5 = −2/5 = −0.4
At X = 100, Z = (X − μ)/δ = (100 − 100)/5 = 0/5 = 0
At X = 101, Z = (X − μ)/δ = (101 − 100)/5 = 1/5 = 0.2
At X = 105, Z = (X − μ)/δ = (105 − 100)/5 = 5/5 = 1
e. What is the probability that a worker selected at random will take fewer than 580 hours to complete the program?
f. What is the probability that a candidate chosen at random will take between 420 and 570 hours to complete the program?
Solution
a. Given δ = 100, μ = 500, X > 500, Z = ?
Z = (X − μ)/δ = (500 − 500)/100 = 0/100 = 0
P (0 to ∞) = 0.5
[Figure: normal curve with X = 500 at Z = 0]
b. Given δ = 100, μ = 500, X1 = 500 and X2 = 650
At X1 = 500, Z1 = (X1 − μ)/δ = (500 − 500)/100 = 0/100 = 0
At X2 = 650, Z2 = (X2 − μ)/δ = (650 − 500)/100 = 150/100 = 1.50
P (0 to 1.50) = 0.4332 from the table of normal distribution
[Figure: normal curve, X from 500 to 650, Z from 0 to 1.50]
[Figure: normal curve, X from 500 to 700, Z from 0 to 2]
d. Given δ = 100, μ = 500, X1 = 550 and X2 = 650
At X1 = 550, Z1 = (X1 − μ)/δ = (550 − 500)/100 = 50/100 = 0.50
At X2 = 650, Z2 = (X2 − μ)/δ = (650 − 500)/100 = 150/100 = 1.50
P (0.50 to 1.50) = P (0 to 1.50) – P (0 to 0.50)
= 0.4332 – 0.1915
= 0.2417
[Figure: normal curve, X from 500 to 580, Z from 0 to 0.80]
f. Given δ = 100, μ = 500, X1 = 420 and X2 = 570
At X1 = 420, Z1 = (X1 − μ)/δ = (420 − 500)/100 = −80/100 = −0.80
At X2 = 570, Z2 = (X2 − μ)/δ = (570 − 500)/100 = 70/100 = 0.70
P (-0.80 to 0.70) = P (-0.80 to 0) + P (0 to 0.70)
= 0.2881 + 0.2580
= 0.5461
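The table look-ups in parts b, d and f can be cross-checked with Python's standard library. This sketch assumes Python 3.8 or later, where `statistics.NormalDist` is available; `prob_between` is an illustrative helper. The computed values agree with the four-decimal table values to about three decimal places, because the table results are rounded.

```python
from statistics import NormalDist

hours = NormalDist(mu=500, sigma=100)   # μ = 500, δ = 100 from the example


def prob_between(dist, lo, hi):
    """P(lo < X < hi) for a normally distributed X."""
    return dist.cdf(hi) - dist.cdf(lo)


p_b = prob_between(hours, 500, 650)   # part b
p_d = prob_between(hours, 550, 650)   # part d
p_f = prob_between(hours, 420, 570)   # part f

print(round(p_b, 4), round(p_d, 4), round(p_f, 4))
```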
b. About 95% of the batteries failed between what two values?
Solution
a. Given δ = 1.2, μ = 19
If the percentage is 68%, then the interval is μ ± 1δ
μ - 1δ = 19 – 1(1.2) = 19 - 1.2 = 17.80
μ + 1δ = 19 + 1(1.2) = 19 + 1.2 = 20.20
b. Given δ = 1.2, μ = 19
If the percentage is 95%, then the interval is μ ± 2δ
μ - 2δ = 19 – 2(1.2) = 19 - 2.4 = 16.60
μ + 2δ = 19 + 2(1.2) = 19 + 2.4 = 21.40
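The two intervals follow the same pattern, μ ± kδ, with k = 1 for about 68% and k = 2 for about 95%. A minimal sketch; `interval_around_mean` is an illustrative name.

```python
def interval_around_mean(mu, sd, k):
    """Return (mu - k*sd, mu + k*sd)."""
    return mu - k * sd, mu + k * sd


low_68, high_68 = interval_around_mean(19, 1.2, 1)
low_95, high_95 = interval_around_mean(19, 1.2, 2)

print(round(low_68, 2), round(high_68, 2))
print(round(low_95, 2), round(high_95, 2))
```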
CHAPTER - THREE
SAMPLING AND SAMPLING DISTRIBUTIONS
3.1. INTRODUCTION
Sampling is as common and important in statistics as salt is in food. At home, a cook tastes one teaspoonful to judge the quality of what is being cooked. In medical sciences, a few drops of blood are taken and tested microscopically or chemically to find out whether the blood contains any abnormalities. Nowadays, sampling methods are extensively used in socio-economic surveys to learn the living conditions, cost of living index, etc. of a class of people. In biological studies, experiments are conducted on some units (persons, animals or plants) and inferences are drawn about the breed or variety to which the units belong. In industry, sampling procedures are predominantly used for quality control.
Sampling theory is the study of the relationships existing between a population and samples drawn from the population.
3.2. SOME CONCEPTS ASSOCIATED WITH SAMPLING
Population or study population: - the individuals, groups, communities or societies from which you select a few in order to find answers to your research questions. The population size is usually denoted by the letter N.
Sample: - the small group from whom you collect the required information to estimate the prevalence of the issue.
Statistic: - a measurable characteristic value of the sample.
Sampling: - is the process of selecting a few (a sample) from a bigger group (the sampling population) to become the basis for estimating or predicting a fact, situation or outcome regarding the bigger group.
Sampling frame: - a list identifying each unit in the study population.
Parameter: - a measurable characteristic value of the population. It is a population result.
Sampling design: - a definite plan for obtaining a sample from the sampling frame. It refers to the technique or procedure that one would adopt in selecting the sampling units from which inferences about the population are drawn. The sampling design is determined before any data are collected. Sampling techniques are divided into two: Random Sampling and Non-Random Sampling.
Sampling error: - is the difference between a sample statistic and its corresponding population parameter.
Population distribution: - is the distribution of the individual measurements of a population.
Sampling distribution: - is the probability distribution of a sample statistic.
thorough investigation of missing, wrong, or suspicious information, better supervision, and better
processing than is possible with complete coverage.
3.4 Types of Sampling Techniques
Sampling technique refers to the method of selecting a sample from the universe (population). The right
type of sampling technique is of paramount importance in the execution of a sample survey in accordance
with the objectives and the scope of the inquiry. The sampling methods may be classified:
1. Random/Probability sampling
2. Non – Random/Non-probability sampling
1. RANDOM (PROBABILITY) SAMPLING
Random sampling method is a method of selection of a sample such that each item within the population
has equal chance of being selected.
In this method, there is no place for investigator’s bias in sample selection since it depends on probability.
It provides more accurate estimates in the sense of greater precision. There are three commonly used types
of probability sampling.
a. Simple Random Sampling Method (SRSM): - involves a very simple method of drawing a sample from a given population. The selection of samples is random in character. The oldest method adopted in simple random sampling is the lottery system.
Suppose the population size is 100 and the sample size is 10, i.e. N = 100 and n = 10. One hundred chits would be prepared bearing the serial numbers of the units in the universe. These chits would be put together and shuffled thoroughly, and then ten would be drawn one by one. The sampling units corresponding to the numbers on the selected chits will form a random sample. This method gives a sample which is quite independent of the nature of the universe. This method is commonly in practice even at present.
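The lottery procedure can be imitated on a computer. The sketch below uses Python's `random.sample`, which draws without replacement, as a stand-in for shuffling and drawing chits; the function name `lottery_sample` and the optional `seed` argument (used only to make a run repeatable) are assumptions made for illustration.

```python
import random


def lottery_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct serial numbers from 1..population_size."""
    rng = random.Random(seed)
    return rng.sample(range(1, population_size + 1), sample_size)


chosen = lottery_sample(population_size=100, sample_size=10, seed=1)
print(sorted(chosen))   # serial numbers of the ten selected units
```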
b. Stratified Random Sampling Method (STRSM)
Under this method, the whole population is divided into a number of homogeneous groups or strata. From each of these strata, a random sample of a given size is selected. Thus, stratified RS means selecting a number of random samples, one from each stratum of the universe. It is used when each group has small variation within itself but there is wide variation between the groups.
The sample may be either proportionate or disproportionate. With proportionate stratified sampling, the number of elements selected from each stratum is in proportion to the stratum's share of the total population, whereas in disproportionate stratified sampling, consideration is not given to the size of the stratum.
Suppose the universe is divided into two groups consisting of 100 and 160 units respectively, and the sample from each group is 10% of the group. This means a sample of size 10 + 16 = 26 is drawn in proportion to the number of items in each group. But in disproportionate stratified RSM, samples are taken from each stratum regardless of the number of units in it. Thus in the above example, an equal number of units, i.e. 13 from each stratum, may be drawn, so that the total number of items in the sample is again 26.
- The number of sample items which must be selected from the ith stratum is denoted by ni and is given by
ni = n Ni / N
Where n – Sample size
N – Population size
Ni – Size of the ith stratum
Example: In AKU a survey is to be conducted on 120 students’ tendency towards Accounting and Finance. The total number of students in each field is as indicated below. Give the sample size of each field of study.
Ni                 N1           N2             N3      N4           N5
Field of study     Economics    Procurement    Law     Marketing    Management
No. of students    3000         2000           1500    2500         1000
Solution: n = 120
N1 = 3000, N2 = 2000, N3 = 1500, N4 = 2500, N5 = 1000
N = N1 + N2 + N3 + N4 + N5 = 10,000
Then, n1 = nN1/N = (120 x 3000)/10,000 = 36 and n2 = nN2/N = (12/1000) x 2000 = 24
Or, since n/N = 120/10,000 = 12/1000 = 0.012, n1 = (n/N) x N1 = 0.012 x 3000 = 36
n3 = (12/1000) x 1500 = 18, n4 = (12/1000) x 2500 = 30, n5 = (12/1000) x 1000 = 12
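Proportional allocation applies the same ratio n/N to every stratum, which is easy to express as a function. This is a sketch with an illustrative function name; in this example every result is a whole number, but in general the values would need rounding, which is not handled here.

```python
def proportional_allocation(strata_sizes, n):
    """Sample size per stratum: n_i = n * N_i / N."""
    total = sum(strata_sizes.values())
    return {name: n * size / total for name, size in strata_sizes.items()}


strata = {
    "Economics": 3000,
    "Procurement": 2000,
    "Law": 1500,
    "Marketing": 2500,
    "Management": 1000,
}
allocation = proportional_allocation(strata, n=120)
print(allocation)
print(sum(allocation.values()))   # should add back up to the sample size
```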
In general, the ith element of the sample is the [(N/n) i + w]th item, where 0 ≤ w ≤ n – 1
Or we can have an alternative method,
Ai = A1 + (i – 1) K
Where: A1 – the random starting point or the first sample item.
Ai – the ith item in the sample
K – the sampling interval, K = N/n
Example 1: - From the files of 24 cases of the federal high court, only 4 are to be examined. The fifth file was selected randomly as the starting point. Indicate the remaining three elements of the sample.
Solution: - N = 24, n = 4, A1 = 5
K = N/n = 24/4 = 6
Then A2 = A1 + (2 – 1) K
= A1 + K = 5 + 6 = 11. The 11th file is the second element.
A3 = A1 + (3 – 1) K
= A1 + 2K = 5 + 2 (6) = 17. The 17th file is the third element.
A4 = A1 + (4 – 1) K
= A1 + 3K = 5 + 3 (6) = 23. The 23rd file is the fourth element.
Example 2: - If the 4th and 12th elements of a systematic sample are the 70th and 126th items of the population respectively, which item of the population is the first element of this systematic sample?
Solution: - A4 = 70, A12 = 126, A1 = ?
A4 = 70 = A1 + 3K
A12 = 126 = A1 + 11K
Subtracting the first equation from the second: 56 = 8K, so K = 7
Then A4 = A1 + 3K, so 70 = A1 + 3(7) and A1 = 70 – 21 = 49
The 49th item of the population is the random starting point for the systematic sample.
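Both systematic sampling examples can be reproduced with the formula Ai = A1 + (i – 1)K. In the sketch below the function names are illustrative, and integer division is used on the assumption that N is an exact multiple of n, as it is in Example 1.

```python
def systematic_sample(start, population_size, sample_size):
    """Items A_i = A_1 + (i - 1) * K with K = N / n (assumed to divide exactly)."""
    k = population_size // sample_size
    return [start + i * k for i in range(sample_size)]


def start_and_interval(i, a_i, j, a_j):
    """Recover (A_1, K) from two known elements A_i and A_j of the sample."""
    k = (a_j - a_i) // (j - i)
    a_1 = a_i - (i - 1) * k
    return a_1, k


files = systematic_sample(start=5, population_size=24, sample_size=4)
first_item, interval = start_and_interval(4, 70, 12, 126)
print(files, first_item, interval)
```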
and sample unit. It is the combination of judgment and stratified sampling methods. So it enjoys the
merits of both.
3.5 SAMPLING DISTRIBUTION
NOTE: The normal probability distribution is used to determine probabilities for normally distributed individual measurements, given the mean and the standard deviation. Symbolically, the variable is the measurement X, with population mean µ and population standard deviation δ. In contrast to such distributions of individual measurements, a sampling distribution is a probability distribution for the possible values of a sample statistic.
Population distribution: is the distribution of the measured values of the population's members; it has mean μ, variance δ² and standard deviation δ. The population standard deviation describes the variation among the values of the members of the population, whereas the standard deviation of a sampling distribution measures the variability, due to sampling error, among the values of a sample statistic such as the mean or the proportion.
Sample distribution: is the distribution of the measured values in random samples drawn from a given population. The sample mean varies from sample to sample, and this variability serves as the basis for the sampling distribution. A sampling distribution is a probability distribution for the possible values of a sample statistic, such as a sample mean.
The expression √[(N − n)/(N − 1)] is called the finite population correction factor or finite population multiplier. In the calculation of the standard error of the mean, if the population standard deviation δ is unknown, the standard error of the mean, δx̄, can be estimated by using the sample standard error of the mean, Sx̄.
The relationship between the shape of the population distribution and the shape of the sampling
distribution of the mean is called the Central Limit Theorem.
The significance of the Central Limit Theorem is that it permits us to use sample statistics to make
inference about population parameters without knowing anything about the shape of the frequency
distribution of that population other than what we can get from the sample. It also permits us to use the
normal distribution curve for analyzing distributions whose shape is unknown. It creates the potential for
applying the normal distribution to many problems when the sample is sufficiently large. As mentioned earlier, when the above properties hold, a given value of the sample mean X̄ is first converted into a value Z on the standard normal distribution, to measure how far it deviates from the mean of the sample means (μx̄), by using the formula:
Z = (X̄ − μx̄)/δx̄ = (X̄ − μ)/(δ/√n), because μx̄ = μ and δx̄ = δ/√n
Example: The mean life of a certain tool is 41.5 hours with a standard deviation of 2.5 hours. What is the probability that a simple random sample of size 50 drawn from this population will have a mean between 40.5 hours and 42 hours?
Given: $\mu = 41.5$, $\delta = 2.5$, $n = 50$; required: $P(40.5 \le \bar{X} \le 42.0)$.
$$\mu_{\bar{X}} = \mu = 41.5, \quad \delta_{\bar{X}} = \frac{\delta}{\sqrt{n}} = \frac{2.5}{\sqrt{50}} = \frac{2.5}{7.0711} = 0.3536$$
The population distribution is unknown, but the sample size n = 50 is large enough to apply the central limit theorem. Hence the normal distribution can be used to find the required probability.
$$P(40.5 \le \bar{X} \le 42.0) = P\left(\frac{\bar{X}_1-\mu}{\delta_{\bar{X}}} \le Z \le \frac{\bar{X}_2-\mu}{\delta_{\bar{X}}}\right) = P\left(\frac{40.5-41.5}{0.3536} \le Z \le \frac{42-41.5}{0.3536}\right)$$
$$= P(-2.8281 \le Z \le 1.4140) = P(-2.8281 \le Z \le 0) + P(0 \le Z \le 1.4140) = 0.4977 + 0.4207 = 0.9184$$
Thus 0.9184 is the probability that the sample mean life of the tool lies between 40.5 and 42 hours.
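The calculation above can be cross-checked numerically. The following is a minimal Python sketch, assuming only the standard library's `statistics.NormalDist`; the variable names are illustrative and are not part of the original notes. Because it works with unrounded Z values instead of table entries, it gives about 0.919 rather than the table-based 0.9184.

```python
from math import sqrt
from statistics import NormalDist

mu, sd, n = 41.5, 2.5, 50          # population mean, standard deviation, sample size
se = sd / sqrt(n)                  # standard error of the mean, about 0.3536

# By the central limit theorem the sample mean is approximately Normal(mu, se).
sampling_dist = NormalDist(mu, se)
prob = sampling_dist.cdf(42.0) - sampling_dist.cdf(40.5)

print(f"standard error = {se:.4f}")
print(f"P(40.5 <= sample mean <= 42.0) = {prob:.4f}")
```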
Example: A continuous manufacturing process produces items whose weights are normally distributed with
a mean weight of 800gms and a standard deviation of 300gms. A random sample of 16 items is to be
selected from the process.
A. What is the probability that the arithmetic mean of the sample exceeds 900gms? Interpret the result.
B. Find the values of the sample arithmetic mean within which the middle 95% of all sample means will fall.
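Solution:
A. Given $\mu_{\bar{X}} = \mu = 800$ gms, $\delta = 300$ gms and $n = 16$, we need $P(\bar{X} \ge 900)$.
$$\delta_{\bar{X}} = \frac{\delta}{\sqrt{n}} = \frac{300}{\sqrt{16}} = \frac{300}{4} = 75$$
$$P(\bar{X} \ge 900) = P\left(Z \ge \frac{\bar{X}-\mu_{\bar{X}}}{\delta_{\bar{X}}}\right) = P\left(Z \ge \frac{900-800}{75}\right) = P(Z \ge 1.33) = 0.5000 - 0.4082 = 0.0918$$
That is, only about 9.18% of all possible samples of 16 items are expected to have a mean weight above 900 gms.
B. Taking Z ≈ 2 for the middle 95% of the area under the normal curve (the exact table value is 1.96), we solve the formula for Z for the values of $\bar{X}$ in terms of the known values:
$$\bar{X}_1 = \mu_{\bar{X}} - Z\delta_{\bar{X}} = 800 - 2(75) = 650 \text{ gms}, \quad \bar{X}_2 = \mu_{\bar{X}} + Z\delta_{\bar{X}} = 800 + 2(75) = 950 \text{ gms}$$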
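Both parts of this example can be reproduced with a short script. This is a sketch under the example's assumptions (normally distributed weights), using Python's standard `statistics.NormalDist`; the names `p_exceeds_900`, `lower` and `upper` are illustrative. With the exact 95% quantile (about 1.96 instead of the rounded Z = 2) the bounds come out near 653 and 947 gms.

```python
from math import sqrt
from statistics import NormalDist

mu, sd, n = 800, 300, 16
se = sd / sqrt(n)                          # 300 / 4 = 75
sampling_dist = NormalDist(mu, se)         # distribution of the sample mean

# Part A: upper-tail probability that the sample mean exceeds 900 gms.
p_exceeds_900 = 1 - sampling_dist.cdf(900)

# Part B: bounds that enclose the middle 95% of all sample means.
lower = sampling_dist.inv_cdf(0.025)
upper = sampling_dist.inv_cdf(0.975)

print(f"P(mean >= 900) = {p_exceeds_900:.4f}")
print(f"middle 95% of sample means: {lower:.1f} to {upper:.1f} gms")
```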
$$\mu_{\bar{P}} = P \quad \text{and} \quad \delta_{\bar{P}} = \sqrt{\frac{Pq}{n}} = \sqrt{\frac{P(1-P)}{n}}$$
If the sample size is large (n ≥ 30) and satisfies the following two conditions:
A. nP ≥ 5
B. nq ≥ 5
Then the sampling distribution of proportions is very closely normally distributed. It may be noted that the
sampling distribution of the proportion would actually follow binomial distribution because population is
binomially distributed.
For a finite population in which sampling is done without replacement, we have:
$$\mu_{\bar{P}} = P \quad \text{and} \quad \delta_{\bar{P}} = \sqrt{\frac{Pq}{n}}\,\sqrt{\frac{N-n}{N-1}}$$
Under the same guidelines as in the previous sections, for a large sample size (n ≥ 30) the sampling distribution of the proportion is closely approximated by a normal distribution with the mean and standard deviation stated above. Hence, the sample proportion $\bar{P}$ is standardized with the standard normal variable:
$$Z = \frac{\bar{P}-\mu_{\bar{P}}}{\delta_{\bar{P}}} = \frac{\bar{P}-P}{\sqrt{\frac{Pq}{n}}}$$
Example: A few years back, a policy was introduced to give loans to unemployed engineers to start their own business. Out of 1,000,000 engineers, 600,000 accepted the policy and got the loan. A sample of 100 unemployed engineers is taken at the time of allotment of loans. What is the probability that the sample proportion would have exceeded 50% acceptance?
Solution:
Given: $\mu_{\bar{P}} = P = 0.60$, N = 1,000,000, n = 100; required: $P(\bar{P} \ge 0.5)$.
$$\delta_{\bar{P}} = \sqrt{\frac{Pq}{n}} = \sqrt{\frac{(0.6)(0.4)}{100}} = 0.0490$$
$$P(\bar{P} \ge 0.5) = P\left(Z \ge \frac{\bar{P}-\mu_{\bar{P}}}{\delta_{\bar{P}}}\right) = P\left(Z \ge \frac{0.50-0.60}{0.0490}\right) = P(Z \ge -2.04) = 0.4793 + 0.5000 = 0.9793$$
Example: A manufacturer of watches has determined from past experience that 3% of the watches
he produces are defective. If a random sample of 300 watches is examined, what is the probability
that the proportion of defective is between 0.02 and 0.035?
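Given: $\mu_{\bar{P}} = P = 0.03$, $\bar{P}_1 = 0.02$, $\bar{P}_2 = 0.035$, n = 300.
$$\delta_{\bar{P}} = \sqrt{\frac{(0.03)(0.97)}{300}} = 0.0098$$
$$P(0.02 \le \bar{P} \le 0.035) = P\left(\frac{\bar{P}_1-P}{\delta_{\bar{P}}} \le Z \le \frac{\bar{P}_2-P}{\delta_{\bar{P}}}\right) = P\left(\frac{0.02-0.03}{0.0098} \le Z \le \frac{0.035-0.03}{0.0098}\right)$$
$$= P\left(\frac{-0.01}{0.0098} \le Z \le \frac{0.005}{0.0098}\right) = P(-1.02 \le Z \le 0.51)$$
$$= P(-1.02 \le Z \le 0) + P(0 \le Z \le 0.51) = 0.3461 + 0.1950 = 0.5411$$
Hence the probability that the proportion of defective watches will lie between 0.02 and 0.035 is 0.5411.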
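As a rough check of the watch example, the sketch below first verifies the two conditions for the normal approximation (nP ≥ 5 and nq ≥ 5) and then evaluates the probability. It assumes Python's standard `statistics.NormalDist`; the names are illustrative. Working with unrounded values gives about 0.539, slightly below the table-based 0.5411, because the worked solution rounds the standard error and the Z values.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.03, 300                      # population proportion defective, sample size
q = 1 - p

# Conditions under which the normal approximation is considered adequate.
approximation_ok = n >= 30 and n * p >= 5 and n * q >= 5

se = sqrt(p * q / n)                  # standard error of the sample proportion
sampling_dist = NormalDist(p, se)
prob = sampling_dist.cdf(0.035) - sampling_dist.cdf(0.02)

print(f"normal approximation adequate: {approximation_ok}")
print(f"standard error = {se:.4f}")
print(f"P(0.02 <= sample proportion <= 0.035) = {prob:.4f}")
```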
$$\delta_{\bar{X}_1-\bar{X}_2} = \sqrt{\delta_{\bar{X}_1}^2 + \delta_{\bar{X}_2}^2} = \sqrt{\frac{\delta_1^2}{n_1}+\frac{\delta_2^2}{n_2}}$$
(standard error of the sampling distribution of the difference of two means)
Here $n_1$ and $n_2$ are the sizes of the independent random samples drawn from the first and second population, respectively.
Example: Car stereos of manufacturer A have a mean lifetime of 1,400 hours with a standard
deviation of 200 hours, while those of manufacturer B have a mean life time of 1,200 hours with a
standard deviation of 100 hours. If a random sample of 125 stereos of each manufacturer are tested,
what is the probability that manufacturer A’s stereos will have a mean life time which is at least;
A. 160 hours more than manufacturer B’s stereos?
B. 250 hours more than manufacturer B’s stereos?
Solution:
Manufacturer A: $\mu_1 = 1{,}400$ hours, $\delta_1 = 200$ hours, $n_1 = 125$
Manufacturer B: $\mu_2 = 1{,}200$ hours, $\delta_2 = 100$ hours, $n_2 = 125$
a)
$$\delta_{\bar{X}_1-\bar{X}_2} = \sqrt{\frac{\delta_1^2}{n_1}+\frac{\delta_2^2}{n_2}} = \sqrt{\frac{(200)^2}{125}+\frac{(100)^2}{125}} = \sqrt{320+80} = \sqrt{400} = 20$$
$$P(\bar{X}_1-\bar{X}_2 \ge 160) = P\left(Z \ge \frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)}{\delta_{\bar{X}_1-\bar{X}_2}}\right) = P\left(Z \ge \frac{160-200}{20}\right) = P(Z \ge -2)$$
$$= 0.5000 + 0.4772 = 0.9772 \text{ (area under the normal curve)}$$
Hence, the probability is very high that the mean lifetime of A's stereos is at least 160 hours more than that of B's.
b) Proceeding in the same manner as in part a):
$$P(\bar{X}_1-\bar{X}_2 \ge 250) = P\left(Z \ge \frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)}{\delta_{\bar{X}_1-\bar{X}_2}}\right) = P\left(Z \ge \frac{250-200}{20}\right) = P(Z \ge 2.5)$$
$$= 0.5000 - 0.4938 = 0.0062 \text{ (area under the normal curve)}$$
Example: The strength of a wire produced by company ‘A’ has a mean of 4,500 kg and a standard deviation $\delta_1$ of 200 kg. Company ‘B’ has a mean of 4,000 kg and a $\delta_2$ of 300 kg. If 50 wires of company ‘A’ and 100 wires of company ‘B’ are selected at random and tested for strength, what is the probability that the sample mean strength of ‘A’ will be at least 600 kg more than that of ‘B’?
Given:
$\mu_1 = 4{,}500$, $\mu_2 = 4{,}000$
$\delta_1 = 200$, $\delta_2 = 300$
$n_1 = 50$, $n_2 = 100$
$$\delta_{\bar{X}_1-\bar{X}_2} = \sqrt{\frac{\delta_1^2}{n_1}+\frac{\delta_2^2}{n_2}} = \sqrt{\frac{(200)^2}{50}+\frac{(300)^2}{100}} = \sqrt{800+900} = \sqrt{1700} = 41.23$$
$$P(\bar{X}_1-\bar{X}_2 \ge 600) = P\left(Z \ge \frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)}{\delta_{\bar{X}_1-\bar{X}_2}}\right) = P\left(Z \ge \frac{600-500}{41.23}\right) = P(Z \ge 2.43)$$
$$= 0.5000 - 0.4925 = 0.0075 \text{ (area under the normal curve)}$$
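The two examples on the difference of sample means follow the same pattern, so they can be sketched with one small helper. The function name `prob_diff_at_least` and its parameters are hypothetical, introduced here only for illustration; the sketch assumes independent samples and the normal approximation used in the worked solutions.

```python
from math import sqrt
from statistics import NormalDist

def prob_diff_at_least(d, mu1, sd1, n1, mu2, sd2, n2):
    """P(mean1 - mean2 >= d) for independent samples, normal approximation."""
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)      # standard error of the difference
    return 1 - NormalDist(mu1 - mu2, se).cdf(d)

# Car stereos: A (1400, 200, n=125) versus B (1200, 100, n=125).
p160 = prob_diff_at_least(160, 1400, 200, 125, 1200, 100, 125)
p250 = prob_diff_at_least(250, 1400, 200, 125, 1200, 100, 125)

# Wire strength: A (4500, 200, n=50) versus B (4000, 300, n=100).
p_wire = prob_diff_at_least(600, 4500, 200, 50, 4000, 300, 100)

print(f"stereos, at least 160 hours: {p160:.4f}")
print(f"stereos, at least 250 hours: {p250:.4f}")
print(f"wires, at least 600 kg:      {p_wire:.4f}")
```

The results agree with the table-based answers (0.9772, 0.0062 and 0.0075) to about three decimal places; small differences come from rounding Z to two decimals in the tables.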
3.5.4 SAMPLING DISTRIBUTION OF THE DIFFERENCE OF TWO
PROPORTIONS
Suppose two populations of sizes $N_1$ and $N_2$ are given. For each sample of size $n_1$ from the first population, compute the sample proportion $\bar{P}_1$ and its standard deviation $\delta_{\bar{P}_1}$. Similarly, for each sample of size $n_2$ from the second population, compute the sample proportion $\bar{P}_2$ and its standard deviation $\delta_{\bar{P}_2}$.
For all combinations of these samples from these populations, we can obtain a sampling distribution of the difference $\bar{P}_1-\bar{P}_2$ of the sample proportions. Such a distribution is called the sampling distribution of the difference of two proportions. The mean and standard deviation of this distribution are given by:
$$\mu_{\bar{P}_1-\bar{P}_2} = \mu_{\bar{P}_1} - \mu_{\bar{P}_2} = P_1 - P_2$$
$$\delta_{\bar{P}_1-\bar{P}_2} = \sqrt{\delta_{\bar{P}_1}^2+\delta_{\bar{P}_2}^2} = \sqrt{\frac{P_1 q_1}{n_1}+\frac{P_2 q_2}{n_2}}$$
If the sample sizes $n_1$ and $n_2$ are large, i.e. $n_1 \ge 30$ and $n_2 \ge 30$, then the sampling distribution of the difference of proportions is closely approximated by a normal distribution.
Example: 10% of the machines produced by company ‘A’ are defective and 5% of those produced by company ‘B’ are defective. A random sample of 250 machines is taken from company A and a random sample of 300 machines is taken from company B. What is the probability that the difference in sample proportions is less than or equal to 0.02?
$$\mu_{\bar{P}_1-\bar{P}_2} = P_1 - P_2 = 0.10 - 0.05 = 0.05, \quad n_1 = 250, \quad n_2 = 300$$
The standard error of the difference in sample proportions is given by
$$\delta_{\bar{P}_1-\bar{P}_2} = \sqrt{\frac{P_1 q_1}{n_1}+\frac{P_2 q_2}{n_2}} = \sqrt{\frac{0.10 \times 0.90}{250}+\frac{0.05 \times 0.95}{300}} = \sqrt{\frac{0.09}{250}+\frac{0.0475}{300}}$$
$$= \sqrt{0.00036+0.000158} = \sqrt{0.000518} = 0.0228$$
The desired probability of the difference in sample proportions is given by
$$P(\bar{P}_1-\bar{P}_2 \le 0.02) = P\left(Z \le \frac{(\bar{P}_1-\bar{P}_2)-(P_1-P_2)}{\delta_{\bar{P}_1-\bar{P}_2}}\right) = P\left(Z \le \frac{0.02-0.05}{0.0228}\right) = P(Z \le -1.32)$$
$$= 0.5000 - 0.4066 = 0.0934 \text{ (area under the normal curve)}$$
Hence the desired probability for the difference in sample proportions is 0.0934.
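The arithmetic in this example is easy to get wrong by a decimal place, so a numerical check is useful. The following sketch assumes independent samples and the normal approximation; variable names are illustrative. It reproduces the standard error of about 0.0228 and a probability of roughly 0.094.

```python
from math import sqrt
from statistics import NormalDist

p1, n1 = 0.10, 250     # company A: proportion defective, sample size
p2, n2 = 0.05, 300     # company B: proportion defective, sample size

# Standard error of the difference of two sample proportions.
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# The difference of sample proportions is approximately Normal(p1 - p2, se).
prob = NormalDist(p1 - p2, se).cdf(0.02)

print(f"standard error = {se:.4f}")
print(f"P(difference <= 0.02) = {prob:.4f}")
```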
STATISTICAL ESTIMATION
4.1 INTRODUCTION
Managers in business, education, social work, and other fields make decisions without complete
information. Automobile manufacturers do not know exactly how many people will purchase
new cars next year. The college registrar does not know exactly how many students will enroll
next fall, but based on the past experience may lay down an estimate plan. Everyone makes
estimates. When you get ready to cross a street, you estimate the speed of the car that is
approaching, the distance between you and the car and your own speed. Having made these
quick estimates, you decide whether to wait, walk or run. In such decisions without complete
information, there is a considerable uncertainty.
In statistical inference, one draws conclusions about the population based on the results obtained from a sample selected from that population. Thus, estimation is the process by which we estimate various unknown population parameters from sample statistics.
Any sample statistic that is used to estimate a population parameter is called an estimator and an
estimate is a numerical value of an estimator.
The sample mean is often used as an estimator of the population mean. Suppose that we calculate
the mean daily revenue of a store for a random sample of 6 days and find it to be 1110 birr. If we
use this value to estimate the daily revenue for the whole year, then the value 1110 birr would be
an estimate.
Definition of Terms:
Interval estimate – The interval, within which a population parameter probably lies, based on
sample information.
Point estimate – A single number computed from a sample & used to estimate a population
parameter.
Sampling error – The difference between a sample statistic and its corresponding population
parameter.
Confidence interval – An interval estimate which is associated with degree of confidence of
containing the population parameter is called Confidence Interval.
4.2 TYPES OF ESTIMATION
4.2.1 POINT ESTIMATION
Point estimation is a statistical procedure in which we use a single value to estimate a population
parameter. A point estimate is a single number that is used as an estimate of a population
parameter, and is derived from a random sample taken from the population of interest.
Some of the most important point estimators are given below:
Population parameter        Point estimator
Mean, μ                     $\bar{X} = \frac{\sum X_i}{n}$
Variance, σ²                $S^2 = \frac{\sum (X_i-\bar{X})^2}{n-1}$
Standard deviation, σ       $S = \sqrt{S^2}$
Proportion, π               $P = \frac{X}{n}$
Example: To set the price of a product, one strategy is competition-oriented, in which you fix the price of your product at the average level charged by other producers. Suppose you want to market a 200-gram bar of soap that you produce. The current wholesale prices charged by a random sample of 10 soap producers (in birr) are:
1.00 1.35 1.50 0.95 0.90 1.25 1.00 1.20 0.90 and 1.50
What is an estimate of the mean wholesale price charged by all soap producers? Find an estimate of the standard deviation in the wholesale prices of all the producers.
Solution: The mean wholesale price, or the population mean (μ), is estimated by the sample mean $\bar{X}$, given by
$$\bar{X} = \frac{\sum_i x_i}{n} = \frac{1.00 + 1.35 + \cdots + 1.50}{10} = 1.155$$
Thus, an estimate of the mean wholesale price charged by all soap producers is 1.155 Birr. Based
on this information, you might set the wholesale price per unit of your product at 1.155 Birr.
The standard deviation in the wholesale prices of all producers is what we call the population standard deviation (σ), and it is estimated by the sample standard deviation:
$$S = \sqrt{\frac{\sum_i (X_i-\bar{X})^2}{n-1}} = \sqrt{\frac{(1.00-1.155)^2 + (1.35-1.155)^2 + \cdots + (1.50-1.155)^2}{9}} = 0.237$$
Thus, the wholesale prices fluctuate below and above their mean by about 0.237 Birr, which is
an estimate of the standard deviation in the wholesale prices of all producers.
Example: Suppose you are interested to know the proportion of fish that are inedible as a result of chemical pollution of a certain lake. In a random sample of 400 fish caught from this lake, 55 were found to be inedible. Out of all fish in this lake, what is an estimate of the proportion of inedible fish?
Solution:
The proportion of inedible fish in the entire lake is what we call the population proportion. This is estimated by the sample proportion:
$$P = \frac{x}{n} = \frac{55}{400} = 0.1375 = 13.75 \text{ percent}$$
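The point estimates in the two examples above can be reproduced with Python's standard `statistics` module. This is a small sketch with illustrative variable names; note that `statistics.stdev` uses the n − 1 divisor, matching the formula for S in the table of point estimators.

```python
from statistics import mean, stdev

# Wholesale prices (birr) from the random sample of 10 soap producers.
prices = [1.00, 1.35, 1.50, 0.95, 0.90, 1.25, 1.00, 1.20, 0.90, 1.50]

x_bar = mean(prices)        # point estimate of the population mean
s = stdev(prices)           # point estimate of the population standard deviation

# Sample proportion of inedible fish: 55 out of 400 caught.
p_hat = 55 / 400

print(f"mean price = {x_bar:.3f} birr, standard deviation = {s:.3f} birr")
print(f"estimated proportion of inedible fish = {p_hat:.4f}")
```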
Although point estimates are often useful, they do have one serious drawback: we do not know
how close or far these values are from the population value they are supposed to estimate, and
hence, we cannot be certain of their reliability. In other words, a point estimate will be more
useful if it is accompanied by an estimate of the error that might be involved. To this end, we use
interval estimation.
4.2.2 INTERVAL ESTIMATION
Interval estimation is a statistical procedure in which we find a random interval with a specified
probability of containing the parameter being estimated. An interval estimate is an interval that
provides an upper bound and a lower bound for a specific population parameter whose value is
unknown. This interval estimate has an associated degree of confidence of containing the
population parameter. Such interval estimates are also called Confidence intervals and are
calculated from random samples.
The interval estimate is an interval that includes the point estimate. For example, if the sample mean is 0.28, one may report that the population mean lies between 0.25 and 0.31 with 95 percent confidence, i.e. the 95 percent confidence interval for the population mean is (0.25, 0.31). Clearly, this interval contains the point estimate 0.28.
4.3 CONFIDENCE INTERVAL FOR THE POPULATION MEAN (μ)
Case I.1 Sampling from a normally distributed population with known variance
Recall that $Z_{\alpha}$ denotes the value of Z for which the area under the standard normal curve to its right is equal to α. Analogously, $Z_{\alpha/2}$ denotes the value of Z for which the area to its right is α/2, and $-Z_{\alpha/2}$ denotes the value for which the area to its left is α/2.
Since the standardized sample mean follows the standard normal distribution, we have:
$$P\left(-Z_{\alpha/2} < \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} < Z_{\alpha/2}\right) = 1-\alpha$$
$$P\left(-Z_{\alpha/2}\,\sigma/\sqrt{n} < \bar{X}-\mu < Z_{\alpha/2}\,\sigma/\sqrt{n}\right) = 1-\alpha$$
$$P\left(\bar{X}-Z_{\alpha/2}\,\sigma/\sqrt{n} < \mu < \bar{X}+Z_{\alpha/2}\,\sigma/\sqrt{n}\right) = 1-\alpha$$
Thus, a (1 − α)100% confidence interval for the population mean is given by:
$$\bar{X} \pm Z_{\alpha/2}\,\sigma/\sqrt{n}$$
where $\bar{X}$ is the sample mean and $Z_{\alpha/2}$ is the value of Z for which the area to its right is α/2. Common
confidence intervals are the 95 percent and the 99 percent confidence intervals. The 95 percent
confidence interval means that about 95 percent of the similarly constructed intervals will
contain the parameter being estimated. If we use the 99 percent level of confidence, then we
expect about 99 percent of the intervals to contain the parameter being estimated.
Another interpretation of the 95 percent confidence interval is that 95 percent of the sample
means for a specified sample size will be within 1.96 standard deviations of the hypothesized
population mean. Similarly, for a 99 percent confidence interval, 99 percent of the sample means
will lie within 2.58 standard deviations of the hypothesized population mean.
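When the population standard deviation is unknown and the sample is large, it is estimated by the sample standard deviation
$$S = \sqrt{\frac{\sum (X_i-\bar{X})^2}{n-1}}$$
Then the 95% confidence interval is given by
$$\bar{X} \pm 1.96\,\frac{S}{\sqrt{n}}$$
and the 99% confidence interval is given by
$$\bar{X} \pm 2.58\,\frac{S}{\sqrt{n}}$$
where $\bar{X}$ is the sample mean and S is the sample standard deviation.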
Example: In a certain small city, to estimate the mean monthly expenditure for food, a random sample of 25 households was selected, yielding a mean of 200 Birr. From experience, it is known that such expenditures are normally distributed with a standard deviation of 50 Birr.
a) What is the point estimate of the mean monthly expenditure for food of all households in the city?
b) Find a 95 percent confidence interval for the mean monthly expenditure for food of all households in the city.
$$\bar{X} \pm Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} = 200 \pm (1.96)\left(\frac{50}{\sqrt{25}}\right) = 200 \pm 19.6 = (180.40 \text{ Birr},\ 219.60 \text{ Birr})$$
I.e. we are 95 percent confident that the true mean monthly expenditure for food (μ) is between 180.40 Birr and 219.60 Birr.
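The interval above can be computed for any confidence level with a small helper. The function `z_interval` below is a hypothetical name used only for this sketch; it assumes a known population standard deviation and obtains the Z value from Python's standard `statistics.NormalDist` rather than from a printed table.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence):
    """Confidence interval for the mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # Z value with area alpha/2 to its right
    half_width = z * sigma / sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Household food expenditure: mean 200 Birr, sigma 50 Birr, n = 25, 95% confidence.
low, high = z_interval(200, 50, 25, 0.95)
print(f"95% confidence interval: ({low:.2f}, {high:.2f}) Birr")
```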
Example: A manufacturer claims that his tires last 20,000 miles on average. A consumer
organization tested a random sample of 64 tires and reported an average of 19,200 miles with a
standard deviation of 2,000 miles. Does a 99 % confidence interval for the mean life of all tires
produced by the manufacturer support the claim?
Solution: -
Given: n = 64, $\bar{X}$ = 19,200 miles, S = 2,000 miles. We have no information about the normality of the population, but by the central limit theorem the sampling distribution of the mean is approximately normal for large n, say n ≥ 30. In our case n = 64 ≥ 30, so we may use the normal distribution.
For a 99% confidence interval, α = 0.01 and α/2 = 0.005, and from the table of the standard normal distribution,
$Z_{\alpha/2} = Z_{0.005} = 2.58$
Thus, a 99% confidence interval for the mean (μ) will be:
$$\bar{X} \pm Z_{\alpha/2}\,S/\sqrt{n}$$
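The numerical evaluation of this interval does not appear at this point in the notes. As a hedged sketch, it can be worked out directly from the values given in the example (n = 64, sample mean 19,200, S = 2,000 and the table value 2.58); the variable names are illustrative. Under these inputs the interval runs from about 18,555 to 19,845 miles, so it would not contain the claimed 20,000 miles.

```python
from math import sqrt

x_bar, s, n = 19_200, 2_000, 64
z = 2.58                          # table value Z_0.005 quoted in the text

half_width = z * s / sqrt(n)      # 2.58 * 2000 / 8 = 645
low, high = x_bar - half_width, x_bar + half_width

claim = 20_000                    # manufacturer's claimed mean life in miles
claim_supported = low <= claim <= high

print(f"99% confidence interval: ({low:.0f}, {high:.0f}) miles")
print(f"claim of {claim} miles inside the interval: {claim_supported}")
```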
5) If we were to construct 100 similar intervals, about 99 should include the population
mean. Or we are 99 % confident that the population mean is located in the interval.
Case I.2 Small sample confidence interval for the population mean: Sampling from a normally distributed population with σ² unknown and n < 30.
If the population variance σ² is not known, then it must be estimated by the sample variance S² as
$$S^2 = \frac{\sum_i (X_i-\bar{X})^2}{n-1}$$
In this situation, since σ² is estimated by S², the sampling distribution of the standardized mean deviates from the normal distribution for small samples; the statistic $(\bar{X}-\mu)/(S/\sqrt{n})$ follows the Student's t distribution with n − 1 degrees of freedom.
For n > 30, the Student's t distribution can be approximated by the normal distribution.
Like the normal distribution, the t-distribution is symmetrical about the mean 0, but it is flatter than the normal distribution. However, as the sample size increases, the t-distribution loses its flatness and becomes approximately normal.
The shape of the t-distribution is determined by the degrees of freedom. Degrees of freedom can
be defined as the number of values we can choose freely. Suppose we are dealing with a sample
of size n = 6, and we know the mean of these 6 numbers is 5. Symbolically, we have:
$$\frac{a+b+c+d+e+f}{6} = 5$$
Now, we are free to assign any value to a, b, c, d and e, say a = 3, b = 2, c = 4, d = 5 and e = 3. But we are no longer free to assign a value to f, since:
$$\frac{a+b+c+d+e+f}{6} = 5 \Rightarrow \frac{17+f}{6} = 5 \Rightarrow 17 + f = 30 \Rightarrow f = 13$$
That is, in order for the mean of these 6 numbers to be 5, f must be 13. If we assign another
number for f, then the mean will not be equal to 5. Thus, we are free to choose only 5 values and
the 6th one is determined automatically.
Hence, the degrees of freedom is: n – 1 = 6 – 1 = 5
Generally, for a sample of size n, the degrees of freedom are n − 1. The values of t for different degrees of freedom and different values of α are tabulated. $t_{\alpha}(n-1)$ denotes the value of t for which the area under the curve to its right is equal to α, with (n − 1) degrees of freedom.
Example: a) For n = 20 and α = 0.025, find $t_{\alpha}(n-1)$.
Solution: From the t-distribution table, $t_{0.025}(19) = 2.093$ (shaded area = 0.025).
b) If n = 26 and α = 0.005, then $t_{\alpha}(n-1) = t_{0.005}(25) = 2.787$ (from the table of the t-distribution).
Under such situations, a (1 − α)100% confidence interval for the population mean is given by:
$$\bar{X} \pm t_{\alpha/2}(n-1)\,S/\sqrt{n}$$
Example: One measure of a company’s financial health is its debt-to-equity ratio. This quantity is
defined to be the ratio of the company’s corporate debt to the company’s equity. If this ratio is
too high, it is one indication of financial instability. For obvious reasons, banks often monitor the
financial health of companies to which they have extended commercial loans. Suppose that, in
order to reduce risk, a large bank has decided to initiate a policy limiting the mean debt-to-
equity ratio for its portfolio of commercial loans to 1.5. In order to estimate the mean debt-to-
equity ratio of its loan portfolio, the bank randomly selects a sample of 15 of its commercial loan
accounts. Audits of these companies result in the following debt-to-equity ratios:
1.31 1.05 1.45 1.21 1.19
1.78 1.37 1.41 1.22 1.11
1.48 1.33 1.29 1.32 1.65
A stem-and-leaf display of these ratios is reasonably mound shaped. Furthermore, the sample
mean and standard deviation of these ratios can be calculated to be $\bar{X}$ = 1.343 and S = 0.192.
Suppose that the bank wishes to calculate a 95% confidence interval for the loan portfolio’s mean debt-to-equity ratio, μ. Since the bank has taken a small sample of size 15, it is appropriate to calculate an interval based on the t distribution. We have n − 1 = 15 − 1 = 14 degrees of freedom, and the level of confidence 100(1 − α) percent = 95 percent implies that α = 0.05. Therefore, we use the t point $t_{\alpha/2} = t_{0.05/2} = t_{0.025} = 2.145$ (from the table). It follows that the 95 percent confidence interval for μ is
$$\bar{X} \pm t_{0.025}\,\frac{S}{\sqrt{n}} = 1.343 \pm 2.145\left(\frac{0.192}{\sqrt{15}}\right) = 1.343 \pm 0.106 = [1.237,\ 1.449]$$
This interval says that the bank is 95 percent confident that the mean debt-to-equity ratio for its portfolio of commercial loan accounts is between 1.237 and 1.449. Based on this interval, the bank has strong evidence that the portfolio’s mean ratio is less than 1.5 (or that the bank is in compliance with its new policy).
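A small-sample interval of this kind can be sketched as follows. The sketch uses the summary statistics quoted in the example and takes the critical value 2.145 from the t table, since the Python standard library has no t-distribution quantile function; the variable names are illustrative.

```python
from math import sqrt

x_bar, s, n = 1.343, 0.192, 15     # summary statistics quoted in the example
t_crit = 2.145                     # t_0.025 with 14 degrees of freedom, from the t table

half_width = t_crit * s / sqrt(n)
low, high = x_bar - half_width, x_bar + half_width

policy_limit = 1.5
within_policy = high < policy_limit   # whole interval lies below the limit

print(f"95% confidence interval: ({low:.3f}, {high:.3f})")
print(f"interval entirely below {policy_limit}: {within_policy}")
```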
4.4 INTERVAL ESTIMATION OF THE POPULATION
PROPORTION
The sample proportion $\bar{p}$ is an unbiased point estimator of the population proportion p, and its sampling distribution is approximately normal when n is large (np ≥ 5 and nq ≥ 5), with:
$$z = \frac{\bar{p}-p}{\sqrt{\frac{pq}{n}}}$$
Solving this expression for p gives $p = \bar{p} - z\,\delta_{\bar{p}}$. Here, however, p is unknown, and therefore the standard error $\delta_{\bar{p}} = \sqrt{pq/n}$ is estimated using the sample proportion $\bar{p}$, with $\bar{q} = 1-\bar{p}$. The above expression becomes
$$p = \bar{p} - z\sqrt{\frac{\bar{p}\,\bar{q}}{n}}$$
Since z represents the confidence level, we can write the above expression as
$$p = \bar{p} \pm z_{\alpha/2}\,\delta_{\bar{p}}$$
With $\bar{p} = 0.39$ and n = 87:
I. Compute the standard error: $\delta_{\bar{p}} = \sqrt{\frac{(0.39)(0.61)}{87}} = 0.0523$
II. Compute α/2 = 0.025 and look up $z_{\alpha/2} = 1.96$ from the table.
$$p = 0.39 \pm 1.96(0.0523) = 0.39 \pm 0.1025, \quad \text{so } 0.2875 \le p \le 0.4925$$
Interpretation of results: We state with 95% confidence that the proportion of companies which used telemarketing to assist order processing lies between 0.2875 and 0.4925.
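The interval for the proportion can be checked with a few lines of Python. This is a sketch using the sample proportion 0.39 and sample size 87 that appear in the computation above; names are illustrative, and the Z value is taken from `statistics.NormalDist` instead of the table.

```python
from math import sqrt
from statistics import NormalDist

p_hat, n, confidence = 0.39, 87, 0.95

# Estimated standard error of the sample proportion.
se = sqrt(p_hat * (1 - p_hat) / n)

# Z value with area alpha/2 to its right (about 1.96 for 95% confidence).
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

low, high = p_hat - z * se, p_hat + z * se
print(f"standard error = {se:.4f}")
print(f"95% confidence interval for p: ({low:.4f}, {high:.4f})")
```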
4.5 DETERMINING THE SAMPLE SIZE IN ESTIMATION
Whenever we take a sample for inferential purposes, there is always a sampling error. This
sampling error is controlled by selecting a sample that is adequate in size. If the sample size is
small, then we may fail to achieve the objective of our analysis, and if it is too large, then we
waste the resources when we gather the sample.
$E = Z_{\alpha/2}\,\sigma/\sqrt{n}$ if σ is known
$E = Z_{\alpha/2}\,S/\sqrt{n}$ if σ is not known
b. With probability (1 − α), the sampling error will not exceed some prescribed quantity E if the sample size is at least:
$$n = \left[\frac{Z_{\alpha/2}\,\sigma}{E}\right]^2$$
If n comes out fractional, round up to the next integer.
Example: The owner of a chain of hotels wants to determine the mean number of rooms
occupied per day (so that he can have an estimate of the average daily revenue obtained by
renting rooms). From past records, the standard deviation of the daily occupancy is known to be
9 rooms.
a) How large a sample of days should be taken so that the true mean number of rooms
occupied per day will not differ from the sample mean by more than 3 rooms at the 95
percent confidence level?
b) At the 99 percent confidence level, what is the maximum error committed in estimating
the true mean by the sample mean if a random sample of 64 days is taken?
Solution: -
Given σ = 9 rooms
a) E = 3 rooms, (1 − α)100% = 95%, so α = 0.05
$Z_{\alpha/2} = Z_{0.025} = 1.96$
$$n = \left(\frac{Z_{\alpha/2}\,\sigma}{E}\right)^2 = \left(\frac{1.96 \times 9}{3}\right)^2 = 34.5744$$
Therefore, a sample of size at least 35 days is required.
b) n = 64, (1 − α)100% = 99%, so α = 0.01
$Z_{\alpha/2} = Z_{0.005} = 2.58$
$$E = Z_{\alpha/2}\,\sigma/\sqrt{n} = (2.58)\left(\frac{9}{\sqrt{64}}\right) = 2.9$$
Therefore, if we use a random sample of 64 days, then we are 99% certain that the error in
estimation will not exceed 2.9 rooms. I.e. the difference between the average daily occupancy
computed from the sample and the true average daily occupancy will not exceed 2.9 rooms.
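Both directions of this calculation (sample size from a target error, and error from a given sample size) can be sketched in Python. The function names below are hypothetical and exist only for this illustration; the Z values come from `statistics.NormalDist`, so the maximum error is about 2.90 rather than exactly the table-based 2.9025.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_value(confidence):
    """Z value with area (1 - confidence) / 2 to its right."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def required_sample_size(sigma, max_error, confidence):
    """Smallest n whose sampling error does not exceed max_error (rounded up)."""
    return ceil((z_value(confidence) * sigma / max_error) ** 2)

def maximum_error(sigma, n, confidence):
    """Maximum sampling error E for a sample of size n."""
    return z_value(confidence) * sigma / sqrt(n)

n_days = required_sample_size(sigma=9, max_error=3, confidence=0.95)
error_64 = maximum_error(sigma=9, n=64, confidence=0.99)

print(f"days required: {n_days}")
print(f"maximum error with 64 days at 99% confidence: {error_64:.2f} rooms")
```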
CHAPTER - FIVE
HYPOTHESIS TESTING
What is Hypothesis?
A hypothesis is a statement about the value of a population parameter developed for the purpose of testing. It is an assertion or tentative claim which requires justification.
There are two types of hypothesis:
1) The null hypothesis: an assertion that a population parameter assumes a fixed value. It always includes the equality sign, and is denoted by H0. The null hypothesis is often established in such a way that it states ‘nothing is different’ from what it is supposed to be, is claimed to be, or has been in the past.
2) The alternative (research) hypothesis: describes what you will conclude if you reject the null hypothesis. It is a statement that is accepted if the sample data provide statistically significant evidence that the null hypothesis is false. It is written as H1 and is read “H sub-one”. It is also referred to as the research hypothesis.
Example 1: A soft drink bottling company’s advertisement states that a bottle of its product contains 330 milliliters (ml). But customers are complaining that the company is underfilling its products. To check whether the complaint is true or not, an inspector may test the following hypothesis:
H0: The average content of a bottle of this product equals 330 ml, against
H1: The average content of a bottle of this product is less than 330 ml.
Or symbolically,
H0: μ = 330 ml
H1: μ < 330 ml
If the inspector takes a random sample of bottles of this product and finds that the mean content per bottle is much less than 330 ml, then he may conclude that the complaint of the customers is correct.
Hypothesis testing is a procedure, based on sample evidence and probability theory, used to determine whether the hypothesis is a reasonable statement and should not be rejected, or is unreasonable and should be rejected. In other words, it is a procedure for checking the validity of a statistical hypothesis: the process by which we decide whether the null hypothesis should be rejected or not. The value, computed from sample information, used to determine whether or not to reject the null hypothesis is called the test statistic.
The level of significance is the probability of rejecting the null hypothesis when it is actually true. The level of significance is also referred to as the level of risk. This may be a more appropriate term because it is the risk you take of rejecting the null hypothesis when it is really true. There is no unique level of significance; it depends on the choice of the researcher. The researcher must decide on the level of significance before formulating a decision rule and collecting sample data. There are two commonly used levels of significance: 0.05 and 0.01.
2. Type II Error: the error that is committed in accepting the null hypothesis when it is actually false. The probability of a Type II error is designated by the Greek letter beta (β).
The researcher cannot study every item or individual in the population. Thus, there is a
possibility of two types of error – a type I error, wherein the null hypothesis is rejected when it
should have been accepted, and a Type II error, wherein the null hypothesis is accepted when it
should have been rejected.
We often refer to these two possible errors as the alpha error, α, and the beta error, β. Alpha (α) is the probability of making a Type I error, and beta (β) is the probability of making a Type II error.
The following table summarizes the decisions the researcher could make and the possible consequences.
Decision       State of nature (null hypothesis)
               H0 true             H0 false
Accept H0      Correct decision    Type II error
Reject H0      Type I error        Correct decision
Note in the chart that:
1. The area where the null hypothesis is not rejected includes the area to the left of 1.645, where the value 1.645 is $Z_{0.05}$ read from the table of the standard normal distribution (in a one-tailed test the whole 0.05 lies in one tail).
2. The area of rejection is to the right of 1.645.
3. A one-tailed test (right) is being applied (this will be explained soon in the unit).
4. The 0.05 level of significance is chosen.
5. The sampling distribution of the statistic Z is normally distributed.
6. The value 1.645 separates the regions where the null hypothesis is rejected and where it is not rejected.
7. The value 1.645 is called the critical value.
Figure 2: Sampling distribution for the statistic Z, one-tailed (left) test, and 0.05 level of significance. Critical value: $-Z_{0.05} = -1.645$.
Figure 3: Regions of non-rejection and rejection for a two-tailed test, Z-statistic, and 0.05 level of significance. Critical values: $-Z_{0.05/2} = -1.96$ and $Z_{0.05/2} = 1.96$.
We can construct a similar acceptance and rejection region while using the t-statistics.
Types of tests
Based on the form of the null and alternative hypothesis, there are two types of tests: a one-sided
test and a two-sided test.
a) A one-sided test: a test is said to be one sided (one tailed) when the alternative hypothesis,
H1, states a direction, such as:
HO: The mean income of females is less-than or equal to the mean income of males.
H1: The mean income of males is greater than the mean income of females.
There are two kinds of one-sided test
i) A left-tailed test: This is a type of test in which the less-than sign is involved in the alternative hypothesis. It has one rejection region at the left tail of the appropriate distribution.
Example 2: Suppose $\mu_0$ is the hypothesized (assumed) mean. A left-tailed test for μ is
H0: μ ≥ $\mu_0$
H1: μ < $\mu_0$
Figure 2 indicates the acceptance and rejection regions of a left-tailed test using the Z-distribution.
ii) A right-tailed test: This is a type of test in which the greater-than sign is involved in the alternative hypothesis. It has one rejection region at the right tail of the appropriate distribution.
Example 3: Suppose $\mu_0$ is the assumed mean. A right-tailed test for μ is
H0: μ ≤ $\mu_0$
H1: μ > $\mu_0$
Figure 1 can be taken as the acceptance and rejection regions of a right-tailed test using the Z-distribution.
[Figure: steps in hypothesis testing. Recoverable labels: Step 2, select a level of significance; Step 3; formulate a decision rule; Step 5, take a sample and arrive at a decision (accept H0, or reject H0 and accept H1).]
HYPOTHESIS TESTING OF THE POPULATION MEAN
Mean of the population can be tested presuming different situations such as the population may
be normal or other than normal, it may be finite or infinite, sample size may be large or small,
variance of the population may be known or unknown and the alternative hypothesis may be
two-sided or one-sided. Our testing technique will differ in different situations. We may consider
some of the important situations.
Case 1. Population normal, large sample and population standard
deviation is known. The alternative hypothesis may be one-
sided or two-sided.
In such a situation, the Z-test (Z statistic) is used for testing a hypothesis about the mean, and the test statistic Z is worked out as under:
$$Z = \frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}}$$
where Z follows the standard normal distribution,
$\bar{X}$ = the sample mean
$\mu_0$ = the hypothesized mean
σ = the population standard deviation
n = the sample size.
Example 4:Selam
4:Selam Hotel has been having average sales of 500 teacups per day. Because of the
development of bus stand nearby, he expects to increase its sales. For the first 50 days after the
start of the bus stand, he recorded an average daily sale of 550 teacups per day. From the past
records, it is known that the sales standard deviation is 50.
On the bases of this sample information, can one conclude thatSelam’s Hotel sales have
increased? Use 5 percent level of significance.
Solution:
Taking the null hypothesis that sales average 500 teacups per day and have not increased
unless proved otherwise, we can write:
H0: μ ≤ 500 cups per day
H1: μ > 500 (as we want to conclude that sales have increased)
As the sample size is 50 > 30 (n > 30), and the population standard deviation σ = 50 is known,
we shall use the Z-test, assuming a normal population, and work out the test statistic Z as:
$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} = \frac{550 - 500}{50 / \sqrt{50}} = \frac{50}{50 / \sqrt{50}} = \sqrt{50} = 7.07$$
As H1 is one-sided, we determine the rejection region by applying a one-tailed test (in the right
tail, because H1 is of the "more than" type) at the 5 percent level of significance. Because the
test is one-tailed, the whole 5 percent lies in the right tail, and the table of the Z-distribution gives
Z0.05 = 1.645
The rejection region is Z > 1.645. The observed (calculated) test statistic is 7.07, which is
greater than 1.645, i.e. it is in the rejection region. Thus there is evidence to reject H0:
H0 is rejected, and we can conclude that the sample data indicate that Selam Hotel's sales have
shown a considerable increase.
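The decision in Example 4 can be checked numerically. The sketch below uses Python's standard-library `statistics.NormalDist` to look up critical values instead of a printed table; the variable names are illustrative assumptions. For a right-tailed test at the 5 percent level the critical value is about 1.645, while 1.96 is the value for a two-tailed test at the same level; with Z near 7.07 the decision is the same under either.

```python
import math
from statistics import NormalDist

alpha = 0.05
# Right-tailed test: the whole 5% lies in the upper tail
critical_one_tailed = NormalDist().inv_cdf(1 - alpha)      # about 1.645
# For comparison: a two-tailed test puts alpha/2 in each tail
critical_two_tailed = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.960

# Example 4 figures: sample mean 550, hypothesized mean 500, sigma 50, n 50
z = (550 - 500) / (50 / math.sqrt(50))
reject_h0 = z > critical_one_tailed

print(round(z, 2), round(critical_one_tailed, 3), reject_h0)  # 7.07 1.645 True
```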
Example 5: The mean of a certain production process is known to be 50 with a standard
deviation of 3. The production manager may welcome any change in the mean value towards the
higher side but would like to safeguard against decreasing values of the mean.
He takes a sample of 36 items that gives a mean value of 48.5. What inference should the
manager draw about the production process on the basis of the sample results? Use a 1 percent
level of significance for the purpose.
Solution:
Taking the mean value of the population to be 50, we may write:
H0: μ ≥ 50
H1: μ < 50 (since the manager wants to safeguard against decreasing values of the mean)
Given X̄ = 48.5, σ = 3 and n = 36, and assuming the population to be normal, we can work out
the test statistic Z as under:
$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} = \frac{48.5 - 50}{3 / \sqrt{36}} = \frac{-1.5}{3/6} = \frac{-1.5}{1/2} = -3.00$$
As H1 is one-sided in the given question, we determine the rejection region by applying a
one-tailed test (in the left tail, as H1 is of the "less than" type) at the 1 percent level of
significance. Using the normal curve area table, with the whole 1 percent in the left tail:
R: Z < -Z0.01, i.e. R: Z < -2.33
The observed value of Z, which we call Z calculated, is -3, which is less than -2.33, i.e. it is in
the rejection region, and thus H0 is rejected at the 1 percent level of significance. We can conclude
that the production process is showing a mean which is significantly less than the hypothesized
mean of 50, and this calls for some corrective action concerning the said process.
Example 6: A sample of 400 male students is found to have a mean height of 67.47 inches. Can
it be reasonably regarded as a sample from a large population with mean height 67.39 inches and
standard deviation 1.30 inches? Test at the 5% level of significance.
Solution:
Taking the null hypothesis that the mean height of the population is equal to 67.39 inches, we
can write:
H0: μ = μ0 = 67.39 inches
H1: μ ≠ 67.39 inches
Given X̄ = 67.47 inches, σ = 1.30 inches and n = 400, and assuming the population to be
normal, we can work out the test statistic Z as under:
$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} = \frac{67.47 - 67.39}{1.30 / \sqrt{400}} = \frac{0.08}{0.065} = 1.231$$
As H1 is two-sided in the given question, we apply a two-tailed test. In the two-tailed
test, the rejection region at the 5% level is
R: |Z| > 1.96
As the observed value of Z is 1.231, which lies between -1.96 and 1.96, it is in the
acceptance region. Therefore, H0 is accepted, i.e. we may conclude that the given sample
(with mean height 67.47 inches) can be regarded as having been taken from a population with
mean height 67.39 inches and standard deviation 1.30 inches at the 5% level of significance.
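The three kinds of alternative hypothesis differ only in where the rejection region lies. The following is a minimal Python sketch of that decision rule; the function `z_test` and its `alternative` labels ("two-sided", "greater", "less") are illustrative assumptions, and critical values are taken from `statistics.NormalDist` rather than from a printed table.

```python
import math
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n, alpha, alternative="two-sided"):
    """Return (z, reject_h0) for a one-sample Z-test with known sigma.

    alternative: "two-sided" (H1: mu != mu0), "greater" (H1: mu > mu0)
    or "less" (H1: mu < mu0).
    """
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    std_normal = NormalDist()
    if alternative == "two-sided":
        # alpha/2 in each tail
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    elif alternative == "greater":
        # all of alpha in the right tail
        reject = z > std_normal.inv_cdf(1 - alpha)
    elif alternative == "less":
        # all of alpha in the left tail
        reject = z < std_normal.inv_cdf(alpha)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z, reject

# Example 6: two-sided test at the 5% level
z6, reject6 = z_test(67.47, 67.39, 1.30, 400, alpha=0.05)
print(round(z6, 3), reject6)  # 1.231 False

# Example 5: left-tailed test at the 1% level
z5, reject5 = z_test(48.5, 50, 3, 36, alpha=0.01, alternative="less")
print(round(z5, 2), reject5)  # -3.0 True
```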
* If the population is finite and the population standard deviation is known, one can use the
test statistic
$$Z = \frac{\bar{X} - \mu_0}{\frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}}$$
where $\sqrt{\frac{N - n}{N - 1}}$ is what we call the finite population correction and N is the
population size. The procedure of testing is the same as in the above three examples.
Example 7: A workers' union with a total of 1,000 members is on strike for higher wages. The
union claims that the mean salary for workers is at most Birr 8,400 per year. The legislator does
not want to reject the union's claim, however, unless the evidence is very strong against it.
Assume that salaries follow a normal distribution and the population standard deviation is known
to be Birr 3,000. A random sample of 64 workers is obtained, and the sample mean is Birr 9,400.
Test whether the state legislator accepts the union's claim or not at the 1% significance level.
Solution:
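H0: μ ≤ 8,400
H1: μ > 8,400
σ = 3,000
X̄ = 9,400
n = 64, N = 1,000
α = 0.01
Z0.01 = 2.33 (one-tailed test, right tail)
Finite population correction: $\sqrt{\frac{N - n}{N - 1}} = \sqrt{\frac{1000 - 64}{1000 - 1}} \approx 0.968$
$$Z = \frac{\bar{X} - \mu_0}{\frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}} = \frac{9{,}400 - 8{,}400}{\frac{3{,}000}{\sqrt{64}} \sqrt{\frac{1000 - 64}{1000 - 1}}} = \frac{1{,}000}{375 \times 0.968} = 2.755$$
As the computed Z = 2.755 is greater than the critical value 2.33, it lies in the rejection region
R: Z > 2.33, and H0 is rejected at the 1% level of significance. The sample gives strong evidence
that the mean salary exceeds Birr 8,400, so the legislator does not accept the union's claim.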
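The effect of the finite population correction on Example 7 can be checked with a short Python sketch. The function names `fpc` and `z_statistic_finite` are illustrative assumptions; the figures are those given in the example (N = 1,000, n = 64, σ = 3,000).

```python
import math

def fpc(population_size, n):
    """Finite population correction sqrt((N - n) / (N - 1))."""
    return math.sqrt((population_size - n) / (population_size - 1))

def z_statistic_finite(sample_mean, mu0, sigma, n, population_size):
    """Z statistic whose standard error is scaled by the correction."""
    standard_error = (sigma / math.sqrt(n)) * fpc(population_size, n)
    return (sample_mean - mu0) / standard_error

# Example 7 figures
print(round(fpc(1000, 64), 6))                                   # 0.967955
print(round(z_statistic_finite(9400, 8400, 3000, 64, 1000), 3))  # 2.755
```

Because the correction is below 1, it shrinks the standard error and so makes Z larger in absolute value than the uncorrected statistic (1,000 / 375 ≈ 2.667 here).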
Example 8: Kebede's discount store chain issues its own credit card. The credit manager
wants to find out if the mean monthly unpaid balance is more than birr 400. The level of
significance is set at 0.05. A random check of 172 unpaid balances revealed the sample mean to
be birr 407 and the standard deviation of the sample to be birr 38. Should the credit manager
conclude that the population mean is greater than birr 400, or is it reasonable to assume that the
difference of birr 7 (407 - 400) is due to chance?
Solution:
The null and alternative hypotheses are stated as:
H0: μ ≤ birr 400
H1: μ > birr 400
Because the alternative hypothesis states a direction, a one-tailed test is applied. The critical
value of Z at the 0.05 level is Z0.05 = 1.645. Because the sample is large, the sample standard
deviation S is used in place of σ, and the computed value of Z is
$$Z = \frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{407 - 400}{38 / \sqrt{172}} = 2.42$$
As the computed value of the test statistic (2.42) is larger than the critical value (1.645), i.e.
2.42 is in the rejection region R: Z > 1.645, the null hypothesis is rejected and the credit manager
can conclude that the mean unpaid balance is greater than birr 400.
Case 3. Population Normal, Small sample and Standard deviation of
the population is Unknown
1) If the population is infinite, we use the test statistic
$$t = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$$
which follows a Student t-distribution with (n - 1) degrees of freedom, and S is the sample
standard deviation given by the formula
$$S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$$
2) If the population is finite, using the finite population correction, the test statistic used is
modified as:
$$t = \frac{\bar{X} - \mu_0}{\frac{S}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}}$$
with (n - 1) df.
While testing, the value of the test statistic calculated from the sample result will be compared
with the tabulated value of the t-distribution at the given level of significance.
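The two Case 3 formulas can be sketched in Python as follows. The function names `sample_sd` and `t_statistic`, the optional `population_size` argument and the data in the usage lines are illustrative assumptions. The tabulated t value still has to be read from a t-table for (n - 1) degrees of freedom, since the Python standard library has no t-distribution.

```python
import math

def sample_sd(values):
    """Sample standard deviation with the (n - 1) divisor."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def t_statistic(sample_mean, mu0, s, n, population_size=None):
    """Return (t, df); applies the finite population correction if N is given."""
    standard_error = s / math.sqrt(n)
    if population_size is not None:
        standard_error *= math.sqrt((population_size - n) / (population_size - 1))
    return (sample_mean - mu0) / standard_error, n - 1

# Hypothetical data, for illustration only
data = [12, 15, 11, 14, 13]
s = sample_sd(data)
t, df = t_statistic(sum(data) / len(data), mu0=12, s=s, n=len(data))
print(round(s, 3), round(t, 3), df)  # 1.581 1.414 4
```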
Example 9: Specimens of copper wire drawn from a large lot have the following breaking
strengths (in kg weight): 578, 572, 570, 568, 572, 578, 570, 572, 596, 544.
Test whether the mean breaking strength of the lot may be taken to be 578 kg weight at the 5%
level of significance.
Solution: -
Taking the null hypothesis that the population mean is equal to the hypothesized mean of 578 kg,
we can write:
H0: μ = μ0 = 578 kg
H1: μ ≠ 578 kg
As the sample size is small (n = 10) and the population standard deviation is not known, we shall
use the t-test, assuming a normal population, and work out the test statistic t as under:
$$t = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$$
Calculating the sample mean and sample standard deviation, one obtains
X̄ = 572 kg and S = 12.72 kg.
$$\text{Hence } t = \frac{572 - 578}{12.72 / \sqrt{10}} = -1.492$$
Degrees of freedom = (n - 1) = (10 - 1) = 9.
As H1 is two-sided, we determine the rejection regions by applying a two-tailed test at the 5%
level of significance, using the table of the t-distribution for 9 degrees of freedom:
R: |t| > 2.262, i.e. t > t(α/2) or t < -t(α/2)
Acceptance region: -2.262 < t < 2.262
As the observed value of t (i.e. -1.492) is in the acceptance region, we accept H0 at the 5% level
and conclude that the mean breaking strength of the copper wire lot may be taken as 578 kg weight.
* For a two-tailed test in the t-distribution, if the level of significance is α, the area in the
right tail and the area in the left tail must add up to α. Thus we take each tail to be α/2.
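The arithmetic of Example 9 can be reproduced from the raw data. This is a minimal Python sketch using the standard-library `statistics` module; the variable names are illustrative assumptions, and the tabulated value 2.262 is copied from the example rather than computed.

```python
import math
import statistics

strengths = [578, 572, 570, 568, 572, 578, 570, 572, 596, 544]
mu0 = 578
t_table = 2.262  # tabulated t for 9 df, two-tailed test at the 5% level

n = len(strengths)
x_bar = statistics.mean(strengths)
s = statistics.stdev(strengths)  # sample SD, (n - 1) divisor
t = (x_bar - mu0) / (s / math.sqrt(n))

print(f"{x_bar:.1f} {s:.2f} {t:.2f}")                     # 572.0 12.72 -1.49
print("reject H0" if abs(t) > t_table else "accept H0")  # accept H0
```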