Bio Statistics
STATISTICS
WHAT IS STATISTICS?
Statistics is a scientific body of knowledge that deals with the collection, organization or presentation, analysis, and
interpretation of data.
- Data are facts, or a set of information or observations, under study.
- Collection refers to the gathering of information or data.
- Organization or presentation involves summarizing data or information in textual, graphical, or tabular forms.
- Analysis involves describing the data by using statistical methods and procedures.
- Interpretation refers to the process of making conclusions based on the analyzed data.
TWO CATEGORIES OF STATISTICS
a. Descriptive Statistics is a statistical procedure concerned with describing the characteristics and properties of a
group of persons, places, or things.
Example: We may describe a collection of persons by stating how many are poor and how many are rich, how many are
literate and how many are illiterate, how many fall into various categories of age, height, civil status, IQ, and many more. We may
also describe a particular hospital in terms of the number of patients it has, the number of clinical units, the number of doctors or
the number of nurses.
b. Inferential Statistics is a statistical procedure that is used to draw inferences or information about the properties or
characteristics of a large group of people, places, or things on the basis of the information obtained from a small portion of that
group. It is concerned with reaching conclusions. At times the information available is incomplete, and generalizations are reached
based on the data available.
Example: As a result of the increase in the number of patients in a hospital this week because of meningococcemia, it is
expected that the number of patients will double next week.
Suppose we want to know the favorite brand of toothpaste in a certain barangay, but we do not have
enough time and money to interview all the residents of that barangay. We may just ask selected residents. With the data obtained
from the interviews, we shall draw a conclusion as to the barangay's favorite brand of toothpaste.
TERMINOLOGIES
1. Population refers to a collection of objects, persons, places, or things. To illustrate this, suppose a researcher wants to determine
the average income of the residents of a certain barangay and there are 1500 residents in the barangay. Then all these residents
comprise the population. A population is usually denoted or represented by N. Hence in this case, N=1500.
2. Sample is a small portion or part of a population. It could also be defined as a subgroup, subset, or representative of a population.
For instance, suppose the above-mentioned researcher does not have enough time and money to conduct the study using the whole
population and he wants to use only 200 residents. These 200 residents comprise the sample. A sample is usually denoted by n, thus
n = 200.
3. Parameter is any numerical or nominal characteristic of a population. It is a value or measurement obtained from a population. It
is usually referred to as the true or actual value. If, in the preceding illustration, the researcher uses the whole population (N = 1500),
then the average income obtained is called a parameter.
4. Statistic is an estimate of a parameter. It is any value or measurement obtained from a sample. If the researcher in the preceding
illustration makes use of the sample (n = 200), then the average income obtained is called a statistic.
5. Data (singular form is datum) are facts, or a set of information or observations under study. More specifically, data are gathered
by the researcher from a population or from a sample.
Two categories of data
a. Qualitative data are data which can assume values that manifest the concept of attributes. These are sometimes called
categorical data. Data falling in this category cannot be subjected to meaningful arithmetic operations. They cannot be added,
subtracted, or divided. Gender, nationality, marital status, educational level, and race are qualitative data.
b. Quantitative data are data which are numerical in nature. These are data obtained from counting or measuring. In
addition, meaningful arithmetic operations can be done with this type of data. Test scores, height, weight and blood pressure are
quantitative data.
Types of data
a. Raw data are in their original form and structure.
b. Grouped data are placed in tabular form characterized by class intervals with the corresponding frequency.
c. Primary data are measured and gathered by the researcher who published them. They refer to information which is gathered
directly from an original source or which is based on direct or firsthand experience. Example: first-person accounts,
autobiographies, diaries.
d. Secondary data are republished by another researcher or agency. They refer to information which is taken from published
or unpublished data which were previously gathered by other individuals and agencies.
Example: published books, newspapers, magazines, biographies, business reports, and the like.
6. Constant is a property or characteristic of a population or sample which makes the members of the group similar to each other.
For example, if a class is composed of all boys, then gender is a constant.
7. A variable refers to a characteristic or property of a population or sample which makes the members different from each other. If
a class consists of boys and girls, then gender is a variable in this class. Height is also a variable because different people have
different heights.
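The terms population, sample, parameter, and statistic defined above can be illustrated with a short sketch. The incomes below are randomly generated and purely hypothetical; only N = 1500 and n = 200 come from the barangay illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is repeatable

# Hypothetical monthly incomes (in pesos) for a barangay of N = 1500 residents.
population = [random.randint(5_000, 50_000) for _ in range(1500)]
N = len(population)

# Draw n = 200 residents, without replacement, to form the sample.
sample = random.sample(population, 200)
n = len(sample)

parameter = statistics.mean(population)  # computed from the whole population
statistic = statistics.mean(sample)      # computed from the sample only

print(f"N = {N}, parameter (population mean) = {parameter:.2f}")
print(f"n = {n}, statistic (sample mean)     = {statistic:.2f}")
```

The two means will usually be close but not equal, which is why the statistic is described as an estimate of the parameter.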
Classification of Variables
A. According to Functional Relationship
a. Dependent Variable is a variable which is affected or influenced by another variable.
b. Independent Variable is one which affects or influences the dependent variable.
B. According to Continuity of Values
a. Discrete Variable is one that can assume a finite number of values. In other words, it can assume specific values
only. The values of a discrete variable are obtained through the process of counting.
The number of patients in a clinical unit is a discrete variable. If there are 40 patients, it cannot be reported that there are
40.2 patients or 40.5 patients.
b. Continuous Variable is one that can assume infinitely many values within a specified interval. The values of a continuous
variable are obtained through measuring. Continuous variables are those whose values are measured to the nearest unit.
Data measured in decimal fractions, but recorded to the nearest whole number, are still continuous data.
For example, height is a continuous variable. If one reports that the height of a building is 15 m, it is also possible
that another person reports that the height of the same building is 15.1 m or 15.12 m, depending on the precision of the measuring
device used. In other words, the height of the building can assume several values.
A person two months away from their 22nd birthday is actually closer to age 22 than to age 21, but in most
instances that person would be considered to be age 21 until their actual 22nd birthday.
C. According to Scale of Measurements
Statistics deals mostly with measurements. We define measurement as the assignment of symbols or numerals to objects
or events according to some rules. Since different rules are used for the assignment of symbols, then this would yield different scales
of measurement.
1. Nominal Scale - this is the most primitive level of measurement. The nominal level of measurement is used when
we want to distinguish one object from another for identification purposes. In this level, we can only say that one object is different
from another, but the amount of difference between them cannot be determined. We cannot tell that one is better or worse than
the other. Telephone numbers, zip codes, credit card numbers, gender, nationality, and civil status are of nominal scale.
2. Ordinal Scale - data are arranged in some specific order or rank. When objects are measured in this level, we
can say that one is better or greater than the other, but we cannot tell how much more or how much less of the characteristic one
object has than the other. The ranking of contestants in a beauty contest, the birth order of siblings in a family, and the honor ranking
of the students in a class are of ordinal scale.
3. Interval Scale - if data are measured in the interval level, we can say not only that one object is greater or less
than another, but we can also specify the amount of difference. The scores in an examination are of the interval scale of
measurement. To illustrate, suppose Maria got 50 in a Math examination while Martha got 40. We can say that Maria scored higher
than Martha by 10 points.
4. Ratio Scale - this level of measurement is like the interval level. The only difference is that the ratio level always
starts from an absolute or true zero point. In addition, in the ratio level, most of the data have units of measure. If
data are measured in this level, we can say that one object is so many times as large or as small as the other. For example, suppose
Mrs. Reyes weighs 50 kg, while her daughter weighs 25 kg. We can say that Mrs. Reyes is twice as heavy as her daughter. Thus,
weight is an example of data measured in the ratio scale.
EXERCISES
A. Indicate whether the data represented in each of the following is a part of a population or a sample.
1. Twenty-five cases of TB have been reported in the past year and a patient care evaluation study is to be carried out using
data from all 25 cases.
2. A total of 388 chest x-rays were performed during the past month. A quality control review is to be carried out on 10% of the
group.
B. Tell whether the following situations will make use of descriptive or inferential statistics.
1. A teacher computes the average grade of her students and determines the top ten students.
2. The CEO of a hospital predicts the decrease in the number of patients that will be admitted next year based on the data
collected this year.
3. A school administrator forecasts the future expansion of a school.
4. A researcher determines the total number of patients in ZCMC.
5. A researcher investigates the effectiveness of a beauty product.
C. Indicate whether the following represent qualitative or quantitative data.
1. Place of birth
2. Type of insurance
3. Condition of the patient at time of discharge
4. Number of hospital admissions
D. Indicate whether the following is a discrete or continuous variable.
1. Birth weight
2. Number of times a patient sees her physician during the year
3. Minutes needed to walk a mile
4. Number of possible outcomes in throwing a die.
5. Data obtained in decimal form.
E. Determine the scale of measurement of the following.
1. Weight
2. Educational level
3. License plate number
4. Examination scores
5. Placement in the 100-meter dash
6. Acceleration of a vehicle
7. Number of patients of a hospital in a day
8. Civil status
DATA GATHERING TECHNIQUES
I. Collecting Data
2 Sources of Data
1. Primary sources of data are the government institutions, business agencies, and other organizations.
Example: 1. Data are gathered from the National Statistics Office (NSO).
2. Information derived from personal interviews.
2. Secondary sources are books, encyclopedias, journals, magazines, and research or studies conducted by other individuals.
Various Ways of Collecting Data
1. The Direct or Interview Method - in this method, the researcher or interviewer has direct contact with the interviewee.
The researcher obtains the information needed by asking the interviewee questions. This method is used
in most research. With this method the researcher can get more accurate answers or responses, since clarification can be made if the
interviewee or respondent does not understand the question. However, this method is costly and time-consuming.
Example:
a. A business firm would interview residents of a certain barangay regarding favorite brand of toothpaste, soap or shoes.
b. A nurse would interview patients regarding their birthdates, residence, etc.
2. The Indirect or Questionnaire Method- this method makes use of the questionnaire. The researcher gives or distributes
the questionnaire to the respondents either by personal delivery or by mail. Using this method, the researcher can save a lot of time
and money in gathering the information needed because questionnaires can be given to a large number of respondents at the same
time. However, the researcher cannot expect that all distributed questionnaires will be answered because some of the respondents
simply ignore the questionnaires. In addition, clarification cannot be given to a respondent who does not understand a
question.
3. The Registration Method - this method of gathering data is enforced by certain laws.
Example: registration of births, registration of deaths, registration of vehicles, registration of marriages, registration of licenses.
4. The Experimental Method- this method is usually used to find out cause and effect relationship of certain phenomena
under controlled conditions. Scientific researchers often use this method.
Example: Agriculturists would like to know the effect of a new brand of fertilizer on the growth of plants. The new kind of fertilizer
will be applied to ten sets of plants, while another ten sets of plants will be given the ordinary fertilizer. The growth of the plants will
then be compared to determine which fertilizer is better.
II. Sampling Techniques
This is a procedure used to determine the individuals or members of a sample.
Example: Suppose a guidance counselor of a certain school wants to determine the average weekly allowance of the students, if
there are 2000 students in this school and the guidance counselor decided to use only 100 students as a sample, who will be
included in the sample?
*Sampling techniques are used to answer the question concerning who will be included in the sample.
2 Types of Sampling Techniques
1. Probability Sampling - is a sampling technique wherein each member or element of the population has an equal chance of being
selected as a member of the sample.
Several Probability Sampling Techniques
A. Simple Random Sampling - this is the basic type of probability sampling. Using this technique, each individual in the population has an
equal chance of being drawn into the sample. Selecting the members or elements of our sample using this technique can be done in
two ways, namely, the lottery method and the use of a table of random numbers. Remember that when we use these methods we
should have a complete list of the members of the population.
A.1 Lottery Method
Suppose Mrs. Cruz wants to send five students to attend a 2-day training or seminar in basic computer programming. To
avoid bias in selecting these five students from her 40 students, she can use the lottery method. This is done by assigning a number
to each student and then writing these numbers on pieces of paper. Then, these pieces of paper will be rolled or folded and placed
in a box called lottery box. The lottery box should be thoroughly shaken and then five pieces of paper will be picked or drawn from
the box. The students who were assigned to the numbers chosen will be sent to training. In this case, the selection of the students is
done without bias. Note that we can simply assign 1 to the first student, 2 to the second student, 3 to the third student, and so on.
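The lottery method can be imitated in a few lines of code. The sketch below is only an illustration: the function name `lottery_draw` is made up for this example, and the shuffled list stands in for the shaken lottery box.

```python
import random

def lottery_draw(population_size, sample_size, rng=random):
    """Draw `sample_size` distinct numbers from 1..population_size, like slips from a box."""
    slips = list(range(1, population_size + 1))  # one numbered slip per student
    rng.shuffle(slips)                           # shake the lottery box thoroughly
    return sorted(slips[:sample_size])           # pick the required number of slips

# Mrs. Cruz selects 5 of her 40 students.
chosen = lottery_draw(40, 5)
print(chosen)
```

Because the slips are shuffled at random, each run may send a different set of five students.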
A.2 Table of Random Numbers
We can also use a table of random numbers to select or draw the members of the sample. Below is a portion of a table
of random numbers.
31871 60770 59235 41702
87134 32839 17850 37359
06728 16314 81076 42172
95646 67486 05167 07819
44085 87246 47378 98338
Let us illustrate how these random numbers are used to select the members of the sample. Let us consider the preceding
example wherein Mrs. Cruz wants to select 5 students from her 40 students. Again, we will assign a number to each student, say
from 1 to 40.
Since there are 40 students, we will use two-digit numbers from the table of random numbers when selecting the members
of the sample. This is because the students have been assigned the numbers 01, 02, 03, 04 up to 40. Looking at the first column of
the table of random numbers above, we see that the number formed by the first two digits is 31, hence the student assigned to
number 31 is chosen as a member of the sample. If we proceed down the column, we see that the next number formed is 87, which cannot
be used because we have only 40 members. In a similar manner, the third number is 06, so the student assigned to number 6 is
chosen. Notice that the next two numbers from the table are 95 and 44, numbers we cannot use for the same reason as before. When
we get to the bottom of the column, we move back up to the top of the column and merely shift one digit to the right for the next random number.
Thus, we will have 18 as our next number. This is one of many alternatives; we can use other ways of reading the table, continuing until
we complete the 5 students.
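The reading procedure just described can be sketched in code. The function name `select_from_table` is hypothetical, and two details are assumptions not stated in the text: a number that has already been chosen is skipped, and reading stops as soon as the sample is complete.

```python
def select_from_table(rows, population_size, sample_size, width=2):
    """Read `width`-digit numbers down a column, shifting one digit right after each pass."""
    selected = []
    n_digits = len(rows[0])
    for offset in range(n_digits - width + 1):       # shift one digit to the right per pass
        for row in rows:                             # proceed down the column
            number = int(row[offset:offset + width])
            # Usable only if it is an assigned number and not already chosen (assumption).
            if 1 <= number <= population_size and number not in selected:
                selected.append(number)
                if len(selected) == sample_size:
                    return selected
    return selected

# First column of the table of random numbers above.
first_column = ["31871", "87134", "06728", "95646", "44085"]
print(select_from_table(first_column, 40, 5))  # → [31, 6, 18, 40, 13]
```

The first pass yields 31 and 6, the second pass (shifted one digit) yields 18 and 40, and the third pass yields 13, which completes the five students.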
B. Systematic Sampling
Notice that if we are to select the members of the sample from a large population, the simple random technique is a long
and difficult process. An easier alternative is the systematic sampling technique. To draw the members or elements of the
sample using this method, we select a random starting point and then draw successive elements from the population at a fixed interval. In
other words, we pick every ith element of the population as a member of the sample, where i is the sampling interval.
Let us use the example wherein Mrs. Cruz wants to select 5 students from her 40 students. First, we compute the sampling
interval. This is done by dividing the number of members in the population by the number of members in the sample.
Hence, in our case we shall have i = 40/5 = 8. The next step is to select a random starting point: write the numbers 1, 2, 3, 4, 5, 6, 7, and 8
on pieces of paper and draw one number by lottery. If we draw 5, we start with the 5th student and then select every 8th student after that.
Therefore, the 5th, 13th, 21st, 29th, and 37th students shall be the members of the sample. If, for instance, we
obtain the number 6, then the members of the sample will be the 6th, 14th, 22nd, 30th, and 38th students.
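A minimal sketch of systematic sampling follows, assuming the usual convention that the sampling interval is N/n and the random start is drawn from 1 up to that interval. With N = 40 and n = 5 the interval is 8, so a start of 5 gives the 5th, 13th, 21st, 29th, and 37th students. The function name `systematic_sample` is made up for this illustration.

```python
import random

def systematic_sample(population_size, sample_size, start=None, rng=random):
    """Return the positions chosen by systematic sampling (positions are numbered from 1)."""
    interval = population_size // sample_size  # sampling interval i = N / n
    if start is None:
        start = rng.randint(1, interval)       # random start between 1 and i, inclusive
    return [start + j * interval for j in range(sample_size)]

print(systematic_sample(40, 5, start=5))  # → [5, 13, 21, 29, 37]
print(systematic_sample(40, 5, start=6))  # → [6, 14, 22, 30, 38]
```

Only the starting point is random; once it is drawn, the rest of the sample is fixed by the interval.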
C. Stratified Random Sampling
There are some instances whereby the members of the population do not belong to the same category, class, or group. To
illustrate this, let us suppose that we want to determine the average income of the families in a certain community or barangay. In a
typical barangay, different families belong to different income brackets. If we draw the members of the sample using simple
random sampling, there is a possibility that none, or a disproportionate number, of the families from the low-income,
average-income, or high-income group will be included in the sample. In that case, the study may wrongly conclude, for example,
that the average income of the families living in this barangay is high. This suggests that the sample should be
drawn proportionally from each group or category: the high-, the average-, and the low-income families.
To do this, we will use stratified random sampling. The word stratified comes from the root word strata, which means
groups or categories (the singular form is stratum). When we use this method, we are actually dividing the elements of a population
into different categories or subpopulations. Let us consider the following example.
Example: Suppose a community consists of 5000 families belonging to different income brackets. We will draw 200 families as our
sample using stratified random sampling. Below are the subpopulations and the corresponding number of families belonging to each
subpopulation or stratum.
Strata                     Number of Families
High-income families       1000
Average-income families    2500
Low-income families        1500
                           N = 5000
Solution: The first step is to find the percentage of each stratum. This is done by dividing the number of families in each stratum by
the total number of families. Then, we multiply each percentage by the desired number of families in the sample. The table below
shows how it is done.
Strata     Number of Families   Percentage                Number of Families in the Sample
High       1000                 1000/5000 = 0.2 or 20%    0.2 x 200 = 40
Average    2500                 2500/5000 = 0.5 or 50%    0.5 x 200 = 100
Low        1500                 1500/5000 = 0.3 or 30%    0.3 x 200 = 60
           N = 5000                                       n = 200
From the above table, we see that if we are going to draw 200 members from a population of 5000, we should draw 40
families belonging to the high-income, 100 from the average, and 60 from the low-income group. Observe that the number of
families drawn as sample in each stratum is proportional to the number of families from the population.
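The proportional allocation worked out above can be checked with a short sketch. The function name `proportional_allocation` is made up for this illustration; note that with other figures the rounded counts may not add up exactly to the desired sample size, so a manual adjustment could be needed.

```python
def proportional_allocation(strata_sizes, sample_size):
    """Allocate the sample to each stratum in proportion to the stratum's size."""
    total = sum(strata_sizes.values())  # N, the population size
    return {
        name: round(size * sample_size / total)  # (stratum size / N) x n
        for name, size in strata_sizes.items()
    }

strata = {"High": 1000, "Average": 2500, "Low": 1500}
print(proportional_allocation(strata, 200))
# → {'High': 40, 'Average': 100, 'Low': 60}
```

Within each stratum, the allocated number of families would then be drawn by simple random sampling.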
Sometimes, the population is so large that the use of simple random sampling will prove tedious and difficult. Under this
condition, we can use cluster sampling.
D. Cluster Sampling is sampling wherein groups or clusters, instead of individuals, are randomly chosen. Recall that in
simple random sampling we select the members of the sample individually. In cluster sampling, we first select
groups (clusters) at random, and then we randomly select a sample of elements from each chosen cluster. Cluster sampling is sometimes
called area sampling because it is usually applied when the population is large.
To illustrate the use of this sampling method, let us suppose that we want to determine the average income of the families in
Manila. Let us assume there are 250 barangays in Manila. We can draw a random sample of 20 barangays using simple random
sampling, and then a certain number of families from each of the 20 barangays may be chosen.
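The two steps of the Manila illustration can be sketched as follows. The barangay and family labels, the 100 families per barangay, and the 10 families drawn from each chosen barangay are all hypothetical values picked for the example; only the 250 barangays and the 20 chosen clusters come from the text.

```python
import random

def cluster_sample(clusters, n_clusters, n_per_cluster, rng=random):
    """Randomly choose clusters, then randomly choose elements within each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)  # step 1: draw the clusters
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen                             # step 2: draw within each cluster
    }

# Hypothetical frame: 250 barangays with 100 families each.
barangays = {
    f"Barangay {b}": [f"B{b}-Family {f}" for f in range(1, 101)]
    for b in range(1, 251)
}

sample = cluster_sample(barangays, n_clusters=20, n_per_cluster=10)
print(len(sample), "barangays chosen,", sum(len(v) for v in sample.values()), "families in all")
```

Only the families in the 20 chosen barangays need to be listed, which is what makes the method practical for a large population.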
E. Multi-stage Sampling
Multi-stage sampling is a combination of several of the sampling techniques that we have discussed. Usually this method is used
by researchers who are interested in studying a very large population, say the whole island of Luzon, or even the Philippines. This is
done by starting the selection of the members of the sample using cluster sampling and then dividing each cluster or group into
strata. Then, from each stratum, individuals are drawn randomly using simple random sampling.
2. Non-Probability Sampling is a sampling technique wherein the members of the sample are drawn from the population based on the
judgment of the researchers. The results of a study using this sampling technique are relatively biased. This technique lacks
objectivity of selection; hence, it is sometimes called subjective sampling. Inferences made based on a sample obtained using these
techniques are not so reliable.
Non-probability sampling techniques are used because they are convenient and economical. Researchers use these
methods because they are inexpensive and easy to conduct.
Under this technique, there are several methods which can be used to draw or select the members of the sample.
A. Convenience Sampling - as the name implies, convenience sampling is used because of the convenience it offers to the
researcher. For example, a researcher who wishes to investigate the most popular noontime show may just interview the
respondents by telephone. The result of this interview will be biased because the opinions of those without telephones will
not be included. Although convenience sampling may be used occasionally, we cannot depend on it for making inferences about a
population.
B. Quota Sampling - in this type of sampling, the proportions of the various subgroups in the population are determined
and the sample is drawn to have the same percentages in it. This is very similar to the stratified random sampling discussed above.
The only difference is that the selection of the members of the sample using quota sampling is not done randomly. To illustrate this,
let us suppose that we want to determine the teenagers' favorite brand of T-shirt. If there are 1000 female and 1000 male
teenagers in the population and we want to draw 150 members for our sample, we can select 75 female and 75 male teenagers
from the population without using randomization. This is quota sampling.
C. Purposive Sampling - this is another method of drawing members of the sample using non-probability sampling. Let us
suppose that we want to predict the candidate who will win in the upcoming election. We can conduct the survey or
interview in places or precincts where people voted for the winner in a series of past elections, because we believe that they
will again vote for the winner in the upcoming election. Also, let us suppose that the target is to find out the effectiveness of a
certain kind of shampoo. Of course, bald fellows will not be included in the sample.
III. Determining Sample Size
To determine the sample size from a given population, Slovin's formula is used.
Slovin's formula:
$$n = \frac{N}{1 + Ne^2}$$
where n = sample size
N = population size
e = margin of error
To illustrate, suppose we want to find the average age of the students in Manila. However, due to insufficient time, only the
students in three particular schools were used to estimate the average age. Obviously, the result is not the actual average but just an
estimate, and thus there is usually an error when we use the sample instead of the population.
Example 1: A group of researchers will conduct a survey to find out the opinion of residents of a particular community regarding the
oil price hike. If there are 10000 residents in the community and the researchers plan to use a sample using a 10% margin of error,
what should the sample size be?
Solution: Here, N = 10000 and e = 10% or 0.10. Substituting the given values in the formula, we have:
$$n = \frac{N}{1 + Ne^2} = \frac{10000}{1 + (10000)(0.10)^2} = \frac{10000}{1 + (10000)(0.01)} = \frac{10000}{1 + 100}$$
n = 99.01 or 99
Hence, the researchers will just conduct the survey using 99 residents. A 10% margin of error means that the researchers are
90% confident that the result obtained using the sample will closely approximate the result had they used the population.
Example 2: Suppose that in Example 1, the researchers would like to use a 5 % margin of error. What should be the size of sample?
Solution: Here N = 10000 and e = 5% or 0.05. Substituting the given values in the formula, we have
$$n = \frac{N}{1 + Ne^2} = \frac{10000}{1 + (10000)(0.05)^2} = \frac{10000}{1 + (10000)(0.0025)} = \frac{10000}{1 + 25}$$
n = 384.62 or 385
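Both examples can be verified with a small sketch of Slovin's formula. The function name `slovin_sample_size` is made up for this illustration, and the result is rounded to the nearest whole number, as is done in the two examples.

```python
def slovin_sample_size(population_size, margin_of_error):
    """Slovin's formula: n = N / (1 + N * e**2)."""
    return population_size / (1 + population_size * margin_of_error ** 2)

for e in (0.10, 0.05):
    n = slovin_sample_size(10_000, e)
    print(f"e = {e:.2f}: n = {n:.2f}, or {round(n)} residents")
# e = 0.10: n = 99.01, or 99 residents
# e = 0.05: n = 384.62, or 385 residents
```

Halving the margin of error roughly quadruples the sample size, because e is squared in the denominator.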
Summation Notation
In our study of statistics, we shall be using mathematical symbols. These symbols are useful especially in writing formulas.
The most common notation used in statistics is the summation notation or simply summation ($\Sigma$).
Recall that variables are represented by capital letters. If our variable is age, then we can represent this by X. Hence, if
there are 40 students in a class, we can represent the age of the first student by $X_1$, the age of the second student by $X_2$, the age of
the third student by $X_3$, and so on. If we want to find the sum of these ages, then we can write the sum in this way:
$$X_1 + X_2 + X_3 + X_4 + \cdots + X_{40}$$
To write the sum of n values or measurements in a simpler way, we use the summation notation, represented by the Greek capital
letter sigma:
$$\sum_{i=1}^{40} X_i$$
(read as "the summation of X sub i, from i = 1 to i = 40")
Here i is the index of summation and its value ranges from 1, the lower limit, to 40, the upper limit. Observe also that when
we write the sum of values in summation notation, we replace the subscript of the summed variable by an arbitrary subscript i and
indicate in the index the range of the summation. More examples on writing the summation notation are shown below:
1. $X_1 + X_2 + X_3 + X_4 + \cdots + X_{100} = \sum_{i=1}^{100} X_i$
2. $(Y_4 + 5) + (Y_5 + 5) + (Y_6 + 5) + \cdots + (Y_{20} + 5) = \sum_{i=4}^{20} (Y_i + 5)$
Sometimes instead of writing the given sum in summation notation, we are asked to expand the given summation notation.
Example 1 Expand the following:
1. $\sum_{i=7}^{10} X_i^3$
2. $\sum_{i=2}^{5} (A_i + B_i)$
Solution: Applying the definition of summation, we have
1. $\sum_{i=7}^{10} X_i^3 = X_7^3 + X_8^3 + X_9^3 + X_{10}^3$
2. $\sum_{i=2}^{5} (A_i + B_i) = (A_2 + B_2) + (A_3 + B_3) + (A_4 + B_4) + (A_5 + B_5)$
Suppose the values of a variable are given and we are asked to evaluate the given summation. Then we simply substitute
the values in the expanded form of the summation. Examples on how to evaluate summation notation are as follows:
Example 2 Given the following: $X_1 = 2$, $X_2 = 4$, $X_3 = 5$
$Y_1 = 1$, $Y_2 = 3$, $Y_3 = 7$
Evaluate:
1. $\sum_{i=1}^{3} X_i$
2. $\sum_{i=1}^{3} (X_i + Y_i)$
Solution: Using the summation notation, we have:
1. $\sum_{i=1}^{3} X_i = X_1 + X_2 + X_3 = 2 + 4 + 5 = 11$
2. $\sum_{i=1}^{3} (X_i + Y_i) = (X_1 + Y_1) + (X_2 + Y_2) + (X_3 + Y_3) = (2 + 1) + (4 + 3) + (5 + 7) = 3 + 7 + 12 = 22$
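The two sums in Example 2 can be checked with a few lines of code. This is only a sketch: the values are stored in lists, so $X_1$ is `X[0]`, $X_2$ is `X[1]`, and so on. The second sum works out to 3 + 7 + 12 = 22.

```python
X = [2, 4, 5]  # X_1, X_2, X_3
Y = [1, 3, 7]  # Y_1, Y_2, Y_3

# Summation of X_i from i = 1 to 3.
sum_x = sum(X)

# Summation of (X_i + Y_i) from i = 1 to 3: pair the values up, add each pair, then total.
sum_x_plus_y = sum(x + y for x, y in zip(X, Y))

print(sum_x)         # → 11
print(sum_x_plus_y)  # → 22
```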
PRESENTING AND DESCRIBING DATA
After data have been gathered and checked for possible errors, the next logical step will be to present the data in a manner that is
easy to understand. It should also readily convey the relevant information and the important results at a glance.
Ungrouped Data are data that are not organized, or if arranged, could only be from highest to lowest or lowest to highest.
Grouped Data are data that are organized and arranged into different classes or categories.
Three methods in presenting data
1. Textual
2. Tabular
3. Graphical
TEXTUAL
In textual form, the presentation is in narrative or paragraph form. The data are within the text of the paragraph. This
involves enumerating the important characteristics, giving emphasis on significant figures and identifying important features of the
data. This form may not get immediate interest of the reader. However, it can present a more comprehensive picture of the data
because of further written explanation of its nature.
Example:
1. Nominally, the peso improved by 1.4 percent as of April 14, 2003 compared to its level in 2002, followed by the Thai
baht, which gained 0.86 percent; Indonesian rupiah, 0.68 percent; and Taiwan dollar, 0.2 percent.
Other currencies on the other hand, depreciated during the same period. The Singapore dollar fell 2.33 percent. The South
Korean won slid 2.14 percent while the Japanese yen dropped 0.61 percent. (Phil Daily Inquirer, April 17, 2003, p.B2)
2. Here is the list of scores for the math exam of the top 10 students in the 4th year class:
95 95 94 93 94 93 90 91 91 95.
TABULAR
Sometimes, we could hardly grasp information from textual presentation of data. Thus, we may present data by using
tables.
By organizing data in tables, important features of the data can be readily understood and comparisons can be easily made.
Thus, a table shows complete information regarding the data. A table has the following parts:
1. Heading: it includes the following:
a. Table number: This is for easy reference to the table.
b. Table title: It briefly explains the content of the table.
2. Box head/Column header: It describes the data in each column.
3. Stubs/Row classifier: It shows the classes or categories.
4. Body: This is the main part of the table.
5. Footnote/Source note: This is placed below the table only when the data are not original; that is, it indicates the source of the
data.
Below is a table with all its parts indicated:
Table 1                                                                  (Table Number)
Distribution of Students in XYZ High School According to Year Level     (Table Title)
Year Level        Number of Students                                     (Box head)
First Year        300
Second Year       250
Third Year        285
Fourth Year       215
                  N = 1050
(The year levels in the first column are the stubs; the numbers of students form the body.)
Source: XYZ High School Registrar                                        (Source Note)
1. A FREQUENCY DISTRIBUTION TABLE is a table which shows the data arranged into different classes and the number of cases
which fall into each class. Frequency refers to the number of occurrences of each datum or group of data in a given set.
The frequency for each datum, interval or class is denoted by $f_i$, where i refers to the order of the datum, interval or class.
I. The Frequency Distribution for Ungrouped Data is simply an arrangement of data in a table from lowest to highest which shows
the number of occurrences of each value or datum in a set. This is best used when the range of the values is not too wide.
To illustrate a frequency distribution table for ungrouped data, we have the following:
Table 2
Ungrouped Frequency Distribution for the Ages of 50 Students Enrolled in Statistics
Age     Frequency ($f_i$)
14      4
15      13
16      25
17      5
18      2
19      1
        N = 50
II. The Frequency Distribution for Grouped Data is an arrangement of data into different classes or categories. It involves counting the data which fall into each interval or class.
**Steps in Constructing a Frequency Distribution for grouped data**
1. Find the range R: R = highest value - lowest value.
2. Estimate the number of classes, k. Either of the following formulas may be used:
a. $k = \sqrt{n}$,
b. $k = 1 + 3.322 \log_{10} n$,
where k = number of intervals
n = number of observations
Note: The result for k is rounded up to the next higher integer, NOT to the usual nearest integer. In addition, it is advisable to use the higher of the two figures for k.
3. Estimate the width c of the interval by dividing the range R by the number of classes, k. Round off this estimate to the same
number of significant decimal places as the original set of data.
4. List the lower and upper class limits of the first interval. This interval should contain the smallest observation in the data set. The
starting lower limit could be the lowest observation or any number closest to it.
5. List all the class limits by adding the class width to the limits of the previous interval. The highest class should contain the largest
observation in the data set.
6. Tally the frequencies for each class.
7. Compute the class marks and class boundaries.
a. The class mark or class midpoint is the midpoint of an interval. It is computed by adding the lower limit and the upper limit, then dividing the result by 2. The class mark for the ith interval is
$x_i = \frac{l_i + u_i}{2}$,
where $l_i$ is the lower class limit and $u_i$ is the upper class limit of the ith interval.
b. To find the class boundaries, it is important to know the unit of accuracy of the given data. Data in whole-number form are accurate to the ones unit, data expressed to one decimal place are accurate to the tenths unit, data expressed to two decimal places are accurate to the hundredths unit, and so on.
i. The lower class boundary ($L_i$) is computed by subtracting 1/2 unit from the lower class limit; that is,
$L_i = l_i - \frac{1}{2}(\text{unit})$ or $L_i = l_i - 0.5(\text{unit})$.
ii. The upper class boundary ($U_i$) is computed by adding 1/2 unit to the upper class limit; that is,
$U_i = u_i + \frac{1}{2}(\text{unit})$ or $U_i = u_i + 0.5(\text{unit})$.
Note: This is done to close the gap between two adjacent intervals.
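The seven steps above can be sketched in Python. This is only an illustration, not the handout's own procedure: it uses the 50 STAT 101 scores of Example 1 that follows, and the function name grouped_frequency, the choice to start the first interval at the lowest score, and the choice to round the width up to a whole unit are assumptions made for the sketch.

```python
import math

# The 50 STAT 101 scores listed in Example 1.
scores = [
    18, 28, 15, 10, 47, 31, 32, 29, 58, 48,
    37, 49, 26, 54, 56, 21, 24, 28, 32, 61,
    43, 12, 23, 29, 28, 16, 42, 40, 32, 26,
    48, 36, 39, 22, 40, 20, 63, 54, 30, 17,
    18, 30, 23, 26, 36, 47, 19, 25, 38, 35,
]

def grouped_frequency(data, unit=1):
    """Return (k, width, rows); each row describes one class interval."""
    n = len(data)
    low, high = min(data), max(data)
    data_range = high - low                              # Step 1
    k = max(math.ceil(math.sqrt(n)),                     # Step 2a, rounded up
            math.ceil(1 + 3.322 * math.log10(n)))        # Step 2b, rounded up
    width = math.ceil(data_range / k)                    # Step 3 (assumption: round up)
    rows = []
    lower = low                                          # Step 4 (assumption: start at the minimum)
    while lower <= high:                                 # Step 5
        upper = lower + width - unit
        rows.append({
            "limits": (lower, upper),
            "f": sum(lower <= x <= upper for x in data), # Step 6
            "mark": (lower + upper) / 2,                 # Step 7a
            "boundaries": (lower - 0.5 * unit, upper + 0.5 * unit),  # Step 7b
        })
        lower += width
    return k, width, rows

k, width, rows = grouped_frequency(scores)
for row in rows:
    print(row["limits"], row["boundaries"], row["mark"], row["f"])
```

A different starting limit or width gives a different but equally valid table, so the printed intervals are one possible answer to Example 1, not the only one.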
Example 1:
Construct a frequency distribution for the following scores which were the results on an examination in STAT 101.
18 28 15 10 47 31 32 29 58 48
37 49 26 54 56 21 24 28 32 61
43 12 23 29 28 16 42 40 32 26
48 36 39 22 40 20 63 54 30 17
18 30 23 26 36 47 19 25 38 35
Step 1: Compute the range. R = ___ - ___ = ___
Step 2: Estimate the number of classes.
a. $k = \sqrt{n}$: k = ___
b. $k = 1 + 3.322 \log_{10} n$: k = ___
Step 3: Estimate c, the width of the intervals.
c = R / k = ___ / ___ = ___
Step 4: List the lower and upper limits of the first interval.
1st lower limit: ____
1st upper limit: ____
1st interval: _______
Step 5: List the succeeding intervals.
Lower limits Upper limits Intervals
Since c=___, ___+___=___ and ___+___=___ _______
___+___=___ and ___+___=___ _______
___+___=___ and ___+___=___ _______
___+___=___ and ___+___=___ _______
___+___=___ and ___+___=___ _______
___+___=___ and ___+___=___ _______
___+___=___ and ___+___=___ _______
___+___=___ and ___+___=___ _______
Step 6: Tally the frequencies.
Table 3
Frequency Tally of the Scores in the Examination in STAT 101
Class Intervals Tally Frequency
Step 7: Compute the class marks and class boundaries.
For the 1st interval (i = 1),
$x_1$ = (___ + ___) / 2 = ___ is the 1st class mark
$L_1$ = ___ - (0.5)(___) = ___ - ___ = ___ is the lower class boundary
$U_1$ = ___ + (0.5)(___) = ___ + ___ = ___ is the upper class boundary
For the 2nd interval (i = 2),
$x_2$ = (___ + ___) / 2 = ___ is the class mark
$L_2$ = ___ - (0.5)(___) = ___ - ___ = ___ is the lower class boundary
$U_2$ = ___ + (0.5)(___) = ___ + ___ = ___ is the upper class boundary
Continuing in this manner, we get the values until the last (____) interval:
$x_8$ = (___ + ___) / 2 = ___ is the class mark
$L_8$ = ___ - (0.5)(___) = ___ - ___ = ___ is the lower class boundary
$U_8$ = ___ + (0.5)(___) = ___ + ___ = ___ is the upper class boundary
From these computations we form the table below.
Table 4
Frequency Distribution for the STAT 101 Scores
Class Intervals   Class Boundaries   Class Marks ($x_i$)   Frequency ($f_i$)
III. Relative Frequency Distribution
The fourth column of Table 4, which contains the frequencies, may be replaced by a column containing the relative frequencies. The relative frequency for each interval is found by dividing the class frequency by the total frequency.
Table 5
Relative Frequency Distribution for the STAT 101 Scores
Class Intervals   Class Boundaries   Class Marks ($x_i$)   Relative Frequency
IV. Cumulative Frequency Distribution shows the number of observations falling below a specific value. The cumulative frequency, denoted by $F_i$, associated with the upper class boundary of a particular interval is computed by summing the frequency for that interval and the frequencies of all intervals below it.
Table 6
Cumulative Frequency Distribution for the STAT 101 Scores
Class Boundaries   Cumulative Frequency ($F_i$)
Less than 9.5      0
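As a rough illustration of how the cumulative column is built, the sketch below tallies the STAT 101 scores and accumulates the frequencies. The class intervals (lower limits 10, 17, ..., 59 with width 7) are an assumption chosen to agree with the first row "Less than 9.5"; other valid choices of intervals would give other rows.

```python
from itertools import accumulate

# The 50 STAT 101 scores from Example 1.
scores = [
    18, 28, 15, 10, 47, 31, 32, 29, 58, 48,
    37, 49, 26, 54, 56, 21, 24, 28, 32, 61,
    43, 12, 23, 29, 28, 16, 42, 40, 32, 26,
    48, 36, 39, 22, 40, 20, 63, 54, 30, 17,
    18, 30, 23, 26, 36, 47, 19, 25, 38, 35,
]

# Assumed classes: 10-16, 17-23, ..., 59-65 (width 7).
lower_limits = list(range(10, 66, 7))
freqs = [sum(lo <= x <= lo + 6 for x in scores) for lo in lower_limits]

# Each cumulative frequency is attached to the upper class boundary.
upper_boundaries = [lo + 6 + 0.5 for lo in lower_limits]
cumulative = list(accumulate(freqs))

# Start with the row "Less than 9.5: 0", then one row per class.
table = [(9.5, 0)] + list(zip(upper_boundaries, cumulative))
for boundary, F in table:
    print(f"Less than {boundary}: {F}")
```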
V. Frequency Distribution in Categories
When data are in categories, constructing a frequency distribution is easy. It only involves counting the observations in each category. Adding a column for percentages will also provide valuable information.
Table 7
Distribution of the number of students of each year level in the BS Math program
Year Level Frequency Percentage
First year 10
Second year 8
Third year 6
Fourth year 12
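The percentage column of Table 7 can be filled in by dividing each frequency by the total and multiplying by 100. The short sketch below does this for the frequencies given in the table; the variable names are illustrative only.

```python
# Frequencies from Table 7.
year_levels = {
    "First year": 10,
    "Second year": 8,
    "Third year": 6,
    "Fourth year": 12,
}

total = sum(year_levels.values())

# Percentage = (class frequency / total frequency) x 100, rounded to 2 decimals.
percentages = {level: round(100 * f / total, 2) for level, f in year_levels.items()}

for level, pct in percentages.items():
    print(f"{level}: {year_levels[level]} students, {pct}%")
```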
2. STEM AND LEAF PLOT
Grouping data into classes results in some loss of information. When the data are presented as in Table 4, the original data set can no longer be seen. Suppose we use a different set of classes for the STAT 101 scores and tally the frequencies as shown in Table 8:
Table 8
Class Intervals Tally Frequency
60-64
55-59
50-54
45-49
40-44
35-39
30-34
25-29
20-24
15-19
10-14
If we do not want to lose the original observations, we can replace the tally marks with the last digit of the corresponding data to get Table 9 below.
Table 9
60-64
55-59
50-54
45-49
40-44
35-39
30-34
25-29
20-24
15-19
10-14
The above table can also be presented as
Table 10
6
5*
5
4*
4
3*
3
2*
2
1*
1
Table 10 is called a stem-and-leaf plot, where each line is a stem and each digit to the right of the vertical bar is a leaf. The numbers on the left of the vertical bar are called stem labels. The stem label 5* represents values from 55-59 while the stem label 5 corresponds to values from 50-54. Symbols such as * are called placeholders and are used only to differentiate between two similar stem labels. Any other symbol may be used.
From a stem-and-leaf plot, the original data can be reproduced, unlike in a frequency distribution. One stem and one leaf reproduce one datum or observation. A stem label of 6 and a leaf of 1 give the value 61.
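A stem-and-leaf plot with split stems, like Table 10, can be built mechanically. The sketch below is one possible way to do it for two-digit whole numbers, using the STAT 101 scores; the function name stem_and_leaf and the use of "*" for the upper half of each stem (leaves 5 to 9) follow the convention described above but are otherwise assumptions.

```python
# The 50 STAT 101 scores from Example 1.
scores = [
    18, 28, 15, 10, 47, 31, 32, 29, 58, 48,
    37, 49, 26, 54, 56, 21, 24, 28, 32, 61,
    43, 12, 23, 29, 28, 16, 42, 40, 32, 26,
    48, 36, 39, 22, 40, 20, 63, 54, 30, 17,
    18, 30, 23, 26, 36, 47, 19, 25, 38, 35,
]

def stem_and_leaf(data):
    """Map each stem label to its sorted list of leaves (ones digits)."""
    plot = {}
    for value in sorted(data):
        stem, leaf = divmod(value, 10)       # tens digit, ones digit
        # Leaves 0-4 go on the plain stem, leaves 5-9 on the starred stem.
        label = f"{stem}*" if leaf >= 5 else f"{stem}"
        plot.setdefault(label, []).append(leaf)
    return plot

plot = stem_and_leaf(scores)
for label, leaves in plot.items():
    print(f"{label:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

Reading the stem 6 with its leaves 1 and 3 gives back the original scores 61 and 63.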
The numerical data provided in a frequency distribution can be made more interesting and easier to understand when depicted in graphical form. A graph is a pictorial or geometrical representation of a given set of data.
These may be in the form of a:
1. Frequency polygon 2. Bar graph 3. Stem and leaf display 4. Pie graph 5. Pictograph
The frequency polygon is also sometimes called the line graph because the graph is just a line connecting the points representing the important data in the xy-plane.
Another common device used to show a frequency distribution is the histogram. A histogram consists of a set of rectangles having bases on a horizontal axis which center on the class marks. The base widths correspond to the class size, and the heights of the rectangles correspond to the class frequencies.
The bar graph is an illustration of the data using bars in the xy-plane.
The stem and leaf display is another visual illustration of the distribution of data. This form, however, is feasible only for a small
number of observations with at least two-digit numbers.
Example 6: Using the previous raw data on the Arithmetic Test (from example 1), the stem will be the tens digits arranged in order
and the leaf will be the ones digits. The presentation will be:
8 | 2
7 | 8 2
6 | 2 2 0 2 8 6 5 8
5 | 6 4 4 3 2 0 6 6 5 2 5 7 6
4 | 2 4 8 7 1 7 2 8 8 2 7 2
3 | 7 9 8
2 | 8
Exercise: The following represent the measurements provided by a group in a Physics activity:
5.4 3.2 4.7 3.3 6.1
4.6 2.5 3.2 4.1 5.5
5.9 4.4 3.4 3.8 4.8
4.3 4.2 5.2 4.7 3.8
Construct a stem and leaf display with the ones digits as the stem and the tenths digits as the leaf portion.
The pie graph is also known as the circle graph. Obviously, the presentation makes use of a circle to represent given data that make
up a whole.
With the pictograph or pictogram, picture symbols are used to illustrate or represent the data under consideration. For example, figures of persons are usually used to depict population data, and drawings of cars are used to depict data on car sales.
MEASURES OF CENTRAL TENDENCY
Measures of Central Tendency are numerical descriptive measures which indicate or locate the center of a distribution or data set. In layman's terms, a measure of central tendency is an average. It is a single value which can be considered typical of a set of data as a whole.
Example: In a class of 40 students, the average height would be the typical height of the members of this class as a whole.
More clearly, this average height would normally be the height of the students in the class.
Three measures of central tendency are:
1. The Mean
2. The Median
3. The Mode
MEAN
A. Ungrouped or Raw Data
Among the three measures of central tendency, the mean is the most popular and widely used. It is sometimes called the
arithmetic mean.
The mean of a set of values or measurements is the sum of all the measurements divided by the number of measurements in the
set.
If we compute the mean of the population, we call it the parametric or population mean, and it is denoted by the symbol $\mu$ (read "mu"). If we get the mean of the sample, we call it the sample mean, and it is denoted by $\bar{X}$ (read "X bar"). In our discussion, we shall assume that we are always dealing with a sample unless otherwise specified.
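1. Sample Mean
For ungrouped or raw data, the mean has the following formula:
$\bar{X} = \frac{\sum X}{N}$, where $\bar{X}$ = mean
$\sum X$ = sum of all the measurements
N = number of measurements
When the measurements carry different weights, the weighted mean is used:
$\bar{X} = \frac{\sum XW}{\sum W}$, where $\bar{X}$ = mean
X = measurement or value
W = weight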
Example 1: Below are Maria's subjects and the corresponding number of units and grades she got for the first grading period. Compute her grade point average (GPA).
Subject Units Grade
Math 1 80
English 1 82
Filipino 1 83
Science 2 81
Social Studies 1 80
PEHM 1.5 85
Technology and HE 1 82
Solution:
$\bar{X} = \frac{\sum XW}{\sum W} = \frac{696.5}{8.5} = 81.94$
Therefore, Maria has a GPA of 81.94 for the first grading period.
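The weighted mean in Maria's example can be checked with a few lines of Python. The sketch below multiplies each grade by its units, sums the products, and divides by the total units; the function name weighted_mean is an illustrative choice.

```python
# (subject, units, grade) for Maria's first grading period.
subjects = [
    ("Math", 1, 80),
    ("English", 1, 82),
    ("Filipino", 1, 83),
    ("Science", 2, 81),
    ("Social Studies", 1, 80),
    ("PEHM", 1.5, 85),
    ("Technology and HE", 1, 82),
]

def weighted_mean(values, weights):
    """Sum of value x weight divided by the sum of the weights."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

grades = [grade for _, _, grade in subjects]
units = [unit for _, unit, _ in subjects]

gpa = weighted_mean(grades, units)
print(round(gpa, 2))
```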
B. Grouped Data
Recall that the grouped data are data which have been arranged in a frequency distribution table. To compute the mean for
grouped data, we can use two formulas, namely:
1. The Classmark Formula
2. The Coded Formula
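1. The Classmark Formula
The formula for the mean using the classmark is as follows:
$\bar{X} = \frac{\sum f X_m}{N}$, where f = frequency, $X_m$ = classmark, and N = total frequency
$\sum f X_m = 1828.0$
$\bar{X} = \frac{\sum f X_m}{N} = \frac{1828}{40} = 45.7$. This indicates that the mean score in Mathematics of the group of 40 students is 45.7.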
2. Coded Formula
The computation of the mean can be facilitated by using the coded formula which is shown below:
$\bar{X} = X_{am} + \left(\frac{\sum fd}{N}\right) i$, where $\bar{X}$ = mean
$X_{am}$ = assumed mean
f = frequency
d = coded deviation
N = total frequency
i = class size
Example 1: Consider the data in the preceding example.
Class Interval f d fd
16-23 1 -3 -3
24-31 3 -2 -6
32-39 6 -1 -6
40-47 12 0 0
48-55 10 1 10
56-63 8 2 16
N = 40   $\sum fd = 11$
$\bar{X} = X_{am} + \left(\frac{\sum fd}{N}\right) i$
$\bar{X} = 43.5 + \left(\frac{11}{40}\right)(8)$
$\bar{X} = 43.5 + 2.2$
$\bar{X} = 45.7$
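Both grouped-data formulas can be compared side by side in code. The sketch below applies the classmark formula and the coded formula to the frequency distribution in the example above; the function names and the way the assumed mean is chosen (by the index of its class) are assumptions made for illustration.

```python
# ((lower limit, upper limit), frequency) for the 40 Mathematics scores.
classes = [
    ((16, 23), 1),
    ((24, 31), 3),
    ((32, 39), 6),
    ((40, 47), 12),
    ((48, 55), 10),
    ((56, 63), 8),
]

def classmark_mean(classes):
    """Mean = sum(f * classmark) / N."""
    n = sum(f for _, f in classes)
    return sum(f * (lo + hi) / 2 for (lo, hi), f in classes) / n

def coded_mean(classes, assumed_index):
    """Mean = assumed mean + (sum(f * d) / N) * class size."""
    n = sum(f for _, f in classes)
    lo, hi = classes[assumed_index][0]
    assumed_mean = (lo + hi) / 2                    # classmark of the chosen class
    size = classes[1][0][0] - classes[0][0][0]      # class size i
    # d counts how many classes away each class is from the assumed-mean class.
    sum_fd = sum(f * (j - assumed_index) for j, (_, f) in enumerate(classes))
    return assumed_mean + (sum_fd / n) * size

print(classmark_mean(classes))              # classmark formula
print(coded_mean(classes, assumed_index=3)) # assumed mean 43.5 (class 40-47)
```

Any class may serve as the assumed-mean class; the coded deviations change but the resulting mean does not.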
Characteristics of the Mean
1. The mean is the most appropriate measure of central tendency when the data are in the interval or ratio scale.
2. The mean lies between the largest and smallest values or measurements.
3. There is only one value for the mean for a given set of values or measurements.
4. The mean is easily influenced by extreme values because all values contribute to the average. If there are extremely high values, the mean tends to be high also. If there are extremely low values, the mean tends to be low also.
Exercise 1:
1. A researcher wants to determine the average age of working students. A random sample of 10 students working in one of the
popular fast foods was asked about their ages. The following ages were gathered: 18, 20, 21, 18, 20, 18, 22, 24, 27, and 25.
a. compute the mean age.
b. Interpret the result.
2. For the past 8 weeks of operation, the ABC Orange Drinkers Company recorded the following consumption of nitric acid (in hundred liters): 20, 18.5, 10.6, 18.7, 12.2, 18.4, 21.7, and 13.8. Compute the mean and interpret the result.
3. A certain cinema house recorded the number of moviegoers who watch movies everyday. The following are the results for the
past 7 days of observations.
Day 1 2 3 4 5 6 7
No. of Moviegoers 1150 2180 2000 2150 2500 8000 3520
a. Compute the mean
b. Interpret the result.
4. Jun obtained the following marks in his five subjects for the second grading period. Compute his grade point average.
Subject Units Grade
Math 3 80
English 3 85
Science 5 83
Filipino 3 84
Social Studies 3 88
5. The following are the test scores obtained by III-I students in Statistics. Compute the mean using the
a. Classmark Formula
b. Coded Formula
What is the average score obtained by the students?
Class Interval f
20-24 4
25-29 6
30-34 7
35-39 10
40-44 5
45-49 8
MEDIAN
Median is the middle value of a given set of measurements, provided that the values or measurements are arranged in an array. An
array is an arrangement of values in increasing or decreasing order.
A. Ungrouped Data
To find the median of ungrouped data, we first arrange the values or measurements in an array (either increasing or
decreasing order) and then get the middle value.
Example 1: The following are the ages of the mathematics teachers in San Juan Elementary School: 21, 23, 32, 28, 25, 50, and 48.
Compute the median.
Solution: We first arrange the data in an array:
21, 23, 25, 28, 32, 48 and 50.
Then, we get the middle score, which is 28. Hence, the median is 28.
Example 2: In an English test, 8 students obtained the following scores: 10, 15, 12, 18, 16, 20, 12, and 14. Find the median.
Solution: We first arrange the scores in an array, that is:
10, 12, 12, 14, 15, 16, 18 and 20.
Since the number of scores is even (8 scores), there are two middlemost scores, namely 14 and 15. To get the median, we get the mean of the two middlemost scores. Thus, the median is
$\frac{14 + 15}{2} = 14.5$
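The two ungrouped examples can be verified with Python's standard statistics module, whose median function sorts the data and averages the two middle values when the count is even. This is offered only as a quick check of the hand computation.

```python
import statistics

# Example 1: ages of the mathematics teachers (odd number of values).
ages = [21, 23, 32, 28, 25, 50, 48]

# Example 2: English test scores (even number of values).
test_scores = [10, 15, 12, 18, 16, 20, 12, 14]

print(sorted(ages), statistics.median(ages))
print(sorted(test_scores), statistics.median(test_scores))
```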
B. Grouped Data
For grouped data, we have the following formula for finding the median:
$\tilde{X} = l + \left(\frac{\frac{N}{2} - {<}cf}{f_m}\right) i$
where $\tilde{X}$ = median
l = lower class boundary of the median class
N = total frequency
<cf = less than cumulative frequency of the class listed just above (immediately before) the median class
i = size of the class interval
$f_m$ = frequency of the median class
Steps:
1. Construct the less than cumulative frequency.
2. Determine the median class. This is the class interval containing one-half of the total frequency ($\frac{N}{2}$) in the less than cumulative frequency column. The less than cumulative frequency (<cf) is constructed by adding frequencies successively starting from the lowest class interval.
3. Use the formula to find the median.
Example 1:
Class Interval f <cf
16-23 1 1
24-31 3 4
32-39 6 10
40-47 12 22
48-55 10 32
56-63 8 40
n = 40
In the above frequency distribution, the class interval 40-47 is the median class because it contains one-half of the total frequency ($\frac{N}{2} = \frac{40}{2} = 20$) in the <cf column.
Notice that the lower class boundary of the median class is 39.5, the frequency of the median class is 12, and the less than cf above the median class is 10. Substituting the values in the formula, we have:
$\tilde{X} = l + \left(\frac{\frac{N}{2} - {<}cf}{f_m}\right) i = 39.5 + \left(\frac{20 - 10}{12}\right)(8) = 46.17$
Observe also that the size of the class interval is 8.
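The grouped median computation can be traced in code as follows. The sketch accumulates the less than cumulative frequency until it reaches N/2, then applies the formula; the function name grouped_median and the unit parameter (1 for whole-number data) are assumptions for illustration.

```python
# ((lower limit, upper limit), frequency) from the example above.
classes = [
    ((16, 23), 1),
    ((24, 31), 3),
    ((32, 39), 6),
    ((40, 47), 12),
    ((48, 55), 10),
    ((56, 63), 8),
]

def grouped_median(classes, unit=1):
    """Median = l + ((N/2 - <cf) / f_m) * i for the median class."""
    n = sum(f for _, f in classes)
    half = n / 2
    cf_before = 0                        # <cf of the classes before the current one
    for (lo, hi), f in classes:
        if cf_before + f >= half:        # this class contains the N/2-th value
            lower_boundary = lo - 0.5 * unit
            size = hi - lo + unit        # class size i
            return lower_boundary + (half - cf_before) / f * size
        cf_before += f

median = grouped_median(classes)
print(round(median, 2))
```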
Characteristics of the Median:
1. The median is the most appropriate measure of central tendency for ordinal data.
2. The median lies between the highest and lowest measurements.
3. There is only one value for the median in a given set of measurements.
4. The median is not influenced by extreme values.
5. The median is used when the middle value is desired. It is the value where 50% or half of the distribution lies above it and 50% lies below it.
Exercise 2:
1. In a survey of small businesses in Tondo, 10 bakeries report the following numbers of employees: 15, 14, 12, 19, 13, 14, 15, 18, 13, and 19. Compute the median and interpret the results.
2. A random sample of the savings of third year high school students reveals the following current balances in their bank accounts:
Students A B C D E F G H
Current Balances P340 P180 P140 P360 P180 P170 P340 P290
Compute the median and interpret the result.
3. The following are the lifetimes of 9 light bulbs in thousands of hours.
Light bulb A B C D E F G H I
Lifetime 1.1 1.1 1.2 1.1 1.4 .9 .2 1.2 1.7
4. The table below shows the age distribution of the contestants in a raffle draw sponsored by a popular noon time game show.
Age f
25-29 12
30-34 7
35-39 3
40-44 6
45-49 10
50-54 8
55-59 4
Compute the median and interpret the result.
5. The distribution of the profit earned by 50 sari-sari store owners in one week is shown below:
Profit f
200-249 6
250-299 4
300-349 12
350-399 12
400-449 10
450-499 6
Compute the median and interpret the result.
MODE
Mode is the value which occurs most frequently in a set of measurements or values.
*It is the least common among the three measures of central tendency. However, it is very useful as a measure of
popularity. For example, we might be interested in determining the most popular TV show, most preferred brand of toothpaste or
the most favorite ice cream, the most saleable brand of shoes. In these situations, the mode is the most appropriate measure of
central tendency.*
A. Ungrouped Data
The mode for ungrouped data is fairly easy to find. It is just the value or measurement which occurs the most number of
times. In other words, it is the most popular value.
A distribution may have only one mode. In this case, the distribution is said to be unimodal. Data that have two values for the mode are said to be bimodal. It is also possible for a set of data to be multimodal if there are more than two values for the mode.
Example 1: The data on the number of times 10 mothers go to market every week are shown below:
Mother A B C D E F G H I J
No. of times mother goes to market 2 1 3 3 1 3 2 3 3 1
Find the mode.
Solution: The mode is 3. This means that going to market three times a week is the most common practice among the mothers (5 of the 10 mothers).
Example 2: Find the mode of the following measurements: 20, 15, 20, 14, 18, 15, 6
Solution: The modes are 20 and 15. Hence, the set of data is bimodal.
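Both examples can be checked with the standard statistics module. Its multimode function returns every value tied for the highest count, so it handles unimodal and bimodal data alike; the variable names below are illustrative.

```python
import statistics

# Example 1: number of times each of 10 mothers goes to market per week.
market_trips = [2, 1, 3, 3, 1, 3, 2, 3, 3, 1]

# Example 2: a bimodal set of measurements.
measurements = [20, 15, 20, 14, 18, 15, 6]

modes_1 = statistics.multimode(market_trips)   # all most-frequent values
modes_2 = statistics.multimode(measurements)

print(modes_1)
print(modes_2, "bimodal" if len(modes_2) == 2 else "")
```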
B. Grouped Data
For grouped data, we have the following formula to find the mode:
$M_o = l_m + \left(\frac{f_m - f_a}{2f_m - f_a - f_b}\right) i$
where $M_o$ = mode
$l_m$ = lower class boundary of the modal class
$f_m$ = frequency of the modal class
$f_a$ = frequency above the modal class
$f_b$ = frequency below the modal class
i = size of the class interval
Steps:
1. Find the modal class. This is the class interval with the highest frequency.
2. Use the formula to find the mode.
*It is important to note that the formula for the mode given above holds only for unimodal distributions. For multimodal distributions, the rough mode is given by the formula
Mode = 3(Median) - 2(Mean)
Let us use the distribution of scores of 40 students in Mathematics to illustrate how to compute the mode for grouped data.
Example 1: Find the mode of the data whose frequency distribution is given below
Class Interval f
16-23 1
24-31 3
32-39 6
40-47 12
48-55 10
56-63 8
n = 40
Notice that the class intervals are arranged from the lowest to the highest group. The modal class is the class interval 40-47. The lower class boundary of the modal class is 39.5, the frequency of the modal class is 12, the frequency above the modal class is 6, the frequency below the modal class is 10, and the size of the class interval is 8. Substituting these values in the formula, we have
$M_o = l_m + \left(\frac{f_m - f_a}{2f_m - f_a - f_b}\right) i = 39.5 + \left(\frac{12 - 6}{2(12) - 6 - 10}\right)(8) = 45.5$
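The grouped mode formula can be applied programmatically as sketched below. Here "above" and "below" follow the table layout used in the example: the class listed just before the modal class (32-39) and the class listed just after it (48-55). The function name grouped_mode and the treatment of a missing neighbor as frequency 0 are assumptions of this sketch, and it applies only to unimodal distributions.

```python
# ((lower limit, upper limit), frequency) from the example above.
classes = [
    ((16, 23), 1),
    ((24, 31), 3),
    ((32, 39), 6),
    ((40, 47), 12),
    ((48, 55), 10),
    ((56, 63), 8),
]

def grouped_mode(classes, unit=1):
    """Mode = l_m + ((f_m - f_a) / (2 f_m - f_a - f_b)) * i."""
    freqs = [f for _, f in classes]
    m = freqs.index(max(freqs))                        # position of the modal class
    f_m = freqs[m]
    f_a = freqs[m - 1] if m > 0 else 0                 # class listed above (before)
    f_b = freqs[m + 1] if m < len(freqs) - 1 else 0    # class listed below (after)
    lo, hi = classes[m][0]
    lower_boundary = lo - 0.5 * unit
    size = hi - lo + unit
    return lower_boundary + (f_m - f_a) / (2 * f_m - f_a - f_b) * size

mode = grouped_mode(classes)
print(mode)
```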
Characteristics of the Mode:
1. The mode is the most appropriate measure of central tendency when the data are nominal in scale.
2. The mode is the least reliable among the three measures of central tendency because its value is undefined in some distributions.
3. The mode is used when we want to find the value which occurs most often.
4. The mode is a quick approximation of the average. The mode is sometimes referred to as an inspection average.
Exercise 3:
1. The following table shows the frequency of errors committed by 10 typists per minute.
Typist A B C D E F G H I J
No. of Errors Per Minute 5 3 3 7 2 8 8 7 7 10
Find the mode and interpret it.
2. A random sample of 8 mango trees reveals the following number of fruits they yield.
Mango Tree A B C D E F G H
No. of Fruits 80 70 80 90 82 82 90 82
Find the mode and interpret it.
3. The following are the scores of 9 students in a Mathematics quiz. Compute the mode.
12, 15, 12, 8, 7, 15, 19, 24, 13
4. For 50 days, Pedro recorded the number of cars passing by his street from 10:00 AM to 12:00 noon. The following table shows the distribution.
No. of cars   f
40-44 3
45-49 10
50-54 13
55-59 9
60-64 8
65-69 7
Find the mode and interpret the result.
5. Find the mode of the following frequency distribution:
Class Interval f
100-109 5
110-119 10
120-129 8
130-139 15
140-149 12
150-159 10
OTHER MEASURES OF LOCATION
The measure of central tendency is a measure of location. It indicates the center of a given data. Other descriptive
measures which are used to locate the position of values or scores in the distribution are quartiles, deciles, and percentiles.
Recall that the median is the value where 50% of the distribution falls or lies above it while 50% of the distribution lies
below it. To illustrate, imagine that the line segment below represents the distribution. The midpoint of the line segment is the
median.
In other words, the median is the value which divides the distribution into two equal parts. We define the quartiles, deciles, and percentiles in a similar manner. These descriptive measures (quartiles, deciles, and percentiles) are called fractiles.
Quartiles are values which divide the distribution into four equal parts.
Deciles are values which divide the distribution into ten equal parts.
Percentiles are values which divide the distribution into 100 equal parts.
To illustrate the meaning of the fractiles, let us again consider the line segment to represent the distribution of
measurements:
The first quartile ($Q_1$) is the value where 25% of the distribution lies below it while 75% of the distribution lies above it. The third quartile ($Q_3$) is the value where 75% of the distribution lies below it while 25% of the distribution lies above it. Observe that the median is equal to the second quartile ($Q_2$).
The deciles and percentiles are interpreted in the same way as the quartiles. For example, the sixth decile ($D_6$) is the value where 60% of the distribution lies below it while 40% of the distribution lies above it. Which decile is equivalent to the median? Which percentile is equivalent to the median?
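The relationship between the fractiles and the median can be explored with statistics.quantiles from the Python standard library (Python 3.8 or later). The sketch below uses the teachers' ages from the median example. Note that textbooks and software use several slightly different interpolation conventions for fractiles, so values other than the middle one may differ from a hand computation; the library's default method is used here as an assumption.

```python
import statistics

# Ages of the mathematics teachers from the median example.
ages = [21, 23, 32, 28, 25, 50, 48]

# n=4 gives the 3 quartiles, n=10 the 9 deciles, n=100 the 99 percentiles.
q1, q2, q3 = statistics.quantiles(ages, n=4)
deciles = statistics.quantiles(ages, n=10)
percentiles = statistics.quantiles(ages, n=100)

median = statistics.median(ages)
print("Q2:", q2, "median:", median)
print("D5:", deciles[4], "P50:", percentiles[49])
```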
MEASURES OF VARIABILITY, SKEWNESS, AND KURTOSIS
A. Measures of Variability for Ungrouped Data
Plain information about the measures of central tendency of a given set of data does not suffice to fully describe the data.
Additional information such as the amount of variation among data is important. Thus, we need to talk about measures of
variability.
Measures of variability or dispersion are measures of the average distance of each observation from the center of the distribution.
They measure the homogeneity or heterogeneity of a particular group.
A small measure of variability would indicate that the data are
1. clustered closely around the mean;
2. more homogeneous;
3. less variable;
4. more consistent and;
5. more uniformly distributed.
Consider the following sets of grades in Mathematics of two groups of 5 students each:
Male Group        Female Group
Juan: 70          Juana: 82
Mario: 95         Maria: 80
Antonio: 60       Antonia: 83
Pedro: 80         Petra: 81
Jose: 100         Jesusa: 79
Mean: 81          Mean: 81
The mean grade of both groups is 81. By looking only at the mean grades, we can conclude only that both groups performed equally well in the said test; the means do not show how far apart the grades are from one another. Let us picture the position of each grade on a number line.
Males: 60, 70, 80, 95, 100
Females: 79, 80, 81, 82, 83
Notice that the grades of the males are far apart from each other, while the grades of the females are more compressed or clustered together. Thus, the measure of the center of the distribution is of little help in describing and comparing these two sets of data. By getting the average distance of each item from the center of the distribution, the group can be described more completely and, likewise, similarities and differences can be easily identified.
There are several measures of variability or dispersion. Among them are the range, mean absolute deviation, variance, and standard deviation, to name a few.
1. THE RANGE
The range is the difference between the highest and lowest values. It is the simplest but the most unreliable measure of variability since it uses only two values in the distribution.
The formula for finding the range is shown below:
$R = H_v - L_v$
where R = range
$H_v$ = highest value and
$L_v$ = lowest value
Example 1: Find the range of the grades in Math of the two groups of students in the preceding example.
Solution: Male: 70 95 60 80 100
$R = H_v - L_v$; R = 100 - 60 = 40
Female: 82 80 83 81 79
$R = H_v - L_v$; R = 83 - 79 = 4
The range of grades of the male group is 40 while that of the female group is 4. This shows that the grades of the males are scattered while the grades of the female group are close to each other. It shows further that the females are more homogeneous than the males in their math ability.
However, despite being simple to calculate, the range has the following disadvantages:
1. For a very large sample, it is an unstable descriptive measure of dispersion;
2. Since only two values are used in the computation, the range is an unreliable measure of dispersion;
3. The range of two sets of data composed of different numbers of samples is not directly comparable.
2. THE MEAN ABSOLUTE DEVIATION- A more reliable measure of variability takes into account all the data in the given distribution.
One of them is the mean absolute deviation (MAD).
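The formula for the mean absolute deviation does not appear in this part of the text, so the sketch below assumes the usual definition: the mean of the absolute deviations of the values from their mean, $MAD = \frac{\sum |X - \bar{X}|}{N}$. It is applied to the two groups of grades given earlier; the function name is illustrative.

```python
male = [70, 95, 60, 80, 100]
female = [82, 80, 83, 81, 79]

def mean_absolute_deviation(values):
    """Average of |X - mean| over all values (assumed standard definition)."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

mad_male = mean_absolute_deviation(male)
mad_female = mean_absolute_deviation(female)

print("Male MAD:", mad_male)
print("Female MAD:", mad_female)
```

A larger MAD for the male group agrees with the earlier observation that the male grades are more spread out.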
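The variance is the average of the squared deviations from the mean. It is computed below for the two groups of grades.
Example 1:
Male group:
X     $\bar{X}$   $X - \bar{X}$   $(X - \bar{X})^2$
70    81    -11    121
95    81    14     196
60    81    -21    441
80    81    -1     1
100   81    19     361
$\sum (X - \bar{X})^2 = 1120$
a. Treating the data as a population, the variance is:
$\sigma^2 = \frac{\sum (X - \bar{X})^2}{N} = \frac{1120}{5} = 224$ square units
b. Treating the data as a sample, the variance is:
$s^2 = \frac{\sum (X - \bar{X})^2}{N - 1} = \frac{1120}{5 - 1} = \frac{1120}{4} = 280$ square units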
Female group:
X     $\bar{X}$   $X - \bar{X}$   $(X - \bar{X})^2$
82    81    1      1
80    81    -1     1
83    81    2      4
81    81    0      0
79    81    -2     4
$\sum (X - \bar{X})^2 = 10$
a. Treating the data as a population, the variance is:
$\sigma^2 = \frac{\sum (X - \bar{X})^2}{N} = \frac{10}{5} = 2$ square units
b. Treating the data as a sample, the variance is:
$s^2 = \frac{\sum (X - \bar{X})^2}{N - 1} = \frac{10}{5 - 1} = \frac{10}{4} = 2.5$ square units
Using the variance as a measure of variability for the two sets of grades, the males showed more variability in performance.
Note that the higher the variance, the more variable or far apart are the values from each other.
3. THE STANDARD DEVIATION
Notice from the preceding discussion that the obtained variance is in squared units since we squared the deviation from the
mean, thus, it does not reflect the true meaning of the data being measured. To bring back the result to its original unit, we have to
get the square root of the variance resulting in another measure of variability called the standard deviation.
Standard deviation is the square root of the average squared deviation from the mean, or simply the square root of the variance.
For ungrouped data, the formulas for finding the standard deviation are shown below:
Population standard deviation:
$\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$
Sample standard deviation:
$s = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$
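Thus, the standard deviations of the two sets of grades are as follows:
For the Male Group:
$\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}} = \sqrt{\frac{1120}{5}} = 14.97$ units
$s = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}} = \sqrt{\frac{1120}{4}} = 16.73$ units
For the Female Group:
$\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}} = \sqrt{\frac{10}{5}} = 1.41$ units
$s = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}} = \sqrt{\frac{10}{4}} = 1.58$ units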
Again, we see that the scores of males are more spread out than those of the females.
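The variances and standard deviations of both groups can be reproduced with the standard statistics module: pvariance and pstdev divide by N (population), while variance and stdev divide by N - 1 (sample). The sketch below is only a check of the hand computations.

```python
import statistics

male = [70, 95, 60, 80, 100]
female = [82, 80, 83, 81, 79]

for name, grades in (("Male", male), ("Female", female)):
    print(name)
    print("  population variance:", statistics.pvariance(grades))
    print("  sample variance:    ", statistics.variance(grades))
    print("  population std dev: ", round(statistics.pstdev(grades), 2))
    print("  sample std dev:     ", round(statistics.stdev(grades), 2))
```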
Example 1: The following are the monthly incomes of 8 families living in Block 18 of ABC Subdivision:
P28,000.00 P35,000.00 P40,000.00 P50,000.00
P33,000.00 P39,000.00 P41,000.00 P48,000.00
a. Use the formulas to find the variance and the standard deviation.
b. Interpret the result.
Example 2: The following are the number of assists made by 2 point guards in the PBA on 10 randomly selected games in the 1999
season.
Table 1
Number of Assists Made by Two Point Guards in the 1999 PBA Season
Game Point Guard (A) Point Guard (B)
1 12 8
2 6 10
3 13 9
4 2 12
5 5 5
6 0 1
7 8 4
8 6 7
9 10 9
10 7 3
a. Use the formulas to find the variance and the standard deviation.
b. Interpret the result.
Standard deviation and variance are both reliable measures of the variability or spread of a distribution. However, we cannot use them to compare two sets of data expressed in different units. For instance, it would be difficult to identify in which area a particular player shows more consistency, in giving assists or in making points, if we only know either the variance or the standard deviation. Problems of this nature can be answered using the coefficient of variation.
Coefficient of variation is the ratio of the standard deviation to the mean. It is used to compare the variability of two or more sets of data even when they are expressed in different units of measurement. The formula for the coefficient of variation is shown below:
$cv = \frac{s}{\bar{X}}$
where cv = coefficient of variation
s = standard deviation
$\bar{X}$ = mean
Our previous example showed that Point Guard B is more consistent in giving assists than Point Guard A, as shown by their respective standard deviations. Consider the table below showing Point Guard B's record of assists and points in the randomly selected games which he played. Let us determine in which area he performed more consistently.
Table 2
Number of Assists and the Number of Points Made by Point Guard B
in 10 Randomly-Selected Games in the 1999 PBA Season
Game Number of Assists Number of Points
1 8 18
2 10 20
3 9 22
4 12 16
5 5 35
6 1 12
7 4 23
8 7 25
9 9 30
10 3 15
The table below summarizes the measurements obtained from the two sets by using the appropriate formulas.
Table 3
Measurements of Assists and Points of Point Guard B
           $\bar{X}$   $\sigma_{n-1}$   cv
Assists    6.8         3.46             0.5088
Points     21.6        7.04             0.3259
By just referring to the standard deviation, we cannot say that this particular player performed more consistently in giving assists than in making points, although the number of assists has a smaller standard deviation. These are two different areas with different units and are not directly comparable. However, if we express the standard deviation as a fraction of its mean, then we can tell in which area he performed more consistently. We can do this by solving for their respective coefficients of variation. Thus:
1. Coefficient of variation for giving assists:
$cv_A = \frac{s_A}{\bar{X}_A} = \frac{3.46}{6.8} = 0.5088$
2. Coefficient of variation for making points:
$cv_P = \frac{s_P}{\bar{X}_P} = \frac{7.04}{21.6} = 0.3259$
Using the coefficient of variation as basis, we can now conclude that Point Guard B is more consistent in making points since the coefficient of variation of making points is smaller (0.3259 or 32.59%) than that of giving assists (0.5088 or 50.88%).
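The comparison can be reproduced from the raw data in Table 2. The sketch below computes the sample standard deviation and the mean of each column and takes their ratio; the function name is illustrative. Because the text rounds the standard deviation to two decimals before dividing, the unrounded results here may differ from 0.5088 and 0.3259 in the third or fourth decimal place.

```python
import statistics

# Point Guard B's records in the 10 randomly selected games (Table 2).
assists = [8, 10, 9, 12, 5, 1, 4, 7, 9, 3]
points = [18, 20, 22, 16, 35, 12, 23, 25, 30, 15]

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

cv_assists = coefficient_of_variation(assists)
cv_points = coefficient_of_variation(points)

print("Assists: mean", statistics.mean(assists),
      "s", round(statistics.stdev(assists), 2), "cv", round(cv_assists, 4))
print("Points:  mean", statistics.mean(points),
      "s", round(statistics.stdev(points), 2), "cv", round(cv_points, 4))
```

The smaller coefficient of variation for points supports the conclusion above even though the standard deviation of points is the larger of the two.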
Exercise I
1. Find the range, mean absolute deviation, variance, and standard deviation of the following sets of data:
a. 8 6 5 2 15 9 7
b. 6.5 3.5 4.0 9.3 8.2 6.6 9.5 10
c. 15 18 35 19 41 49 55 63 12
2. The following are the ages of the first 10 female visitors at Ana's birthday:
8 19 43 25 6 15 22 50 18 20
Find the following measurements and interpret each result.
a. Range
b. Variance
c. Standard Deviation
d. Coefficient of Variation
3. The following are the scores of two groups of students who took the make-up test in Arithmetic:
Group A Group B
62 75
58 42
65 35
43 53
72 66
a. Find the following: mean, range, variance, and standard deviation.
b. Answer the following:
i. Which group performed better in the test?
ii. Which group has a more uniform set of scores?
iii. Which group shows more variability in scores?
4. The mean weight of 10 boxes of a certain brand of biscuits is 358 grams, with a standard deviation of 8.65 grams. The biscuits
were purchased from 10 different stores and the average price is P51.60 with a standard deviation of P3.60. Which is more variable,
the weight or the price? Why?
B. Measures of Variability for Grouped Data
Earlier, we learned that grouping results in loss of information and, possibly, distortion of some vital information. Thus, if
possible, we should get the needed measurements from ungrouped data. If, however, we cannot avoid using grouped data, then the
following formulas are suggested:
1. Range: R = UL - LL, where UL = upper limit of the highest class interval and LL = lower limit of the lowest class interval.
2. Standard deviation:
For a population: \[ \sigma = \sqrt{\frac{\sum f (X - \bar{X})^2}{N}} \]
For a sample: \[ s = \sqrt{\frac{\sum f (X - \bar{X})^2}{N - 1}} \]
where X is the class mark
$\bar{X}$ is the mean
f is the frequency
Example: Consider the following table and compute the standard deviation of the scores.
Table 4
Grouped Frequency Distribution
For the Entrance Exam Scores of 60 Students
Class Interval Class Mark (X) Frequency (f)
18-23 20.5 6
24-29 26.5 11
30-35 32.5 17
36-41 38.5 14
42-47 44.5 8
48-53 50.5 3
54-59 56.5 1
N = 60
Since we do not know the exact score of each student within a class, we will use the class marks and assume that all
6 students in the class interval 18-23 have a score of 20.5, all 11 students in the class interval 24-29 have a score of 26.5, and so on.
Computing the mean of this set of data, we find $\bar{X}$ = 34.5.
Now, by using the formula for the standard deviation, computation goes as follows:
Class Interval $X$ $X - \bar{X}$ $(X - \bar{X})^2$ $f$ $f(X - \bar{X})^2$
18-23 20.5 -14 196 6 1176
24-29 26.5 -8 64 11 704
30-35 32.5 -2 4 17 68
36-41 38.5 4 16 14 224
42-47 44.5 10 100 8 800
48-53 50.5 16 256 3 768
54-59 56.5 22 484 1 484
N = 60 $\sum f(X - \bar{X})^2 = 4224$
a. Solving for the population standard deviation, we have:
\[ \sigma = \sqrt{\frac{\sum f (X - \bar{X})^2}{N}} = \sqrt{\frac{4224}{60}} = \sqrt{70.4} = 8.390470785 \approx 8.39 \]
b. Solving for the sample standard deviation, we obtain:
\[ s = \sqrt{\frac{\sum f (X - \bar{X})^2}{N - 1}} = \sqrt{\frac{4224}{59}} = \sqrt{71.59322034} = 8.461277701 \approx 8.46 \]
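The grouped-data computation above can be sketched in code as follows. This is an illustrative sketch only; the helper name `grouped_std` and its `sample` flag are our own assumptions, not a standard library function. As in the worked example, every score in a class is assumed to equal the class mark.

```python
from math import sqrt

# Class marks and frequencies from Table 4 (entrance exam scores).
class_marks = [20.5, 26.5, 32.5, 38.5, 44.5, 50.5, 56.5]
frequencies = [6, 11, 17, 14, 8, 3, 1]


def grouped_std(marks, freqs, sample=False):
    """Standard deviation of grouped data, using class marks as stand-ins
    for the individual scores. Divides by N - 1 when sample is True."""
    n = sum(freqs)
    grouped_mean = sum(f * x for x, f in zip(marks, freqs)) / n
    sum_sq = sum(f * (x - grouped_mean) ** 2 for x, f in zip(marks, freqs))
    divisor = n - 1 if sample else n
    return sqrt(sum_sq / divisor)


sigma = grouped_std(class_marks, frequencies)
s = grouped_std(class_marks, frequencies, sample=True)
print(f"population: {sigma:.2f}, sample: {s:.2f}")
```

The only difference between the two results is the divisor, N = 60 for the population and N - 1 = 59 for the sample.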
Exercise 2:
For the given tables below, compute for the following measurements:
1. Mean 5. Variance
2. Median 6. Standard deviation
3. Mode 7. Coefficient of variation
4. Range
A.
Frequency Distribution for Hourly Rate
Of Workers in XYZ Factory
Class Interval f
100-119 4
120-139 10
140-159 12
160-179 17
180-199 25
200-219 8
220-239 5
240-259 4
N = 85
B.
Frequency Distribution for the Mathematics Test
Scores of 50 First Year Students
Class Interval f
20-24 4
25-29 7
30-34 12
35-39 10
40-44 9
50-54 2
N = 50
C. Measures of Skewness
In the previous chapters, we discussed several measurements that describe, in part, the type of distribution we have.
The measures of central tendency tell us where the center of the distribution is, but not how many observations lie to the left or to
the right of the center. The measures of variability tell us the average distance from the center but do not give us the overall picture
of the distribution. To get an overview of how the data behave in relation to their center, we shall talk about the normal
distribution and the measures of skewness and kurtosis.
Normal distribution is a distribution with a bell-shaped appearance. In a normal distribution, the mean = median =
mode.
In symbols,
\[ \bar{X} = \tilde{X} = M_o \]
Consider the data in Table 1 and the corresponding representation of the data. This distribution is an example of a normal
distribution.
Table 1
Distribution of Correct Answers of 19 Students
Who Participated in a Math Contest
No. of Correct Answers f
1 1
2 2
3 4
4 5
5 4
6 2
7 1
N = 19
For the second distribution (Table 2), we have the following measurements:
1. Mean = 5.58
2. Median = 6.0
3. Mode = 6.0
4. Standard Deviation = 1.07
The graph is obviously not bell-shaped, with the tail going to the left. The bulk of the distribution is on the right. The mean
is less than the median. This implies that the questions are generally easy or that many students in the group are bright.
Now, let us suppose that there are more students who scored lower than 4.0. Let us find out what will happen to the graph
of the distribution.
Table 3
Distribution of Correct Answers of 19 Students
Who Participated in a Math Contest
No. of Correct Answers f
1 3
2 9
3 4
4 2
5 1
6 0
7 0
N = 19
The distribution has mean = 2.4, median = 2.0, mode = 2.0, and standard deviation = 1.07.
The graph is not bell-shaped, with the tail going to the right. The bulk of the distribution is on the left. The mean is greater
than the median. This implies that the questions are difficult or that most students in the group are not prepared for the contest.
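The measurements quoted for Table 3 can be reproduced by expanding the frequency table into individual scores. The following is a minimal sketch, assuming Python's standard `statistics` module and assuming that the quoted standard deviation is the sample version (dividing by N - 1), which is the version that gives 1.07 here.

```python
from statistics import mean, median, mode, stdev

# Frequencies from Table 3: number of correct answers -> number of students.
table_3 = {1: 3, 2: 9, 3: 4, 4: 2, 5: 1, 6: 0, 7: 0}

# Expand the frequency table into one score per student.
scores = [score for score, f in table_3.items() for _ in range(f)]

avg = mean(scores)
mid = median(scores)
most_common = mode(scores)
sd = stdev(scores)  # sample standard deviation (divides by N - 1)

print(f"mean = {avg:.1f}, median = {mid}, mode = {most_common}, sd = {sd:.2f}")
print("mean > median:", avg > mid)
```

The final line confirms the relationship described in the text: the mean lies above the median, so the tail of the distribution is on the right.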
Obviously, the last two graphs indicate that the data in Tables 2 and 3 are not normally distributed. The distribution in each
case is said to be skewed.
Skewness refers to the degree of symmetry or asymmetry of a distribution.
A distribution is skewed to the left if the mean is less than its median. The bulk of the distribution is on the right.
This is otherwise known as negatively skewed.
A distribution is skewed to the right if the mean is greater than its median. The bulk of the distribution is on the left. This
is otherwise known as positively skewed.
In fact, the graph in Figure 2 is said to be negatively skewed or skewed to the left, while the graph in Figure 3 is said to be
positively skewed or skewed to the right. The extent of skewness can be obtained by getting the coefficient of skewness using the
formula:
\[ SK = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}} \]
where SK is the coefficient of skewness.
Let us summarize the measurements obtained from the 3 types of distribution.
Table 4
Summary of Measurements from the Three Distributions
Normal Skewed to the left Skewed to the right
Mean 4.00 5.58 2.40
Median 4.00 6.00 2.00
Mode 4.00 6.00 2.00
Standard deviation 1.53 1.07 1.07
Using the formula to find the coefficient of skewness, we have:
1. For normal distribution:
\[ SK = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}} = \frac{3(4.0 - 4.0)}{1.53} = 0 \]
2. For skewed to the left distribution:
\[ SK = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}} = \frac{3(5.6 - 6.0)}{1.07} = -1.12 \]
3. For skewed to the right distribution:
\[ SK = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}} = \frac{3(2.4 - 2.0)}{1.07} = 1.12 \]
Notice that if
1. SK = 0, the distribution is normal.
2. SK < 0, the distribution is skewed to the left.
3. SK > 0, the distribution is skewed to the right.
Example 1: Find the coefficient of skewness and indicate if the distribution is normal, skewed to the left or skewed to the
right.
72 81 67 83 61 75 78 82 71 67
Solution:
1. Find the mean: Mean = 73.7
2. Find the median: Median = 73.5
3. Find the standard deviation: Std. Dev. = 7.38
4. Find SK:
\[ SK = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}} = \frac{3(73.7 - 73.5)}{7.38} = 0.08 \]
5. Interpretation: Since SK is positive, the distribution is skewed to the right. However, the value is very small, so we can say that the
distribution is almost normal.
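The steps of Example 1 can be sketched in a few lines of code. This is an illustration under stated assumptions: it relies on Python's standard `statistics` module, the function names `coefficient_of_skewness` and `describe` are hypothetical, and the sample standard deviation (dividing by N - 1) is used because it matches the value 7.38 in the solution.

```python
from statistics import mean, median, stdev

# Scores from Example 1.
data = [72, 81, 67, 83, 61, 75, 78, 82, 71, 67]


def coefficient_of_skewness(values):
    """SK = 3 * (mean - median) / standard deviation (sample, N - 1)."""
    return 3 * (mean(values) - median(values)) / stdev(values)


def describe(sk):
    """Classify by the sign of SK, following the rule in the text."""
    if sk < 0:
        return "skewed to the left"
    if sk > 0:
        return "skewed to the right"
    return "normal"


sk = coefficient_of_skewness(data)
print(f"SK = {sk:.2f} ({describe(sk)})")
```

Note that `describe` looks only at the sign of SK. Judging whether a small positive value such as this one is "almost normal" is left to the reader, as in step 5 of the solution.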
Exercise 1
A. Determine the coefficient of skewness, then indicate if the distribution is normal, skewed to the left, or skewed to the right:
1. 130 87 18 56 70 239 87 98 157 87
2. 19 17 14 11 10 8 7 7
B. Describe the distribution of
1. a class which is composed mostly of bright students.
2. a class which is composed mostly of poor students.
3. a Grade VI spelling test administered to fourth year high school students.
4. a first year high school Math test administered to Grade II students.
C. Refer to the table and find the coefficient of skewness. Describe the distribution.
Table 5
Hourly Wages of 100
Skilled Workers in ABS Corporation
Hourly Wage F
140-159 7
160-179 20
180-199 33
200-219 25
220-239 11
240-259 4
N = 100
D. Measures of Kurtosis
Curves of a distribution may have the same coefficient of skewness but may still differ significantly in some aspects.
Measures of central tendency, variability, and skewness do not tell us anything about the peakedness or flatness of a distribution.
Let us look at the three graphs shown in Figure 4.
Distribution A is platykurtic, distribution B is normal or mesokurtic, while distribution C is leptokurtic.
Kurtosis refers to the peakedness or flatness of a distribution.
Mesokurtic is a normal distribution.
Leptokurtic is more peaked than the normal distribution.
Platykurtic is flatter than the normal distribution.
Kurtosis (Ku) is obtained using the following formulas:
For ungrouped data: \[ Ku = \frac{\sum (X - \bar{X})^4}{N s^4} \]
For grouped data: \[ Ku = \frac{\sum f (X_m - \bar{X})^4}{N s^4} \]
where Ku is the kurtosis
$X$ is the raw data
$X_m$ is the class mark
$\bar{X}$ is the mean
$s^4$ is the square of the variance
N is the sample size
A distribution is normal or mesokurtic if Ku = 3, leptokurtic if Ku > 3, and platykurtic if Ku < 3.
Example: For the data 2, 3, 3, 4, 4, 4, 5, 5, 6, we have
$\bar{X}$ = 4.0, $\tilde{X}$ = 4.0, s = 1.22, $\sum (X - \bar{X})^4$ = 36, and $N s^4$ = 19.9380.
Substituting into the formula, we have
\[ Ku = \frac{\sum (X - \bar{X})^4}{N s^4} = \frac{36}{19.9380} = 1.81 \approx 2.0 \]
Therefore, since Ku < 3, the distribution is platykurtic.
The measures of skewness and kurtosis show the extent of departure of a given distribution from normality and allow
comparison of two or more distributions.
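As a rough check of the kurtosis example, the ungrouped formula can be sketched in code. The function names `kurtosis` and `classify` are our own, and s is taken to be the sample standard deviation (dividing by N - 1), which is consistent with s = 1.22 in the example.

```python
from statistics import mean, stdev

# Data from the kurtosis example.
data = [2, 3, 3, 4, 4, 4, 5, 5, 6]


def kurtosis(values):
    """Ku = sum((X - mean)^4) / (N * s^4), with s the sample standard deviation."""
    n = len(values)
    m = mean(values)
    s = stdev(values)
    fourth_powers = sum((x - m) ** 4 for x in values)
    return fourth_powers / (n * s ** 4)


def classify(ku):
    """Compare Ku with 3, following the rule in the text."""
    if ku > 3:
        return "leptokurtic"
    if ku < 3:
        return "platykurtic"
    return "mesokurtic"


ku = kurtosis(data)
print(f"Ku = {ku:.2f} ({classify(ku)})")
```

The script keeps s unrounded, so it gives a value slightly below the 1.81 obtained by hand with s rounded to 1.22. Either way Ku is less than 3, and the classification is the same.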
Exercise 2
1. The number of packs of cigarettes Mang Juan sold during the last 12 days of December are as follows:
10, 15, 5, 21, 7, 25, 90, 14, 18, 20, 10, 12
Determine the following and interpret each result:
a. Range
b. Standard deviation
c. Coefficient of variation
d. Kurtosis
2. The following are the number of minutes used by 13 students in answering a particular problem in Statistics.
58, 24, 31, 28, 27, 27, 38, 23, 28, 29, 20, 24, 21
Determine the following and interpret the result:
a. Mean absolute deviation
b. Standard deviation
c. Coefficient of skewness
d. Kurtosis
3. Table 6 shows the scores obtained by 80 applicants for secretarial position in a certain manufacturing company. Determine the
following and interpret each result:
a. Coefficient of skewness
b. Kurtosis
Table 6
Scores of 80 Applicants
For a Secretarial Position
Scores f
10-18 2
19-27 3
28-36 1
37-45 7
46-54 22
55-63 15
64-72 18
73-81 9
82-90 3
N = 80