
Statistics Notes Part 1

1. Statistics involves collecting, organizing, analyzing, and interpreting data to understand patterns and make informed decisions. A variable is a characteristic that can take different values. Univariate variables involve a single characteristic. 2. Common graphical representations in statistics include bar charts, histograms, pie charts, line charts, scatter plots, and box-and-whisker plots. These visualizations help show patterns, relationships, and the distribution of data. 3. Examples demonstrate how to construct different graphs like histograms, pie charts, and box-and-whisker plots from sample data. Frequency distribution tables are used to organize data before graphing. Graphical representations highlight key findings and relationships in data.

Uploaded by

Lukong Louis
Copyright
© All Rights Reserved

GROUPE TANKOU ENSEIGNEMENT SUPERIEUR (GTES)

Department: AC1/HND1
Course Title: Statistics 1
Semester: First, Course Instructor: LUKONG Louis Y.

1 Statistical Series to a Variable


Definitions and Vocabulary

Statistics is a branch of mathematics that involves collecting, organizing, analyzing, interpreting, presenting, and
drawing conclusions from data. It provides methods for making informed decisions in the face of uncertainty by
using data to understand patterns, relationships, and trends in various phenomena. In essence, statistics helps us
make sense of information and make informed decisions based on evidence and probability.

Variable: In statistics, a variable is a characteristic or attribute that can take on different values. It represents a
feature of interest in a study or experiment.

Univariate Variable: A variable that involves only one characteristic, attribute, or measurement. Analysis of
univariate variables helps in understanding the distribution and patterns within a single variable.

Data Point: A single observation or measurement of a variable. In a dataset, each row typically corresponds to a
data point.

Population: The entire set of individuals, objects, or measurements that the researcher is interested in studying.
It represents the larger group to which study results are intended to be applied.

Sample: A subset of the population selected for a particular study. The goal is for the sample to be representative
of the larger population.

Parameter: A characteristic or measure that describes a population. For example, the population mean or standard
deviation.

Statistic: A characteristic or measure that describes a sample. For example, the sample mean or sample standard
deviation.

Frequency: The number of times a particular value occurs in a dataset. Frequency is often used to construct
frequency distributions.

Distribution: The way in which values of a variable are spread or distributed. Common distributions include
normal, skewed, and uniform distributions.

Central Tendency: A measure that describes the center or average of a distribution. Common measures include
the mean, median, and mode.

Measures of Dispersion: Statistics that quantify the spread or dispersion of a dataset. Common measures include
the range, variance, and standard deviation.

Range: The difference between the maximum and minimum values in a dataset. It provides a simple measure of
the spread of data.

Graphical Representations
In statistics, graphical representation involves visually presenting data to provide insights into patterns, trends,
and relationships. Here are key elements and types of graphical representation commonly used in statistics:

i. Bar Charts:

Description: Bar charts represent data using rectangular bars of lengths proportional to the values they represent.
They are useful for comparing the sizes of different categories.

Use Cases: Displaying categorical data, comparing frequencies, illustrating distribution.

ii. Histograms:

Description: Similar to bar charts, histograms represent the distribution of a dataset. However, histograms are
used for continuous data and group data into intervals (bins).

Use Cases: Showing the frequency distribution of a continuous variable.

iii. Pie Charts:

Description: Pie charts represent data as slices of a circle, with each slice corresponding to a category and its size
proportional to the quantity it represents.

Use Cases: Displaying the composition of a whole, illustrating proportions.

iv. Line Charts:

Description: Line charts connect data points with lines, showing the trend or pattern of a variable over a
continuous range.

Use Cases: Displaying trends, patterns, or changes over time.

v. Scatter Plots:

Description: Scatter plots display individual data points on a two-dimensional graph, with one variable on the x-
axis and another on the y-axis. They are useful for showing relationships between two continuous variables.

Use Cases: Exploring relationships, identifying patterns, detecting outliers.

vi. Box-and-Whisker Plots (Boxplots):

Description: A plot that shows the center, spread, and skewness of a data set. It is constructed by drawing a box
and two whiskers that use the median, the first quartile, the third quartile, and the smallest and the largest values
in the data set between the lower and the upper inner fences. Boxplots display the distribution of a dataset using
quartiles. The box represents the interquartile range (IQR), and whiskers extend to show variability outside the
IQR.

Use Cases: Illustrating the spread of data, identifying outliers.

Example 1: The following table shows the numbers of agricultural and non-agricultural workers for the years
1860–1950. Graph the data using (a) line graphs, (b) bar charts. Add short notes to the diagrams.

Solution:

Fig: Line Graphs showing Agricultural and Non-agricultural Workers (1860-1950)

Fig: Bar Charts showing Agricultural and Non-agricultural Workers (1860-1950)

Exercise: The following table gives the average approximate yield of rice in lbs. per acre in various countries of
the world in 1938–39:

Indicate this by a suitable diagram which will highlight the relative backwardness of India in this regard.

Example 2: Construct a pie chart for the following data: Principal Exporting Countries of Cotton (1,000 bales)—
1955–56

Solution:
Calculations for Pie Chart.

Fig: Pie Chart showing Principal Exporting Countries of Cotton (1955-56)
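The calculations behind a pie chart amount to converting each category's value into a share of the whole and then into a sector angle (share × 360°). A minimal Python sketch of that step follows; the category names and figures are invented for illustration only and are not the cotton export data of Example 2.

```python
def pie_angles(values):
    """Map each category to (percentage of total, sector angle in degrees)."""
    total = sum(values.values())
    return {name: (100 * v / total, 360 * v / total) for name, v in values.items()}

# Hypothetical quantities, made up for this sketch (not from the notes).
exports = {"A": 50, "B": 30, "C": 20}
angles = pie_angles(exports)

for name, (percent, degrees) in angles.items():
    print(f"{name}: {percent:.1f}% of the whole, {degrees:.1f} degrees")
```

The sector angles always add up to 360°, which is a quick check on the arithmetic before drawing the chart.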


Exercise: Draw a pie chart to represent the following data relating to the production cost of a manufacturer:

Cost of Material     $18,360
Cost of Labour       $13,524
Direct Expenses      $3,672
Overhead             $7,344

Example 3: Uncle Bruno owns a garden with 30 black cherry trees. Each tree is of a different height. The height
of the trees (in inches): 61, 63, 64, 66, 68, 69, 71, 71.5, 72, 72.5, 73, 73.5, 74, 74.5, 76, 76.2, 76.5, 77, 77.5, 78,
78.5, 79, 79.2, 80, 81, 82, 83, 84, 85, 87. We can group the data as follows in a frequency distribution table by
setting a range:

Height Range (in)     Number of Trees (Frequency)
60 - 65               3
66 - 70               3
71 - 75               8
76 - 80               10
81 - 85               5
86 - 90               1

These data can now be shown using a histogram. When plotting a histogram, make sure there are no gaps
between the bars.
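The grouping step can be sketched in Python as a check on the table. The function name `frequency_table` is a hypothetical helper; the class limits are taken as written in the table and treated as inclusive at both ends, which is an assumption.

```python
heights = [61, 63, 64, 66, 68, 69, 71, 71.5, 72, 72.5, 73, 73.5, 74, 74.5, 76,
           76.2, 76.5, 77, 77.5, 78, 78.5, 79, 79.2, 80, 81, 82, 83, 84, 85, 87]

# Class limits exactly as listed in the frequency distribution table.
classes = [(60, 65), (66, 70), (71, 75), (76, 80), (81, 85), (86, 90)]

def frequency_table(data, classes):
    """Count how many observations fall in each class (limits inclusive)."""
    return {(low, high): sum(1 for x in data if low <= x <= high)
            for low, high in classes}

freq = frequency_table(heights, classes)
for (low, high), count in freq.items():
    print(f"{low} - {high}: {count}")
```

Note that these class limits leave small gaps (for example between 65 and 66). No tree height in this data set falls in a gap, but for continuous data in general it is safer to use class boundaries that meet, such as 60 to under 65, 65 to under 70, and so on.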

Exercise: In a hospital, there are 20 newborn babies whose ages (in days) in increasing order are as follows: 1, 1,
1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 5. Construct a frequency distribution table and draw a histogram
for the data.

Example 04: The following data are the incomes (in thousands of dollars) for a sample of 12 households. 75, 69,
84, 112, 74, 104, 81, 90, 94, 144, 79, 98. Construct a box-and-whisker plot for these data.

Solution. The following five steps are performed to construct a box-and-whisker plot.

Step 1. First, rank the data in increasing order and calculate the values of the median, the first quartile, the third
quartile, and the interquartile range. The ranked data are 69, 74, 75, 79, 81, 84, 90, 94, 98, 104, 112, 144. For
these data,

Median = (84 + 90) / 2 = 87
Q1 = (75 + 79) / 2 = 77
Q3 = (98 + 104) / 2 = 101
IQR = Q3 − Q1 = 101 − 77 = 24

Step 2. Find the points that are 1.5 × IQR below Q1 and 1.5 × IQR above Q3. These two points are called the
lower and the upper inner fences, respectively.

1.5 × IQR = 1.5 × 24 = 36
Lower inner fence = Q1 − 36 = 77 − 36 = 41
Upper inner fence = Q3 + 36 = 101 + 36 = 137

Step 3. Determine the smallest and the largest values in the given data set within the two inner fences. These two
values for our example are as follows:

Smallest value within the two inner fences = 69

Largest value within the two inner fences = 112

Step 4. Draw a horizontal line and mark the income levels on it such that all the values in the given data set are
covered. Above the horizontal line, draw a box with its left side at the position of the first quartile and the right
side at the position of the third quartile. Inside the box, draw a vertical line at the position of the median. The
result of this step is shown below:

Step 5. By drawing two lines, join the points of the smallest and the largest values within the two inner fences to
the box. These values are 69 and 112 in this example as listed in Step 3. The two lines that join the box to these
two values are called whiskers. A value that falls outside the two inner fences is shown by marking an asterisk
and is called an outlier. This completes the box-and-whisker plot, as shown

In the figure above, about 50% of the data values fall within the box, about 25% of the values fall on the left side
of the box, and about 25% fall on the right side of the box. Also, 50% of the values fall on the left side of the
median and 50% lie on the right side of the median. The data of this example are skewed to the right because the
lower 50% of the values are spread over a smaller range than the upper 50% of the values.

The observations that fall outside the two inner fences are called outliers. These outliers can be classified into two
kinds of outliers—mild and extreme outliers. To do so, we define two outer fences—a lower outer fence at 3.0 ×
IQR below the first quartile and an upper outer fence at 3.0 × IQR above the third quartile. If an observation is
outside either of the two inner fences but within the two outer fences, it is called a mild outlier. An observation
that is outside either of the two outer fences is called an extreme outlier. For Example 04, the two outer fences are
calculated as follows.

3 × IQR = 3 × 24 = 72
Lower outer fence = Q1 − 72 = 77 − 72 = 5
Upper outer fence = Q3 + 72 = 101 + 72 = 173

Because 144 is outside the upper inner fence but inside the upper outer fence, it is a mild outlier.

Using the box-and-whisker plot, we can conclude whether the distribution of our data is symmetric, skewed to the
right, or skewed to the left. If the line representing the median is in the middle of the box and the two whiskers
are of about the same length, then the data have a symmetric distribution. If the line representing the median is
not in the middle of the box and/or the two whiskers are not of the same length, then the distribution of data
values is skewed.

The distribution is skewed to the right if the median is to the left of the center of the box with the right-side
whisker equal to or longer than the whisker on the left side, or if the median is in the center of the box but the
whisker on the right side is longer than the one on the left side. The distribution is skewed to the left if the median
is to the right of the center of the box with the left side whisker equal to or longer than the whisker on the right
side, or if the median is in the center of the box but the whisker on the left side is longer than the one on the right
side.
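The steps of Example 04, together with the mild/extreme outlier rule, can be sketched in Python as follows. The helper names `quartiles` and `box_plot_summary` are hypothetical. The quartiles are taken as the medians of the lower and upper halves of the ranked data, which matches the worked example; for a data set with an odd number of values the sketch leaves the overall median out of both halves, which is one common convention and an assumption here.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]               # values below the middle
    upper = xs[len(xs) - half:]     # values above the middle
    return median(lower), median(xs), median(upper)

def box_plot_summary(data):
    q1, med, q3 = quartiles(data)
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # inner fences
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # outer fences
    inside = [x for x in data if inner[0] <= x <= inner[1]]
    extreme = [x for x in data if x < outer[0] or x > outer[1]]
    mild = [x for x in data if x not in inside and x not in extreme]
    return {
        "q1": q1, "median": med, "q3": q3, "iqr": iqr,
        "inner_fences": inner, "outer_fences": outer,
        "whiskers": (min(inside), max(inside)),
        "mild_outliers": mild, "extreme_outliers": extreme,
    }

incomes = [75, 69, 84, 112, 74, 104, 81, 90, 94, 144, 79, 98]
summary = box_plot_summary(incomes)
print(summary)
```

For the household incomes, the sketch reproduces the values obtained by hand: quartiles 77 and 101, whiskers ending at 69 and 112, and 144 flagged as a mild outlier.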

Exercise: Briefly explain what summary measures are used to construct a box-and-whisker plot. Prepare a box-
and-whisker plot for the following data:

Does this data set contain any outliers?

Example 05: Consider the following set of observations. Take X to be the explanatory variable and Y to be the
response variable.

Draw a scatter plot for the above data. Comment on the suitability of using simple linear regression to describe
the relationship.

Exercise: An auto manufacturing company wanted to investigate how the price of one of its car models
depreciates with age. The research department at the company took a sample of eight cars of this model and
collected the following information on the ages (in years) and prices (in hundreds of dollars) of these cars.

Construct a scatter diagram for these data. Does the scatter diagram exhibit a linear relationship between ages and
prices of cars?

Characteristics of Central Tendency and Dispersion
Measures of Center for Ungrouped Data
We often represent a data set by numerical summary measures, usually called the typical values. A measure of
center gives the center of a histogram or a frequency distribution curve. This section discusses five different
measures of center: the mean, the median, the trimmed mean, the weighted mean, and the mode. However, another
measure of center, the geometric mean, is explained in an exercise following this section. We will learn how to
calculate each of these measures for ungrouped data. Recall that data that give information on
each member of the population or sample individually are called ungrouped data, whereas grouped data are
presented in the form of a frequency distribution table.

1. Mean

The mean, also called the arithmetic mean, is the most frequently used measure of center. These notes use the
words mean and average synonymously. For ungrouped data, the mean is obtained by dividing the sum of all
values by the number of values in the data set:

Mean = Sum of all values / Number of values

The mean calculated for sample data is denoted by x̅ (read as “x bar”), and the mean calculated for population
data is denoted by μ (Greek letter mu).

Calculating Mean for Ungrouped Data: In symbols,

μ = ∑x / N and x̅ = ∑x / n

where ∑x is the sum of all values, N is the population size, n is the sample size, μ is the population mean, and x̅ is
the sample mean.

Example 1: 2014 Profits of 10 U.S. Companies. The table below lists the total profits (in million dollars) of 10
U.S. companies for the year 2014 (www.fortune.com).

Find the mean of the 2014 profits for these 10 companies.

Solution: The variable in this example is the 2014 profits of a company. Let us denote this variable by x. The 10
values of x are given in the above table. By adding these 10 values, we obtain the sum of x values, that is:

∑x = 37,037 + 18,249 + 11,431 + 32,580 + 5346 + 13,057 + 5113 + 5385 + 16,483 + 16,022 = 160,703

Note that the given data include only 10 companies. Hence, they represent a sample with n = 10. Substituting the
values of ∑x and n in the sample formula, we obtain the mean of 2014 profits of 10 companies as follows:

x̅ = ∑x / n = 160,703 / 10 = 16,070.3

Thus, these 10 companies earned an average of $16,070.3 million in profits in 2014.

Example 2: Ages of Employees of a Company. The following are the ages (in years) of all eight employees of a
small company: 53, 32, 61, 27, 39, 44, 49, 57. Find the mean age of these employees.

Solution: Because the given data set includes all eight employees of the company, it represents the population.
Hence, N = 8. We have

∑x = 53 + 32 + 61 + 27 + 39 + 44 + 49 + 57 = 362

The population mean is

μ = ∑x / N = 362 / 8 = 45.25

Thus, the mean age of all eight employees of this company is 45.25 years, or 45 years and 3 months.

Reconsider Example 2. If we take a sample of three employees from this company and calculate the mean age
of those three employees, this mean will be denoted by x̅. Suppose the three values included in the sample are 32,
39, and 57. Then, the mean age for this sample is

x̅ = (32 + 39 + 57) / 3 = 128 / 3 = 42.67 years

If we take a second sample of three employees of this company, the value of x̅ will (most likely) be different.
Suppose the second sample includes the values 53, 27, and 44. Then, the mean age for this sample is

x̅ = (53 + 27 + 44) / 3 = 124 / 3 = 41.33 years

Consequently, we can state that the value of the population mean μ is constant. However, the value of the sample
mean x̅ varies from sample to sample. The value of x̅ for a particular sample depends on what values of the
population are included in that sample.
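The contrast between the fixed population mean and the varying sample mean can be checked with a few lines of Python. This is a minimal sketch using the ages from Example 2; the variable names are illustrative.

```python
def mean(values):
    """Arithmetic mean: sum of all values divided by the number of values."""
    return sum(values) / len(values)

ages = [53, 32, 61, 27, 39, 44, 49, 57]   # all eight employees (the population)

mu = mean(ages)                  # population mean
x_bar_1 = mean([32, 39, 57])     # first sample of three employees
x_bar_2 = mean([53, 27, 44])     # second sample of three employees

print(mu, x_bar_1, x_bar_2)
```

The population mean stays at 45.25 no matter how it is computed, while the two sample means (about 42.67 and 41.33) differ from it and from each other.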

- Trimmed Mean

The mean as a measure of center is impacted by outliers. When a data set contains outliers, we can use either
median or trimmed mean as a measure of the center of a data set.

Note: After we drop k% of the values from each end of a ranked data set, the mean of the remaining values is
called the k% trimmed mean.

Thus, to calculate the trimmed mean for a data set, first we rank the given data in increasing order. Then we drop
k% of the values from each end of the ranked data, where k is any positive number, such as 5, 10, and so on, and
take the mean of the remaining values. Remember that, although we drop a total of 2 × k% of the values (k% from
each end), the result is called the k% trimmed mean. The following example illustrates the calculation of the
trimmed mean.

Example: Money Spent on Books by Students. The following data give the money spent (in dollars) on books
during 2015 by 10 students selected from a small college: 890, 1354, 1861, 1644, 87, 5403, 1429, 1993, 938,
2176. Calculate the 10% trimmed mean.

Solution: To calculate the trimmed mean, first we rank the given data as below.

87, 890, 938, 1354, 1429, 1644, 1861, 1993, 2176, 5403

To calculate the 10% trimmed mean, we drop 10% of the data values from each end of the ranked data.

10% of 10 values = 10 (.10) = 1

Hence, we drop one value from each end of the ranked data. After we drop the two values, one from each end, we
are left with the following eight values: 890, 938, 1354, 1429, 1644, 1861, 1993, 2176.

The mean of these eight values will be called the 10% trimmed mean. Since there are 8 values, n = 8. Adding
these eight values, we obtain ∑x as follows:

∑x = 890 + 938 + 1354 + 1429 + 1644 + 1861 + 1993 + 2176 = 12,285

The 10% trimmed mean will be obtained by dividing 12,285 by 8 as follows:

10% trimmed mean = 12,285 / 8 = 1535.625 ≈ 1535.63

Thus, by dropping 10% of the values from each end of the ranked data for this example, we can state that students
spent an average of $1535.63 on books in 2015.

Since in this data set $87 and $5403 can be considered outliers, it makes sense to drop these two values and
calculate the trimmed mean for the remaining values rather than calculating the mean of all 10 values.
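The trimmed-mean procedure can be sketched in Python as below. The function name `trimmed_mean` is hypothetical, and the sketch rounds the number of dropped values down when k% of n is not a whole number; the notes do not say how to handle that case, so this is an assumption.

```python
def trimmed_mean(data, k_percent):
    """Mean after dropping k% of the ranked values from each end."""
    xs = sorted(data)
    drop = int(len(xs) * k_percent / 100)   # values removed from each end
    kept = xs[drop:len(xs) - drop]
    if not kept:
        raise ValueError("trimming removed every value")
    return sum(kept) / len(kept)

spent = [890, 1354, 1861, 1644, 87, 5403, 1429, 1993, 938, 2176]

ordinary = sum(spent) / len(spent)     # pulled upward by the 5403 value
trimmed = trimmed_mean(spent, 10)      # one value dropped from each end
print(ordinary, trimmed)
```

For the book-spending data the trimmed value is 1535.625, reported in the example as $1535.63, while the ordinary mean of all 10 values is 1777.5.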

- Weighted Mean

In many cases, when we want to find the center of a data set, different values in the data set may have different
frequencies or different weights. We will have to consider the weights of different values to find the correct mean
of such a data set. For example, suppose Maura bought gas for her car four times during Nov 2023. She bought
10 gallons at a price of $2.60 a gallon, 13 gallons at a price of $2.80 a gallon, 8 gallons at a price of $2.70 a gallon,
and 15 gallons at a price of $2.75 a gallon. In such a case, the simple mean of the four prices, calculated as in the
Mean section above, will not give the actual mean price paid by Maura in Nov 2023. We cannot add the four prices
and divide by four to find the average price. That can be done only if she bought the same amount of gas each
time. But because the amount of gas bought each time is different, we will have to calculate the weighted mean in
this case. The amounts of gas bought will be considered the weights here.

When different values of a data set occur with different frequencies, that is, each value of a data set is assigned
a different weight, then we calculate the weighted mean to find the center of the given data set.

To calculate the weighted mean for a data set, we denote the variable by x and the weights by w. We add all the
weights and denote this sum by ∑w. Then we multiply each value of x by the corresponding value of w. The sum
of the resulting products gives ∑xw. Dividing ∑xw by ∑w gives the weighted mean.
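The rule "divide ∑xw by ∑w" can be sketched in Python as follows, using the gas prices and quantities quoted above. The function name `weighted_mean` is illustrative.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of x*w divided by the sum of the weights."""
    sum_xw = sum(x * w for x, w in zip(values, weights))
    sum_w = sum(weights)
    return sum_xw / sum_w

prices = [2.60, 2.80, 2.70, 2.75]    # x: price per gallon (dollars)
gallons = [10, 13, 8, 15]            # w: gallons bought at each price

wm = weighted_mean(prices, gallons)
simple = sum(prices) / len(prices)   # ignores how much gas was bought each time
print(round(wm, 2), round(simple, 4))
```

The weighted mean comes to about $2.72 a gallon, whereas the simple mean of the four prices is $2.7125; the two differ because more gallons were bought at the higher prices.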

Example: Prices and Amounts of Gas Purchased. Maura bought gas for her car four times during Nov 2023. She
bought 10 gallons at a price of $2.60 a gallon, 13 gallons at a price of $2.80 a gallon, 8 gallons at a price of $2.70
a gallon, and 15 gallons at a price of $2.75 a gallon. What is the average price that Maura paid for gas during
Nov 2023?

Solution: Here the variable is the price of gas per gallon, and we will denote it by x. The weights are the number
of gallons bought each time, and we will denote these weights by w. We list the values of x and w in the table
below and find ∑w. Then we multiply each value of x by the corresponding value of w and obtain ∑xw by adding
the resulting values. Finally, we divide ∑xw by ∑w to find the weighted mean.

Price (x)     Gallons (w)     xw
2.60          10              26.00
2.80          13              36.40
2.70          8               21.60
2.75          15              41.25
              ∑w = 46         ∑xw = 125.25

Weighted mean = ∑xw / ∑w = 125.25 / 46 = 2.72

Thus, Maura paid an average of $2.72 a gallon for the gas she bought in Nov 2023.

2. Median

The median is the value that divides a data set that has been ranked in increasing order in two equal halves. If the
data set has an odd number of values, the median is given by the value of the middle term in the ranked data set.
If the data set has an even number of values, the median is given by the average of the two middle values in the
ranked data set.

As is obvious from the definition of the median, it divides a ranked data set into two equal parts. The calculation
of the median consists of the following two steps:

1. Rank the given data set in increasing order.

2. Find the value that divides the ranked data set in two equal parts. This value gives the median.

Note that if the number of observations in a data set is odd, then the median is given by the value of the middle
term in the ranked data. However, if the number of observations is even, then the median is given by the average
of the values of the two middle terms.

Example 1: Compensations of Female CEOs. The table below lists the 2014 compensations of female CEOs of
11 American companies (USA TODAY, May 1, 2015). (The compensation of Carol Meyrowitz of TJX is for the
fiscal year ending in January 2015.)

Find the median for these data.

Solution: To calculate the median of this data set, we perform the following two steps.

Step 1: The first step is to rank the given data. We rank the given data in increasing order as follows:

16.2, 16.9, 19.3, 19.3, 19.6, 21.0, 22.2, 22.5, 28.7, 33.7, 42.1

Step 2: The second step is to find the value that divides this ranked data set in two equal parts. Here there are 11
data values. The sixth value divides these 11 values in two equal parts, with five values on each side. Hence, the
sixth value, 21.0, gives the median.

Thus, the median of 2014 compensations for these 11 female CEOs is $21.0 million. Note that in this example,
there are 11 data values, which is an odd number. Hence, there is one value in the middle that is given by the sixth
term, and its value is the median. Using the value of the median, we can say that half of these CEOs made less
than $21.0 million and the other half made more than $21.0 million in 2014.

Exercise: Cell Phone Minutes Used, the following data give the cell phone minutes used last month by 12
randomly selected persons. 230, 2053, 160, 397, 510, 380, 263, 3864, 184, 201, 326, 721. Find the median for
these data.

NB: The median gives the center of a histogram, with half of the data values to the left of the median and half to
the right of the median. The advantage of using the median as a measure of center is that it is not influenced by
outliers. Consequently, the median is preferred over the mean as a measure of center for data sets that contain
outliers.
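The two-step procedure for the median, covering both the odd and the even case, can be sketched in Python as below. The function name `median_of` is hypothetical; the data are the CEO compensations from Example 1 and, for the even case, the household incomes from Example 04.

```python
def median_of(data):
    """Rank the data, then take the middle value (or average the two middle values)."""
    xs = sorted(data)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                    # odd count: the single middle value
    return (xs[mid - 1] + xs[mid]) / 2    # even count: mean of the two middle values

ceo_pay = [16.2, 16.9, 19.3, 19.3, 19.6, 21.0, 22.2, 22.5, 28.7, 33.7, 42.1]
incomes = [75, 69, 84, 112, 74, 104, 81, 90, 94, 144, 79, 98]

print(median_of(ceo_pay))   # 11 values: the sixth ranked value
print(median_of(incomes))   # 12 values: average of the sixth and seventh
```

Replacing the largest income, 144, by a much larger number would leave the median of the incomes unchanged, which illustrates why the median is not influenced by outliers.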

3. Mode

Mode is a French word that means fashion—an item that is most popular or common. In statistics, the mode
represents the most common value in a data set. The mode is the value that occurs with the highest frequency in
a data set.

Example 1: Speeds of Cars, the following data give the speeds (in miles per hour) of eight cars that were stopped
on I-95 for speeding violations. 77, 82, 74, 81, 79, 84, 74, 78

Find the mode.

Solution: In this data set, 74 occurs twice, and each of the remaining values occurs only once. Because 74 occurs
with the highest frequency, it is the mode. Therefore,

Mode = 74 miles per hour

A major shortcoming of the mode is that a data set may have none or may have more than one mode, whereas it
will have only one mean and only one median. For instance, a data set with each value occurring only once has
no mode. A data set with only one value occurring with the highest frequency has only one mode. The data set in
this case is called unimodal, as in Example 1 above.

A data set with two values that occur with the same (highest) frequency has two modes and is said to be bimodal.
If more than two values in a data set occur with the same (highest) frequency, then the data set contains more than
two modes and it is said to be multimodal.
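The different cases (no mode, one mode, several modes) can be sketched in Python with a frequency count. The function name `modes` is hypothetical, and the two short lists after the speed data are made-up values used only to show the other cases.

```python
from collections import Counter

def modes(data):
    """Return the list of modes; an empty list means the data set has no mode."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []   # every value occurs only once
    return sorted(value for value, count in counts.items() if count == highest)

speeds = [77, 82, 74, 81, 79, 84, 74, 78]   # speeds of cars from Example 1
print(modes(speeds))            # one mode: unimodal
print(modes([1, 2, 3]))         # made-up data: no mode
print(modes([1, 1, 2, 2, 3]))   # made-up data: two modes
```

The length of the returned list tells which case applies: zero for no mode, one for unimodal, two for bimodal, and more than two for multimodal.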

Exercise: Commuting Times of Employees, a small company has 12 employees. Their commuting times (rounded
to the nearest minute) from home to work are 23, 36, 14, 23, 47, 32, 8, 14, 26, 31, 18, and 28, respectively. Find
the mode for these data.

Summary: To summarize, we cannot say for sure which of the various measures of center is a better measure
overall. Each of them may be better under different situations. Probably the mean is the most-used measure of
center, followed by the median. The mean has the advantage that its calculation includes each value of the data
set. The median and trimmed mean are better measures when a data set includes outliers. The mode is simple to
locate, but it is not of much use in practical applications.

Miscellaneous Exercises

1. Twenty randomly selected married couples were asked how long they have been married. Their responses
(rounded to years) are listed below.

a. Calculate the mean, median, and mode for these data.

b. Calculate the 10% trimmed mean for these data.

2. The following data give the 2015 bonuses (in thousands of dollars) of 10 randomly selected Wall Street
managers. 127, 82, 45, 99, 153, 3261, 77, 108, 68, 278.

a. Calculate the mean and median for these data.

b. Do these data have a mode? Explain why or why not.

c. Calculate the 10% trimmed mean for these data.

d. This data set has one outlier. Which summary measures are better for these data?

Measures of Dispersion for Ungrouped Data


The measures of center, such as the mean, median, and mode, do not reveal the whole picture of the distribution
of a data set. Two data sets with the same mean may have completely different spreads. The variation among the
values of observations for one data set may be much larger or smaller than for the other data set. (Note that the
words dispersion, spread, and variation have similar meanings.) Consider the following two data sets on the ages
(in years) of all workers at each of two small companies.

The mean age of workers in both these companies is the same, 40 years. If we do not know the ages of individual
workers at these two companies and are told only that the mean age of the workers at both companies is the same,
we may deduce that the workers at these two companies have a similar age distribution. As we can observe,
however, the variation in the workers’ ages for each of these two companies is very different. As illustrated in the
diagram, the ages of the workers at the second company have a much larger variation than the ages of the workers
at the first company.

Thus, a summary measure such as the mean, median, or mode by itself is usually not a sufficient measure to reveal
the shape of the distribution of a data set. We also need a measure that can provide some information about the
variation among data values. The measures that help us learn about the spread of a data set are called the measures
of dispersion. The measures of center and dispersion taken together give a better picture of a data set than the
measures of center alone. This section discusses four measures of dispersion: range, variance, standard
deviation, and coefficient of variation.

1. Range

The range is the simplest measure of dispersion to calculate. It is obtained by taking the difference between the
largest and the smallest values in a data set.

Finding the Range for Ungrouped Data:

Range = Largest value − Smallest value

Example: Total Areas of Four States. The table below gives the total areas in square miles of the four western
South-Central states of the United States.

Find the range for this data set.

Solution: The largest total area for a state in this data set is 267,277 square miles, and the smallest area is
49,651 square miles. Therefore,

Range = Largest value − Smallest value = 267,277 − 49,651 = 217,626 square miles

Thus, the total areas of these four states are spread over a range of 217,626 square miles.

The range, like the mean, has the disadvantage of being influenced by outliers. In the above example, if the state
of Texas with a total area of 267,277 square miles is dropped, the range decreases from 217,626 square miles to
69,903 − 49,651 = 20,252 square miles. Consequently, the range is not a good measure of dispersion to use for a
data set that contains outliers. This indicates that the range is a nonresistant measure of dispersion.

Another disadvantage of using the range as a measure of dispersion is that its calculation is based on two values
only: the largest and the smallest. All other values in a data set are ignored when calculating the range. Thus, the
range is not a very satisfactory measure of dispersion.
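How strongly a single extreme value moves the range can be seen in a short Python sketch. It reuses the household incomes from Example 04 rather than the state areas, purely as a convenient data set in which 144 was already identified as an outlier; the function name `data_range` is illustrative.

```python
def data_range(data):
    """Range = largest value - smallest value."""
    return max(data) - min(data)

incomes = [75, 69, 84, 112, 74, 104, 81, 90, 94, 144, 79, 98]
without_outlier = [x for x in incomes if x != 144]

full = data_range(incomes)               # uses 144 and 69 only
reduced = data_range(without_outlier)    # uses 112 and 69 only
print(full, reduced)
```

Dropping the one outlier shrinks the range from 75 to 43, even though the other eleven values are unchanged, which is the nonresistance described above.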

2. Variance and Standard Deviation

The standard deviation is the most-used measure of dispersion. The value of the standard deviation tells how
closely the values of a data set are clustered around the mean. In general, a lower value of the standard deviation
for a data set indicates that the values of that data set are spread over a relatively smaller range around the mean.
In contrast, a larger value of the standard deviation for a data set indicates that the values of that data set are spread
over a relatively larger range around the mean.

The standard deviation is obtained by taking the positive square root of the variance. The variance calculated for
population data is denoted by σ² (read as sigma squared), and the variance calculated for sample data is denoted
by s². Consequently, the standard deviation calculated for population data is denoted by σ, and the standard
deviation calculated for sample data is denoted by s. Following are what we will call the basic formulas that are
used to calculate the variance and standard deviation.

σ² = ∑(x − μ)² / N and s² = ∑(x − x̅)² / (n − 1)

σ = √(∑(x − μ)² / N) and s = √(∑(x − x̅)² / (n − 1))

where σ² is the population variance, s² is the sample variance, σ is the population standard deviation, and s is the
sample standard deviation.

The quantity x − μ or x − x̅ in the above formulas is called the deviation of the x value from the mean. The sum
of the deviations of the x values from the mean is always zero; that is,

∑ (x − μ) = 0 and ∑ (x − x̅) = 0.

Note that the denominator in the formula for the population variance is N, but that in the formula for the sample
variance it is n − 1.

In practice it is usually easier to use the equivalent short-cut formulas, which do not require the individual
deviations:

σ² = (∑x² − (∑x)²/N) / N and s² = (∑x² − (∑x)²/n) / (n − 1)

To calculate the variance and standard deviation for a data set with the short-cut formulas, we perform the
following three steps. In the first step, we add all the values of x in the given data set and denote this sum by ∑x.
In the second step, we square each value of x and add these squared values; this sum is denoted by ∑x². In the
third step, we substitute the values of n (or N), ∑x, and ∑x² in the corresponding formula for the variance or
standard deviation and simplify.
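
The following Python sketch may help make the procedure concrete. It computes the variance once from the squared deviations about the mean and once from the sums ∑x and ∑x², and checks that the deviations sum to zero. The function names and the small data set are hypothetical and chosen only for illustration; they do not come from these notes.

```python
import math

def variance_basic(values, sample=True):
    """Variance from the squared deviations about the mean."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    # Divide by n - 1 for a sample, by N for a population.
    return squared_deviations / (n - 1 if sample else n)

def variance_shortcut(values, sample=True):
    """Variance from the sums of x and of x squared."""
    n = len(values)
    sum_x = sum(values)                   # step 1: sum of x
    sum_x2 = sum(x * x for x in values)   # step 2: sum of x squared
    numerator = sum_x2 - sum_x ** 2 / n   # step 3: substitute and simplify
    return numerator / (n - 1 if sample else n)

# Hypothetical data, not taken from the notes.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)
deviation_sum = sum(x - mean for x in data)       # should be zero
pop_variance = variance_basic(data, sample=False)
pop_std_dev = math.sqrt(pop_variance)

print(deviation_sum, pop_variance, pop_std_dev)
print(variance_basic(data), variance_shortcut(data))
```

Both functions should return the same value for the same data; the only difference between the population and sample versions is the denominator.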

Example: Compensations of Female CEOs. The table below gives the 2014 compensations of 11 female CEOs of
American companies. Find the variance and standard deviation for these data.

Solution: Let x denote the 2014 compensations (in millions of dollars) of female CEOs of American companies.

The calculation of ∑x and ∑x² is shown in the table below.

Calculation of the variance involves the following three steps.

Step 1. Calculate ∑x.

The sum of the values in the first column of the table gives the value of ∑x, which is 261.5.

Step 2. Find ∑x².

The value of ∑x² is obtained by squaring each value of x and then adding the squared values. The results of this
step are shown in the second column of the table. Notice that ∑x² = 6849.07.

Step 3. Determine the variance.

Substitute the values of n, ∑x, and ∑x² in the variance formula and simplify. Because the given data are on the
2014 compensations of 11 female CEOs of American companies, we use the formula for the sample variance
with n = 11:

s² = (∑x² − (∑x)²/n) / (n − 1) = (6849.07 − (261.5)²/11) / (11 − 1) = 632.5018 / 10 = 63.2502

To obtain the standard deviation, we take the (positive) square root of the variance:

s = √63.2502 = 7.95

Thus, the standard deviation of the 2014 compensations of these 11 female CEOs of American companies is $7.95
million.
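
As a quick check of the arithmetic, the three totals quoted in this example (n = 11, ∑x = 261.5, ∑x² = 6849.07) can be substituted into the sample variance formula in a few lines of Python. This is only a sketch; the function name is hypothetical, and the individual compensations are not needed because the formula uses only the totals.

```python
import math

def sample_variance_from_sums(n, sum_x, sum_x2):
    """Sample variance computed from n, the sum of x, and the sum of x squared."""
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Totals quoted in the worked example (compensations in millions of dollars).
n, sum_x, sum_x2 = 11, 261.5, 6849.07

variance = sample_variance_from_sums(n, sum_x, sum_x2)
std_dev = math.sqrt(variance)

print(f"s^2 = {variance:.4f}, s = {std_dev:.2f}")  # s^2 = 63.2502, s = 7.95
```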

Summary and Observations


1. The values of the variance and the standard deviation are never negative. That is, the numerator in the
formula for the variance should never produce a negative value. Usually, the values of the variance and standard
deviation are positive, but if a data set has no variation, then the variance and standard deviation are both zero.
For example, if four persons in a group are the same age—say, 35 years—then the four values in the data set are

35, 35, 35, 35

If we calculate the variance and standard deviation for these data, their values will be zero. This is because there
is no variation in the values of this data set.

2. The measurement units of the variance are always the square of the measurement units of the original
data. This is so because the original values are squared to calculate the variance. In the last solved example above,
the measurement units of the original data are millions of dollars. However, the measurement units of the variance
are squared millions of dollars, which, of course, does not make any sense. Thus, the variance of the 2014
compensations of 11 female CEOs of American companies is 63.2502 squared million dollars. But the standard
deviation of these compensations is $7.95 million. The measurement units of the standard deviation are the same
as the measurement units of the original data because the standard deviation is obtained by taking the square root
of the variance.

Note: The variance and standard deviation are nonresistant summary measures because their values are sensitive to
outliers. The presence of outliers in a data set will increase the values of the variance and standard deviation.

Warning: Note that ∑x² is not the same as (∑x)². The value of ∑x² is obtained by squaring the x values and then
adding them. The value of (∑x)² is obtained by squaring the value of ∑x.
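
The difference between the two quantities is easy to see on a tiny data set. The three values below are hypothetical and serve only to illustrate the warning.

```python
# Hypothetical values, used only to contrast the two quantities.
values = [1, 2, 3]

sum_of_squares = sum(x ** 2 for x in values)  # square first, then add: 1 + 4 + 9
square_of_sum = sum(values) ** 2              # add first, then square: (1 + 2 + 3)^2

print(sum_of_squares, square_of_sum)  # 14 36
```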

3. Coefficient of Variation

One disadvantage of the standard deviation as a measure of dispersion is that it is a measure of absolute variability
and not of relative variability. Sometimes we may need to compare the variability for two different data sets that
have different units of measurement. In such cases, a measure of relative variability is preferable. One such
measure is the coefficient of variation.

The coefficient of variation, denoted by CV, expresses the standard deviation as a percentage of the mean and is
computed as follows:

For population data: CV = (σ / μ) × 100%
For sample data: CV = (s / x̅) × 100%

Note that the coefficient of variation does not have any units of measurement, as it is always expressed as a
percent.

Thus, to calculate the coefficient of variation, we perform the following steps: First we calculate the mean and
standard deviation for the given data set. Then we divide the standard deviation by the mean and multiply the
answer by 100%. The following example shows the calculation of the coefficient of variation.

Example: Salaries and Education. The yearly salaries of all employees working for a large company have a mean
of $72,350 and a standard deviation of $12,820. The years of schooling (education) for the same employees have
a mean of 15 years and a standard deviation of 2 years. Is the relative variation in the salaries higher or lower than
that in years of schooling for these employees? Answer the question by calculating the coefficient of variation for
each variable.

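Solution: Because the two variables (salary and years of schooling) have different units of measurement (dollars
and years, respectively), we cannot directly compare the two standard deviations. Hence, we calculate the
coefficient of variation for each of these data sets. Because the data cover all employees of the company, we use
the population formula.

CV for salaries = (σ / μ) × 100% = (12,820 / 72,350) × 100% = 17.72%
CV for years of schooling = (σ / μ) × 100% = (2 / 15) × 100% = 13.33%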

Thus, the standard deviation for salaries is 17.72% of its mean and that for years of schooling is 13.33% of its
mean. Since the coefficient of variation for salaries has a higher value than the coefficient of variation for years
of schooling, the salaries have a higher relative variation than the years of schooling.
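
The comparison above can be reproduced with a short Python sketch. The function name is hypothetical; the means and standard deviations are the ones given in the example.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100

# Figures from the example: salaries in dollars, schooling in years.
cv_salary = coefficient_of_variation(12_820, 72_350)
cv_schooling = coefficient_of_variation(2, 15)

print(f"Salaries: {cv_salary:.2f}%")      # Salaries: 17.72%
print(f"Schooling: {cv_schooling:.2f}%")  # Schooling: 13.33%
print("Salaries vary more" if cv_salary > cv_schooling else "Schooling varies more")
```

Because each result is a ratio of two quantities in the same unit, the units cancel, which is what makes the two percentages comparable.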

Note that the coefficient of variation for salaries in the above example is 17.72%. This means that if we assume
that the mean of salaries for these employees is 100, then the standard deviation of salaries is 17.72. Similarly, if
the mean of years of schooling for these employees is 100, then the standard deviation of years of schooling is
13.33.

Exercise: The following data give the prices of seven textbooks randomly selected from a university bookstore.
$89, $170, $104, $113, $56, $161, $147

a. Find the mean for these data. Calculate the deviations of the data values from the mean. Is the sum of these
deviations zero?

b. Calculate the range, variance, standard deviation and coefficient of variation.

Mean, Variance, and Standard Deviation for Grouped Data


1. Mean for Grouped Data

We learned before that the mean is obtained by dividing the sum of all values by the number of values in a data
set. However, if the data are given in the form of a frequency table, we no longer know the values of individual
observations. Consequently, in such cases, we cannot obtain the sum of individual values. Instead, we find an
approximation for the sum of these values using the procedure explained below. The formulas used to calculate
the mean for grouped data follow.

Mean for population data: μ = ∑mf / N
Mean for sample data: x̅ = ∑mf / n

where m is the midpoint and f is the frequency of a class.

To calculate the mean for grouped data, first find the midpoint of each class and then multiply the midpoints by
the frequencies of the corresponding classes. The sum of these products, denoted by ∑mf, gives an approximation
for the sum of all values. To find the value of the mean, divide this sum by the total number of observations in the
data.

Example 1: Daily Commuting Times for Employees. The table below gives the frequency distribution of the daily
one-way commuting times (in minutes) from home to work for all 25 employees of a company.

Calculate the mean of the daily commuting times.

Solution: Because the data set includes all 25 employees of the company, it represents the population.

The table below shows the calculation of ∑mf. Note that in the table, m denotes the midpoints of the classes.

To calculate the mean, we first find the midpoint of each class. The class midpoints are recorded in the third
column of the table. The products of the midpoints and the corresponding frequencies are listed in the fourth
column. The sum of the fourth column values, denoted by ∑mf, gives the approximate total daily commuting
time (in minutes) for all 25 employees. The mean is obtained by dividing this sum by the total frequency.
Therefore,

μ = ∑mf / N = 535 / 25 = 21.40 minutes

Thus, the employees of this company spend an average of 21.40 minutes a day commuting from home to work.
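
The grouped-mean procedure can be sketched in Python as follows. The frequency table itself is not reproduced in these notes, so the class limits and frequencies below are an assumption: they were chosen only so that they agree with the totals stated in the text (25 employees and ∑mf = 535). The function name is also hypothetical.

```python
# Assumed (lower limit, upper limit, frequency) for each class.
# These values are illustrative; they match N = 25 and sum of mf = 535.
classes = [(0, 10, 4), (10, 20, 9), (20, 30, 6), (30, 40, 4), (40, 50, 2)]

def grouped_mean(classes):
    """Approximate mean of grouped data using class midpoints."""
    total_frequency = sum(f for _, _, f in classes)
    # Midpoint m of each class times its frequency f, summed over all classes.
    sum_mf = sum((lower + upper) / 2 * f for lower, upper, f in classes)
    return sum_mf / total_frequency

mean_time = grouped_mean(classes)
print(f"Approximate mean commuting time: {mean_time:.2f} minutes")
```

The result is an approximation because every observation in a class is treated as if it were equal to the class midpoint.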

Exercise: Number of Orders Received. The table below gives the frequency distribution of the number of orders
received each day during the past 50 days at the office of a mail-order company.

Calculate the mean.

2. Variance and Standard Deviation for Grouped Data

Following are what we will call the basic formulas that are used to calculate the population and sample variances
for grouped data:

σ² = ∑f(m − μ)² / N and s² = ∑f(m − x̅)² / (n − 1)

where σ² is the population variance, s² is the sample variance, and m is the midpoint of a class. In either case, the
standard deviation is obtained by taking the positive square root of the variance.

Again, the short-cut formulas are more efficient for calculating the variance and standard deviation:

σ² = (∑m²f − (∑mf)²/N) / N and s² = (∑m²f − (∑mf)²/n) / (n − 1)

Example: Daily Commuting Times of Employees. The following table (the same data as in the previous example)
gives the frequency distribution of the daily one-way commuting times (in minutes) from home to work
for all 25 employees of a company.

Calculate the variance and standard deviation.

Solution: All four steps needed to calculate the variance and standard deviation for grouped data are shown after
the table below.

Step 1. Calculate the value of ∑mf.

To calculate the value of ∑mf, first find the midpoint m of each class (see the third column in the table) and then
multiply the corresponding class midpoints and class frequencies (see the fourth column). The value of ∑mf is
obtained by adding these products.

Thus, ∑mf = 535

Step 2. Find the value of ∑m²f.

To find the value of ∑m²f, square each m value and multiply this squared value of m by the corresponding
frequency (see the fifth column in the table). The sum of these products (that is, the sum of the fifth column) gives
∑m²f.

Hence, ∑m²f = 14,825

Step 3. Calculate the variance.

Because the data set includes all 25 employees of the company, it represents the population. Therefore, we use
the formula for the population variance:

σ² = (∑m²f − (∑mf)²/N) / N = (14,825 − (535)²/25) / 25 = 3376 / 25 = 135.04

Step 4. Calculate the standard deviation.

To obtain the standard deviation, take the (positive) square root of the variance:

σ = √135.04 = 11.62

Thus, the standard deviation of the daily commuting times for these employees is 11.62 minutes.
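
The four steps can be sketched in Python as shown here. As before, the frequency table is not reproduced in these notes, so the class limits and frequencies are an assumption chosen to agree with the totals stated in the text (N = 25, ∑mf = 535, ∑m²f = 14,825). The function names are hypothetical.

```python
import math

# Assumed (lower limit, upper limit, frequency) for each class.
# Illustrative values that match the totals quoted in the worked example.
classes = [(0, 10, 4), (10, 20, 9), (20, 30, 6), (30, 40, 4), (40, 50, 2)]

def grouped_sums(classes):
    """Return the total frequency, the sum of m*f, and the sum of m^2*f."""
    total_f = sum_mf = sum_m2f = 0
    for lower, upper, f in classes:
        m = (lower + upper) / 2   # class midpoint
        total_f += f
        sum_mf += m * f           # step 1
        sum_m2f += m * m * f      # step 2
    return total_f, sum_mf, sum_m2f

def grouped_variance(classes, sample=False):
    """Short-cut variance for grouped data (population by default)."""
    n, sum_mf, sum_m2f = grouped_sums(classes)
    numerator = sum_m2f - sum_mf ** 2 / n      # step 3
    return numerator / (n - 1 if sample else n)

n, sum_mf, sum_m2f = grouped_sums(classes)
variance = grouped_variance(classes)
std_dev = math.sqrt(variance)                  # step 4

print(n, sum_mf, sum_m2f)
print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
```

Passing `sample=True` switches the denominator to n − 1, which would be appropriate if the 25 employees were a sample rather than the whole company.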

Exercise: Number of Orders Received. The following data give the frequency distribution of the number of orders
received each day during the past 50 days at the office of a mail-order company. Calculate the variance and
standard deviation.

Measures of Position
A measure of position determines the position of a single value in relation to other values in a sample or a
population data set.

1. Quartiles and Interquartile Range

Quartiles are the summary measures that divide a ranked data set into four equal parts. Three measures will divide
any data set into four equal parts. These three measures are the first quartile (denoted by Q1), the second quartile
(denoted by Q2), and the third quartile (denoted by Q3). The data should be ranked in increasing order before the
quartiles are determined. The quartiles are defined as follows. Note that Q1 and Q3 are also called the lower and
the upper quartiles, respectively.

Quartiles are three values that divide a ranked data set into four equal parts. The second quartile is the same as the
median of a data set. The first quartile is the median of the observations that are less than the median, and the third
quartile is the median of the observations that are greater than the median.

The figure below describes the positions of the three quartiles.

Approximately 25% of the values in a ranked data set are less than Q1 and about 75% are greater than Q1. The
second quartile, Q2, divides a ranked data set into two equal parts; hence, the second quartile and the median are
the same. Approximately 75% of the data values are less than Q3 and about 25% are greater than Q3.

The difference between the third quartile and the first quartile for a data set is called the interquartile range (IQR),
which is a measure of dispersion.

Example: Commuting Times for College Students. A sample of 12 commuter students was selected from a college.
The following data give the typical one-way commuting times (in minutes) from home to college for these 12
students: 29, 14, 39, 17, 7, 47, 63, 37, 42, 18, 24, 55.

(a) Find the values of the three quartiles.

(b) Where does the commuting time of 47 fall in relation to the three quartiles?

(c) Find the interquartile range.

Solution: (a) We perform the following steps to find the three quartiles.

Step 1. First, we rank the given data in increasing order as follows: 7, 14, 17, 18, 24, 29, 37, 39, 42, 47, 55, 63.

Step 2. We find the second quartile, which is also the median. In a total of 12 data values, the median lies between
the sixth and seventh terms. Thus, the median and, hence, the second quartile is given by the average of the sixth
and seventh values in the ranked data set, that is, the average of 29 and 37. Thus, the second quartile is:

Q2 = (29 + 37) / 2 = 33 minutes

Step 3. We find the median of the data values that are smaller than Q2, and this gives the value of the first quartile.
The values that are smaller than Q2 are:

7 14 17 18 24 29

The value that divides these six data values into two equal parts is given by the average of the two middle values,
17 and 18. Thus, the first quartile is:

Q1 = (17 + 18) / 2 = 17.5 minutes

Step 4. We find the median of the data values that are larger than Q2, and this gives the value of the third quartile.
The values that are larger than Q2 are:

37 39 42 47 55 63

The value that divides these six data values into two equal parts is given by the average of the two middle values,
42 and 47. Thus, the third quartile is:

Q3 = (42 + 47) / 2 = 44.5 minutes

Now we can summarize the calculation of the three quartiles in the following figure.

The value of Q1 = 17.5 minutes indicates that 25% of these 12 students in this sample commute for less than 17.5
minutes and 75% of them commute for more than 17.5 minutes. Similarly, Q2 = 33 indicates that half of these 12
students commute for less than 33 minutes and the other half of them commute for more than 33 minutes.

The value of Q3 = 44.5 minutes indicates that 75% of these 12 students in this sample commute for less than 44.5
minutes and 25% of them commute for more than 44.5 minutes.

(b) By looking at the position of 47 minutes, we can state that this value lies in the top 25% of the commuting
times.
(c) The interquartile range is given by the difference between the values of the third and the first quartiles. Thus,

IQR = Interquartile range = Q3 − Q1 = 44.5 − 17.5 = 27 minutes
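
The quartile procedure used in this example can be sketched in Python as follows. The function name is hypothetical. The sketch assumes the method described in these notes, in which Q1 and Q3 are the medians of the lower and upper halves of the ranked data, with the middle value left out of both halves when the number of values is odd. Other software may use different conventions and give slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ranked = sorted(values)            # step 1: rank the data
    n = len(ranked)
    lower_half = ranked[: n // 2]      # values positioned below the median
    upper_half = ranked[(n + 1) // 2:] # values positioned above the median
    return median(lower_half), median(ranked), median(upper_half)

# Commuting times (in minutes) of the 12 students in the example.
times = [29, 14, 39, 17, 7, 47, 63, 37, 42, 18, 24, 55]

q1, q2, q3 = quartiles(times)
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 17.5 33.0 44.5 27.0
print(47 > q3)          # True: 47 minutes lies in the top 25% of the times
```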

Exercise: Ages of Employees. The following are the ages (in years) of nine employees of an insurance company:

47 28 39 51 33 37 59 24 33

(a) Find the values of the three quartiles. Where does the age of 28 years fall in relation to the ages of these
employees?

(b) Find the interquartile range.

2 Statistical Series of Two Variables


TO BE CONTINUED…
