SQQS1013 Ch2 A122
2.1 INTRODUCTION
DESCRIPTIVE STATISTICS
Raw data - Data recorded in the sequence in which they are collected, before they are processed or ranked.
Array data - Raw data that are arranged in ascending or descending order.
Example 1
Here is a list of questions asked in a large statistics class and the raw data given by one of the students:
1. What is your sex (m = male, f = female)? Answer: m
2. How many hours did you sleep last night? Answer: 5 hours
3. Randomly pick a letter, S or Q. Answer: S
4. What is your height in inches? Answer: 67 inches
5. What's the fastest you've ever driven a car (mph)? Answer: 110 mph
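A relative frequency distribution is a listing of all categories along with their relative frequencies (given as proportions or percentages). It is commonplace to give the frequency and relative frequency distributions together.
Calculating the relative frequency and percentage of a category:
\[ \text{Relative frequency of a category} = \frac{\text{Frequency of that category}}{\text{Sum of all frequencies}} \]
\[ \text{Percentage} = \text{Relative frequency} \times 100 \]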
Example 3
A sample of UUM staff-owned vehicles produced by Proton was identified and the make of each noted. The resulting sample follows ( W = Wira, Is = Iswara, Wj = Waja, St = Satria, P = Perdana, Sv = Savvy): Construct a frequency distribution table for these data with their relative frequency and percentage.
W   Is  Wj  Wj  St
W   W   Is  Sv  W
P   W   Wj  W   W
Is  Wj  Sv  Is  W
Is  Is  W   P   W
P   W   W   Sv  St
Is  W   W   Wj  St
W   Is  Wj  Wj  P
St  W   St  W   Wj
Wj  Wj  W   W   Sv

Solution:

Type of Vehicle   Frequency   Relative Frequency   Percentage (%)
Wira (W)              19
Iswara (Is)            8
Perdana (P)            4
Waja (Wj)             10
Satria (St)            5
Savvy (Sv)             4
Total                 50
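The relative frequency and percentage columns for Example 3 can be checked with a short script. This is a minimal sketch: the sample is transcribed from the example, and the names `sample`, `counts` and `table` are only illustrative.

```python
from collections import Counter

# Sample of 50 vehicle makes, transcribed from Example 3.
sample = """
W Is Wj Wj St
W W Is Sv W
P W Wj W W
Is Wj Sv Is W
Is Is W P W
P W W Sv St
Is W W Wj St
W Is Wj Wj P
St W St W Wj
Wj Wj W W Sv
""".split()

counts = Counter(sample)   # frequency of each category
n = len(sample)            # sum of all frequencies

# make -> (frequency, relative frequency, percentage)
table = {make: (f, f / n, 100 * f / n) for make, f in counts.items()}

for make, (f, rel, pct) in table.items():
    print(f"{make:<3} {f:>3} {rel:>6.2f} {pct:>6.1f}")
```

The relative frequencies should sum to 1 and the percentages to 100, which is a quick check on any table of this kind.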
a) Bar Chart
A graph made of bars whose heights represent the frequencies of the respective categories. Such a graph is most helpful when you have many categories to represent. Notice that a gap is inserted between each of the bars. The types of bar chart are:
o simple/vertical bar chart
o horizontal bar chart
o component bar chart
o multiple bar chart
[Figure: bar chart; x-axis: Types of Vehicle]
SQQS1013 Elementary Statistics
To construct a component bar chart, each bar represents all the categories and is divided into components. The height of each component should tally with the frequency it represents.
Example 4
Suppose we want to illustrate the information below, representing the number of people participating in the activities offered by an outdoor pursuits centre during June of three consecutive years.

Year   Climbing   Caving   Walking   Sailing   Total
2004      21        10        75        36      142
2005      34        12        85        36      167
2006      36        21       100        40      197
Solution:
[Figure: component bar chart "Activities Breakdown (June)"; x-axis: Year (2004, 2005, 2006); y-axis: Number of participants (0 to 200); components: Climbing, Caving, Walking, Sailing]
A bar graph for a relative frequency or percentage distribution can be drawn simply by marking the relative frequencies or percentages, instead of the class frequencies, on the vertical axis.
b) Pie Chart
A circle divided into portions that represent the relative frequencies or percentages of a population or a sample belonging to different categories.
An alternative to the bar chart, useful for summarizing a single categorical variable if there are not too many categories.
The chart makes it easy to compare the relative sizes of each class/category.
The whole pie represents the total sample or population. The pie is divided into different portions that represent the different categories.
To construct a pie chart, we multiply 360° by the relative frequency of each category to obtain the degree measure, or size of the angle, for the corresponding category.
Example 5
Movie Genres       Frequency   Relative Frequency   Angle Size
Comedy                 54            0.27           360 x 0.27 = 97.2°
Action                 36            0.18           360 x 0.18 = 64.8°
Romance                28            0.14           360 x 0.14 = 50.4°
Drama                  28            0.14           360 x 0.14 = 50.4°
Horror                 22            0.11           360 x 0.11 = 39.6°
Foreign                16            0.08           360 x 0.08 = 28.8°
Science Fiction        16            0.08           360 x 0.08 = 28.8°
Total                 200            1.00           360°
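The angle sizes in Example 5 follow directly from the rule "360° times the relative frequency". A minimal sketch of that arithmetic, with the frequencies copied from the table and the variable names chosen only for illustration:

```python
# Frequencies from Example 5 (movie genres).
genres = {
    "Comedy": 54, "Action": 36, "Romance": 28, "Drama": 28,
    "Horror": 22, "Foreign": 16, "Science Fiction": 16,
}

total = sum(genres.values())

# Angle for each category = 360 degrees x relative frequency.
angles = {genre: 360 * f / total for genre, f in genres.items()}

for genre, angle in angles.items():
    print(f"{genre:<16} {angle:6.1f} degrees")
```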
c) Line Graph/Time Series Graph
A graph that represents data that occur over a specific period of time.
Line graphs are very popular because their visual characteristics reveal data trends clearly and they are easy to create.
When analyzing the graph, look for a trend or pattern that occurs over the time period. For example, is the line ascending (indicating an increase over time) or descending (indicating a decrease over time)?
Another thing to look for is the slope, or steepness, of the line. A line that is steep over a specific time period indicates a rapid increase or decrease over that period.
Two data sets can be compared on the same graph (called a compound time series graph) if two lines are used.
Data collected on the same element for the same variable at different points in time or for different periods of time are called time series data.
A line graph is a visual comparison of how two variables, shown on the x- and y-axes, are related or vary with each other. It shows related information by drawing a continuous line between all the points on a grid.
Line graphs compare two variables: one is plotted along the x-axis (horizontal) and the other along the y-axis (vertical).
The y-axis in a line graph usually indicates quantity (e.g., RM, number of sales, litres) or percentage, while the horizontal x-axis often measures units of time. As a result, the line graph is often viewed as a time series graph.
Example 6
A transit manager wishes to use the following data for a presentation showing how Port Authority Transit ridership has changed over the years. Draw a time series graph for the data and summarize the findings.
Solution:
The graph shows a decline in ridership through 1992 and then leveling off for the years 1993 and 1994.
EXERCISE 1
Chapter 2: Descriptive Statistics
a. Construct a frequency distribution table.
b. Calculate the relative frequencies and percentages for all categories.
c. Draw a pie chart for the percentage distribution.
2. The frequency distribution table below represents the sales of certain products in ZeeZee Company. For each product, the frequency of sales in a certain period is given. Find the relative frequency and the percentage for each product. Then, construct a pie chart using the information obtained.

Type of Product   Frequency   Relative Frequency   Percentage   Angle Size
A                    13
B                    12
C                     5
D                     9
E                    11
3. Draw a time series graph to represent the data for the number of worldwide airline fatalities for the given years.

Year                1990   1991   1992   1993   1994   1995   1996
No. of fatalities    440    510    990    801    732    557   1132
4. A questionnaire about how people get news resulted in the following information from 25 respondents (N = newspaper, T = television, R = radio, M = magazine).
N R M T T N N M R R R T N M R T M R N N T R N M N
a. Construct a frequency distribution for the data.
b. Construct a bar graph for the data.
5. The given information shows the export and import trade, in million RM, for four months of sales in a certain year. Using the information provided, present the data in a component bar graph.

Month       Export   Import
September     28       20
October       30       28
November      32       17
December      24       14
6. The following information represents the maximum rainfall in millimetres (mm) in each state in Malaysia. You are asked to help a meteorologist in your area make an analysis. Based on your knowledge, present this information using the most appropriate chart and give your comments.

State                               Quantity (mm)
Perlis                                   435
Kedah                                    512
Pulau Pinang                             163
Perak                                    721
Selangor                                 664
Wilayah Persekutuan Kuala Lumpur        1003
Negeri Sembilan                          390
Melaka                                   223
Johor                                    876
Pahang                                  1050
Terengganu                              1255
Kelantan                                 986
Sarawak                                  878
Sabah                                    456
In a stem-and-leaf display of quantitative data, each value is divided into two portions, a stem and a leaf. The leaves for each stem are then shown separately in a display.
The display gives information about the pattern of the data.
It can also reveal which values are repeated frequently.
Example 7
25 36 14 12 13 41 9 11 38 10 12 44 5 31 13 12 28 22 23 37 18 7 6 19
Solution:
A frequency distribution for quantitative data lists all the classes and the number of values that belong to each class.
Data presented in the form of a frequency distribution are called grouped data.
The class boundary is given by the midpoint of the upper limit of one class and the lower limit of the next class. It is also called the real class limit.
To find the midpoint of the upper limit of the first class and the lower limit of the second class, we divide the sum of these two limits by 2.
\[ \text{Class boundary} = \frac{\text{Upper limit of a class} + \text{Lower limit of the next class}}{2} \]
e.g.:
e.g:
Constructing Frequency Distribution Tables
1. To decide the number of classes, we use Sturges' formula, which is
\[ c = 1 + 3.3 \log n \]
where
c is the number of classes
n is the number of observations in the data set.
2. Class width,
\[ i > \frac{\text{Largest value} - \text{Smallest value}}{c} \]
This class width is rounded up to a convenient number.
3. Lower Limit of the First Class or the Starting Point
Use the smallest value in the data set.
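The first two steps can be sketched in a few lines. This is an illustration under stated assumptions: "log" is taken as the base-10 logarithm, the number of classes is rounded up to a whole number, and the class width is the smallest whole number strictly greater than the range divided by c. The values 135 and 242 are hypothetical smallest and largest observations, chosen to match the class limits used later in Table 2.11.

```python
import math

def sturges_classes(n):
    """Number of classes from c = 1 + 3.3 log10(n), rounded up (assumed convention)."""
    return math.ceil(1 + 3.3 * math.log10(n))

def class_width(smallest, largest, c):
    """Smallest whole-number width strictly greater than (largest - smallest) / c."""
    return math.floor((largest - smallest) / c) + 1

n = 30                        # number of observations
smallest, largest = 135, 242  # hypothetical extremes, for illustration only

c = sturges_classes(n)
i = class_width(smallest, largest, c)
print(c, i)
```

With these inputs the first class starts at the smallest value, 135, and each class spans 18 units, giving limits such as 135 - 152 and 153 - 170.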
Example 8
The following data give the total home runs hit by all players of each of the 30 Major League Baseball teams during the 2004 season.

Solution:
i) Number of classes, c
ii) Class width, i >

Frequencies (f) of the six classes, from the tally: 10, 2, 5, 6, 3, 4
\(\sum f = 30\)
Example 9
(Refer to Example 8)
Table 2.11: Relative Frequency and Percentage Distributions

Total Home Runs   Class Boundaries
135 - 152         134.5 to less than 152.5
153 - 170         152.5 to less than 170.5
171 - 188         170.5 to less than 188.5
189 - 206         188.5 to less than 206.5
207 - 224         206.5 to less than 224.5
225 - 242         224.5 to less than 242.5
Total

Example 10
(Refer to Example 8)
b) Polygon
Example 11
Frequency polygon for Table 2.11
For a very large data set, as the number of classes is increased (and the width of classes is decreased), the frequency polygon eventually becomes a smooth curve called a frequency distribution curve or simply a frequency curve.
c) Shape of Histogram
The shape of a histogram is described in the same way as that of a frequency polygon or frequency curve.
The most common shapes are:
(i) Symmetric
(ii) Right skewed
Symmetric histograms
Describing data using graphs helps us gain insight into the main characteristics of the data. When interpreting a graph, we should be very cautious. We should observe carefully whether the frequency axis has been truncated or whether any axis has been unnecessarily shortened or stretched.
Using the frequency distribution of Table 2.11,

Total Home Runs   Class Boundaries             f    Cumulative Frequency
135 - 152         134.5 to less than 152.5    10
153 - 170         152.5 to less than 170.5     2
171 - 188         170.5 to less than 188.5     5
189 - 206         188.5 to less than 206.5     6
207 - 224         206.5 to less than 224.5     3
225 - 242         224.5 to less than 242.5     4
Ogive
An ogive is a curve drawn for the cumulative frequency distribution by joining with straight lines the dots marked above the upper boundaries of classes at heights equal to the cumulative frequencies of the respective classes.
Two types of ogive:
(i) ogive less than
(ii) ogive greater than
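The cumulative frequencies needed for a "less than" ogive are running totals of the class frequencies. A minimal sketch using the home-run classes of Table 2.11; starting the curve at the lower boundary of the first class with height 0 is an assumed plotting convention, and the variable names are illustrative.

```python
from itertools import accumulate

# Upper class boundaries and frequencies from Table 2.11.
upper_boundaries = [152.5, 170.5, 188.5, 206.5, 224.5, 242.5]
frequencies = [10, 2, 5, 6, 3, 4]

# Running totals: number of values less than each upper boundary.
cumulative = list(accumulate(frequencies))

# Points to plot for a "less than" ogive (assumed to start at height 0).
ogive_points = [(134.5, 0)] + list(zip(upper_boundaries, cumulative))

for boundary, cf in ogive_points:
    print(f"Less than {boundary}: {cf}")
```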
Example 13
(Ogive Less Than)

Earnings (RM)   Number of students (f)
30 - 39                  5
40 - 49                  6
50 - 59                  6
60 - 69                  3
70 - 79                  3
80 - 89                  7
Total                   30

Earnings (RM)     Cumulative Frequency
Less than 29.5
Less than 39.5
Less than 49.5
Less than 59.5
Less than 69.5
Less than 79.5
Less than 89.5
[Figure: "less than" ogive; x-axis: Earnings (29.5, 39.5, 49.5, 59.5, 69.5, 79.5, 89.5); y-axis: Cumulative Frequency]

2.3.6 Box-Plot
A box-plot describes the analyzed data graphically using five measures: the smallest value, the first quartile (K1), the second quartile (median or K2), the third quartile (K3) and the largest value.

[Figure: two box-plots, each marking the smallest value, K1, median, K3 and largest value]
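Mean for population data:
\[ \mu = \frac{\sum x}{N} \]
Mean for sample data:
\[ \bar{x} = \frac{\sum x}{n} \]
where:
\(\sum x\) = the sum of all values
N = the population size
n = the sample size
\(\mu\) = the population mean
\(\bar{x}\) = the sample mean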
Example 15
The following data give the prices (rounded to the nearest thousand RM) of five homes sold recently in Sekayang.
158 189 265 127 191
Solution:
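\[ \bar{x} = \frac{\sum x}{n} = \frac{158 + 189 + 265 + 127 + 191}{5} = \frac{930}{5} = 186 \]
Thus, these five homes were sold for an average price of RM186 thousand, that is, RM186,000.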
The mean has the advantage that its calculation includes each value of the data set.
Weighted Mean
\[ \bar{x}_w = \frac{\sum wx}{\sum w} \]
where w is a weight.
Example 16
Consider the data on electrical components purchased from a factory in the table below:

Type    Number of components (w)   Cost/unit (x)
1              1200                   RM3.00
2               500                   RM3.40
3              2500                   RM2.80
4              1000                   RM2.90
5               800                   RM3.25
Total          6000
Solution:
\[ \bar{x}_w = \frac{\sum wx}{\sum w} = \frac{1200(3) + 500(3.4) + 2500(2.8) + 1000(2.9) + 800(3.25)}{1200 + 500 + 2500 + 1000 + 800} = \frac{17800}{6000} = 2.967 \]
The mean cost of a unit of the component is RM2.97.
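The weighted mean of Example 16 can be reproduced with a short calculation. A minimal sketch, with the weights and unit costs copied from the table and variable names chosen for illustration:

```python
# Number of components (weights) and cost per unit in RM, from Example 16.
weights = [1200, 500, 2500, 1000, 800]
costs = [3.00, 3.40, 2.80, 2.90, 3.25]

# Weighted mean = sum(w * x) / sum(w)
weighted_sum = sum(w * x for w, x in zip(weights, costs))
weighted_mean = weighted_sum / sum(weights)

print(f"sum(wx) = {weighted_sum:.0f}, weighted mean = RM{weighted_mean:.2f}")
```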
Median
The median is the value of the middle term in a data set that has been ranked in increasing order.
Procedure for finding the median:
Step 1: Rank the data set in increasing order.
Step 2: Determine the depth (position or location) of the median.
\[ \text{Depth of Median} = \frac{n + 1}{2} \]
Step 3: Determine the value of the median.
Example 17
Find the median for the following data: 10 5 19
Solution:
(1) Rank the data in increasing order
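(2) Determine the depth of the median
\[ \text{Depth of Median} = \frac{n + 1}{2} = \frac{5 + 1}{2} = 3 \]
(3) Determine the value of the median
The median is located in the third position of the ranked data set.

For a data set with n = 6 values:
\[ \text{Depth of Median} = \frac{n + 1}{2} = \frac{6 + 1}{2} = 3.5 \]
The median is the average of the third and fourth values of the ranked data set:
\[ \text{Median} = \frac{8 + 10}{2} = 9 \]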
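The three-step procedure can be sketched as a small function. The two data sets below are hypothetical and chosen only to show an odd and an even number of values; `median_by_depth` is an illustrative name, not part of the notes.

```python
def median_by_depth(values):
    """Return (depth, median) using depth = (n + 1) / 2 on the ranked data."""
    ranked = sorted(values)              # Step 1: rank in increasing order
    depth = (len(ranked) + 1) / 2        # Step 2: depth of the median
    lower = ranked[int(depth) - 1]       # value at the whole-number position
    if depth.is_integer():
        return depth, lower              # odd n: the middle value itself
    upper = ranked[int(depth)]           # even n: average the two middle values
    return depth, (lower + upper) / 2

odd_data = [4, 9, 2, 7, 5]           # hypothetical data, n = 5
even_data = [12, 8, 3, 10, 15, 6]    # hypothetical data, n = 6

print(median_by_depth(odd_data))
print(median_by_depth(even_data))
```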
The median gives the center of a histogram, with half of the data values to the left of (or, less than) the median and half to the right of (or, more than) the median. The advantage of using the median is that it is not influenced by outliers.
Mode
The mode is the value that occurs with the highest frequency in a data set.
A major shortcoming of the mode is that a data set may have none or may have more than one mode. One advantage of the mode is that it can be calculated for both kinds of data, quantitative and qualitative.
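Mean for population data:
\[ \mu = \frac{\sum fx}{N} \]
Mean for sample data:
\[ \bar{x} = \frac{\sum fx}{n} \]
Where x is the midpoint and f is the frequency of a class.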
Example 20
The following table gives the frequency distribution of the number of orders received each day during the past 50 days at the office of a mail-order company. Calculate the mean.

Number of orders    f
10 - 12             4
13 - 15            12
16 - 18            20
19 - 21            14
                   n = 50
Solution:
Because the data set includes only 50 days, it represents a sample. The value of \(\sum fx\) is calculated in the following table:

Number of orders    f     x     fx
10 - 12             4
13 - 15            12
16 - 18            20
19 - 21            14
                   n = 50
Thus, this mail-order company received an average of 16.64 orders per day during these 50 days.
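The x and fx columns of Example 20 can be filled in programmatically. A minimal sketch, taking x as the class midpoint; the variable names are illustrative.

```python
# Class limits and frequencies from Example 20.
classes = [(10, 12), (13, 15), (16, 18), (19, 21)]
freqs = [4, 12, 20, 14]

# x = class midpoint, fx = frequency times midpoint.
midpoints = [(low + high) / 2 for low, high in classes]
fx = [f * x for f, x in zip(freqs, midpoints)]

n = sum(freqs)
mean = sum(fx) / n

for (low, high), f, x, prod in zip(classes, freqs, midpoints, fx):
    print(f"{low} - {high}  f={f:<3} x={x:<5} fx={prod}")
print(f"sum(fx) = {sum(fx)}, mean = {mean}")
```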
Median
Step 1: Construct the cumulative frequency distribution.
Step 2: Decide the class that contains the median. The class median is the first class whose cumulative frequency is at least n/2.
Step 3: Find the median by using the following formula:
\[ \text{Median} = L_m + \left( \frac{\frac{n}{2} - F}{f_m} \right) i \]
Where:
n = the total frequency
F = the cumulative frequency before the class median
i = the class width
\(L_m\) = the lower boundary of the class median
\(f_m\) = the frequency of the class median

Example 21
Based on the grouped data below, find the median:

Time to travel to work   Frequency
 1 - 10                      8
11 - 20                     14
21 - 30                     12
31 - 40                      9
41 - 50                      7
Solution:
1st Step: Construct the cumulative frequency distribution

Time to travel to work   Frequency   Cumulative Frequency
 1 - 10                      8
11 - 20                     14
21 - 30                     12
31 - 40                      9
41 - 50                      7
Thus, 25 persons take less than 23 minutes to travel to work and another 25 persons take more than 23 minutes to travel to work.
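The grouped-median steps can be traced with a short function. This is a sketch under the assumption that class limits are whole numbers, so each lower boundary is the lower limit minus 0.5; `grouped_median` is an illustrative name.

```python
def grouped_median(lower_limits, freqs, width):
    """Median = L_m + ((n/2 - F) / f_m) * i for grouped data."""
    n = sum(freqs)
    cumulative = 0                       # F: cumulative frequency before the class
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= n / 2:      # first class with cumulative frequency >= n/2
            boundary = lower - 0.5       # lower boundary (assumes whole-number limits)
            return boundary + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("no classes given")

# Time to travel to work, from Example 21.
lower_limits = [1, 11, 21, 31, 41]
freqs = [8, 14, 12, 9, 7]

median = grouped_median(lower_limits, freqs, width=10)
print(median)
```

Here the class median is 21 - 30, with lower boundary 20.5, F = 22 and f = 12.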
Mode
Mode is the value that has the highest frequency in a data set. For grouped data, class mode (or, modal class) is the class with the highest frequency.
\[ \text{Mode} = L_{mo} + \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) i \]
Where:
\(L_{mo}\) is the lower boundary of the class mode
\(\Delta_1\) is the difference between the frequency of the class mode and the frequency of the class before the class mode
\(\Delta_2\) is the difference between the frequency of the class mode and the frequency of the class after the class mode
i is the class width
Example 22
Based on the grouped data below, find the mode.

Time to travel to work   Frequency
 1 - 10                      8
11 - 20                     14
21 - 30                     12
31 - 40                      9
41 - 50                      7
Solution:
Based on the table,
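The mode formula can be applied to the table as follows. This is a sketch with two assumptions: lower boundaries are taken as the lower limit minus 0.5, and a missing neighbouring class (when the class mode is first or last) is treated as having frequency 0.

```python
# Time to travel to work, from Example 22.
lower_limits = [1, 11, 21, 31, 41]
freqs = [8, 14, 12, 9, 7]
width = 10

m = freqs.index(max(freqs))                          # position of the class mode
before = freqs[m - 1] if m > 0 else 0                # frequency of the class before
after = freqs[m + 1] if m < len(freqs) - 1 else 0    # frequency of the class after

delta1 = freqs[m] - before
delta2 = freqs[m] - after
lower_boundary = lower_limits[m] - 0.5               # assumes whole-number class limits

mode = lower_boundary + delta1 / (delta1 + delta2) * width
print(mode)
```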
Mean, median, and mode for a symmetric histogram and frequency distribution curve
(2)
For a histogram and a frequency curve skewed to the right, the value of the mean is the largest, that of the mode is the smallest, and the value of the median lies between these two.
Mean, median, and mode for a histogram and frequency distribution curve skewed to the right
(3)
For a histogram and a frequency curve skewed to the left, the value of the mean is the smallest and that of the mode is the largest and the value of the median lies between these two.
Mean, median, and mode for a histogram and frequency distribution curve skewed to the left
Two data sets with the same mean may have completely different spreads. The variation among the values of observations for one data set may be much larger or smaller than for the other data set.
Example 23
Solution:
Range = Largest value - Smallest value
      = 267 277 - 49 651
      = 217 626
Disadvantages:
o It is influenced by outliers.
o It is based on two values only. All other values in the data set are ignored.
Standard deviation is the most commonly used measure of dispersion. The value of the standard deviation tells how closely the values of a data set are clustered around the mean.
A lower value of the standard deviation indicates that the values of the data set are spread over a relatively smaller range around the mean.
A larger value of the standard deviation indicates that the values of the data set are spread over a relatively larger range around the mean (far from the mean).
The standard deviation is obtained by taking the positive square root of the variance.
Variance for population:
\[ \sigma^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N} \]
Variance for sample:
\[ s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} \]
Example 24
Let x denote the total production (in units) of a company.

Company   Production
A             62
B             93
C            126
D             75
E             34

Find the variance and standard deviation.

Solution:

Company     x      x^2
A           62
B           93
C          126
D           75
E           34
Total      390
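The x^2 column and both variance formulas can be evaluated for the production data of Example 24. The example does not say whether the five companies form a population or a sample, so this sketch computes both; the variable names are illustrative.

```python
import math

# Production (in units) of companies A to E, from Example 24.
production = [62, 93, 126, 75, 34]

n = len(production)
sum_x = sum(production)
sum_x2 = sum(x * x for x in production)

# Shared numerator of both shortcut formulas: sum(x^2) - (sum x)^2 / n
numerator = sum_x2 - sum_x ** 2 / n

population_variance = numerator / n        # divide by N
sample_variance = numerator / (n - 1)      # divide by n - 1

print(sum_x, sum_x2)
print(population_variance, math.sqrt(population_variance))
print(sample_variance, math.sqrt(sample_variance))
```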
Range for grouped data = Upper boundary of the last class - Lower boundary of the first class

Class       Frequency
41 - 50         1
51 - 60         3
61 - 70         7
71 - 80        13
81 - 90        10
91 - 100        6
Total          40

Upper boundary of the last class = 100.5
Lower boundary of the first class = 40.5
Range = 100.5 - 40.5 = 60
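Variance for population:
\[ \sigma^2 = \frac{\sum fx^2 - \frac{(\sum fx)^2}{N}}{N} \]
Variance for sample:
\[ s^2 = \frac{\sum fx^2 - \frac{(\sum fx)^2}{n}}{n - 1} \]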
Example 25
Find the variance and standard deviation for the following data:

No. of orders    f
10 - 12          4
13 - 15         12
16 - 18         20
19 - 21         14
Solution:
No. of orders    f     x     fx     fx^2
10 - 12          4
13 - 15         12
16 - 18         20
19 - 21         14
Total           n = 50
Variance,
Standard Deviation,
Thus, the standard deviation of the number of orders received at the office of this mail-order company during the past 50 days is 2.75.
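The fx and fx^2 columns of Example 25 and the resulting sample variance can be checked as below. A minimal sketch, treating the 50 days as a sample (as the earlier mean example does) and taking x as the class midpoint.

```python
import math

# Class limits and frequencies from Example 25.
classes = [(10, 12), (13, 15), (16, 18), (19, 21)]
freqs = [4, 12, 20, 14]

midpoints = [(low + high) / 2 for low, high in classes]
n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))

# Sample variance: (sum(fx^2) - (sum fx)^2 / n) / (n - 1)
variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
std_dev = math.sqrt(variance)

print(sum_fx, sum_fx2, round(variance, 4), round(std_dev, 2))
```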
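The coefficient of variation (CV) expresses the standard deviation as a percentage of the mean:
\[ CV = \frac{s}{\bar{x}} \times 100\% \]
Example 26
The mean and standard deviation of the monthly salary for two groups of workers at ABC Company are as follows. Group 1: 700 and 20. Group 2: 1070 and 20. Find the CV for each group and determine which group is more dispersed.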
Solution:
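The comparison asked for in this example can be sketched as follows, assuming the usual definition of the coefficient of variation as the standard deviation divided by the mean, times 100. The function name is illustrative.

```python
def coefficient_of_variation(mean, std_dev):
    """CV as a percentage: standard deviation relative to the mean."""
    return std_dev / mean * 100

cv_group1 = coefficient_of_variation(mean=700, std_dev=20)
cv_group2 = coefficient_of_variation(mean=1070, std_dev=20)

print(f"Group 1: {cv_group1:.2f}%")
print(f"Group 2: {cv_group2:.2f}%")

# The group with the larger CV is relatively more dispersed.
more_dispersed = "Group 1" if cv_group1 > cv_group2 else "Group 2"
print(more_dispersed)
```

Although both groups have the same standard deviation, the group with the smaller mean has the larger relative dispersion.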
A measure of position determines the position of a single value in relation to other values in a sample or a population data set.
Quartiles
Quartiles are three summary measures that divide a ranked data set into four equal parts.
o The 1st quartile is denoted as Q1
o The 2nd quartile is the median of the data set, or Q2
o The 3rd quartile is denoted as Q3
\[ \text{Depth of } Q_1 = \frac{n + 1}{4} \]
\[ \text{Depth of } Q_3 = \frac{3(n + 1)}{4} \]

Example 27
The table below lists the total revenue for the 11 top tourism companies in Malaysia.
109.7  79.9  121.2  76.4  80.2  82.1  79.4  89.3  98.0  103.5  86.8

Solution:
Step 1: Arrange the data in increasing order
76.4  79.4  79.9  80.2  82.1  86.8  89.3  98.0  103.5  109.7  121.2
Step 2: Determine the depth for Q1 and Q3
\[ \text{Depth of } Q_1 = \frac{n + 1}{4} = \frac{11 + 1}{4} = 3 \]
\[ \text{Depth of } Q_3 = \frac{3(n + 1)}{4} = \frac{3(11 + 1)}{4} = 9 \]
Step 3: Determine Q1 and Q3
76.4  79.4  79.9  80.2  82.1  86.8  89.3  98.0  103.5  109.7  121.2
Q1 is the 3rd value, 79.9, and Q3 is the 9th value, 103.5.
Example 28
The total revenue for 12 tourism companies:
109.7  79.9  74.1  121.2  76.4  80.2  82.1  79.4  89.3  98.0  103.5  86.8

Solution:
Step 1: Arrange the data in increasing order
74.1  76.4  79.4  79.9  80.2  82.1  86.8  89.3  98.0  103.5  109.7  121.2
Step 2: Determine the depth for Q1 and Q3
\[ \text{Depth of } Q_1 = \frac{n + 1}{4} = \frac{12 + 1}{4} = 3.25 \]
\[ \text{Depth of } Q_3 = \frac{3(n + 1)}{4} = \frac{3(12 + 1)}{4} = 9.75 \]
Step 3: Determine Q1 and Q3
74.1  76.4  79.4  79.9  80.2  82.1  86.8  89.3  98.0  103.5  109.7  121.2
Q1 = 79.4 + 0.25 (79.9 - 79.4) = 79.525
Q3 = 98.0 + 0.75 (103.5 - 98.0) = 102.125
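The depth rule with interpolation between neighbouring ranked values can be sketched as below. The function names are illustrative; the two data sets are the 11-value and 12-value revenue lists from the worked examples.

```python
def value_at_depth(ranked, depth):
    """Value at a (possibly fractional) depth in ranked data, by linear interpolation."""
    whole = int(depth)
    frac = depth - whole
    value = ranked[whole - 1]            # depth positions are counted from 1
    if frac > 0:
        value += frac * (ranked[whole] - ranked[whole - 1])
    return value

def quartiles(values):
    """Return (Q1, Q3) using depths (n + 1) / 4 and 3(n + 1) / 4."""
    ranked = sorted(values)
    n = len(ranked)
    return value_at_depth(ranked, (n + 1) / 4), value_at_depth(ranked, 3 * (n + 1) / 4)

revenue_11 = [76.4, 79.4, 79.9, 80.2, 82.1, 86.8, 89.3, 98.0, 103.5, 109.7, 121.2]
revenue_12 = [74.1] + revenue_11

print(quartiles(revenue_11))
print(quartiles(revenue_12))
```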
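Interquartile Range
The difference between the third quartile and the first quartile for a data set.
\[ IQR = Q_3 - Q_1 \]
Example 29

For grouped data, the quartiles are given by:
\[ Q_1 = L_{Q_1} + \left( \frac{\frac{n}{4} - F}{f_{Q_1}} \right) i \]
\[ Q_3 = L_{Q_3} + \left( \frac{\frac{3n}{4} - F}{f_{Q_3}} \right) i \]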
Example 30
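Refer to Example 22. Find Q1 and Q3.
Solution:
1st Step: Construct the cumulative frequency distribution

Time to travel to work   Frequency   Cumulative Frequency
 1 - 10                      8               8
11 - 20                     14              22
21 - 30                     12              34
31 - 40                      9              43
41 - 50                      7              50

Class Q1: n/4 = 50/4 = 12.5, so class Q1 is 11 - 20, the first class whose cumulative frequency is at least 12.5.
Therefore,
\[ Q_1 = 10.5 + \left( \frac{12.5 - 8}{14} \right) 10 = 13.7143 \]
Class Q3: 3n/4 = 3(50)/4 = 37.5, so class Q3 is 31 - 40, the first class whose cumulative frequency is at least 37.5.
Therefore,
\[ Q_3 = 30.5 + \left( \frac{37.5 - 34}{9} \right) 10 = 34.3889 \]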
Example 31
Solution:
\[ IQR = Q_3 - Q_1 = 34.3889 - 13.7143 = 20.6746 \]
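The grouped quartiles and the interquartile range above can be reproduced with one helper that locates the class containing a given cumulative position. This is a sketch assuming whole-number class limits (lower boundary = lower limit minus 0.5); `grouped_position_value` is an illustrative name.

```python
def grouped_position_value(lower_limits, freqs, width, position):
    """L + ((position - F) / f) * i for the first class whose cumulative frequency >= position."""
    cumulative = 0                       # F: cumulative frequency before the class
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= position:
            return (lower - 0.5) + (position - cumulative) / f * width
        cumulative += f
    raise ValueError("position exceeds the total frequency")

# Time to travel to work (Example 22 data).
lower_limits = [1, 11, 21, 31, 41]
freqs = [8, 14, 12, 9, 7]
n = sum(freqs)

q1 = grouped_position_value(lower_limits, freqs, 10, n / 4)
q3 = grouped_position_value(lower_limits, freqs, 10, 3 * n / 4)
iqr = q3 - q1

print(round(q1, 4), round(q3, 4), round(iqr, 4))
```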
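The skewness coefficient is used to determine the skewness of data (symmetric, left skewed or right skewed). It is also called the Pearson Coefficient of Skewness.
\[ S_k = \frac{\text{Mean} - \text{Mode}}{s} \quad \text{or} \quad S_k = \frac{3(\text{Mean} - \text{Median})}{s} \]
If Sk = 0, the distribution is symmetric.
If Sk takes a value between (-0.9999, -0.0001) or (0.0001, 0.9999), the distribution is approximately symmetric.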
Example 32
The durations of stay of cancer patients warded in Hospital Seberang Jaya are recorded in a frequency distribution. From the record, the mean is 28 days, the median is 25 days and the mode is 23 days. The standard deviation is 4.2 days.
a. What is the type of distribution?
b. Find the skewness coefficient.
Solution:
This distribution is right skewed because the mean is the largest value
Sk =
Sk =
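The two standard Pearson coefficients of skewness can be evaluated for the values given in Example 32. Which of the two forms the example intends is an assumption here, so this sketch computes both: (mean - mode) / s and 3(mean - median) / s.

```python
# Summary statistics from Example 32 (days).
mean, median, mode, std_dev = 28, 25, 23, 4.2

# Pearson's first coefficient: based on the mode.
sk_mode = (mean - mode) / std_dev

# Pearson's second coefficient: based on the median.
sk_median = 3 * (mean - median) / std_dev

print(round(sk_mode, 2), round(sk_median, 2))
```

Both coefficients are positive, which agrees with the conclusion that the distribution is right skewed.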
ADDITIONAL INFORMATION
Use of Standard Deviation
1. Chebyshev's Theorem
According to Chebyshev's Theorem, for any number k greater than 1, at least \(1 - \frac{1}{k^2}\) of the data values lie within k standard deviations of the mean.
Thus, for example, if k = 2, then
\[ 1 - \frac{1}{k^2} = 1 - \frac{1}{(2)^2} = 0.75 \text{ or } 75\% \]
Therefore, according to Chebyshev's Theorem, at least 75% of the values of a data set lie within two standard deviations of the mean.
2. Empirical Rule
For a bell-shaped distribution, approximately
1. 68% of the observations lie within one standard deviation of the mean.
2. 95% of the observations lie within two standard deviations of the mean.
3. 99.7% of the observations lie within three standard deviations of the mean.
Measure of Position
1. Ungrouped Data - Quartile Deviation
The quartile deviation (QD) is half of the interquartile range.
It is used to compare the dispersion of two data sets.
If the QD value is high, the data are more dispersed.
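2. Percentiles
\[ P_k = \text{value of the } \left( \frac{kn}{100} \right) \text{th term in a ranked data set} \]
Where:
k = the number of the percentile
n = the sample size
\[ \text{Percentile rank of } x_i = \frac{\text{Number of values less than } x_i}{\text{Total number of values in the data set}} \times 100 \]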
EXERCISE 2
1. A survey research company asks 100 people how many times they have been to the dentist in the last five years. Their grouped responses appear below.

Number of Visits   Number of Responses
 0 - 4                    16
 5 - 9                    25
10 - 14                   48
15 - 19                   11
2. A researcher asked 25 consumers: "How much would you pay for a television adapter that provides Internet access?" Their grouped responses are as follows:

Amount ($)   Number of Responses
  0 - 99              2
100 - 199             2
200 - 249             3
250 - 299             3
300 - 349             6
350 - 399             3
400 - 499             4
500 - 999             2
3. The following data give the pairs of shoes sold per day by a particular shoe store in the last 20 days.
85 89 90 86 89 71 70 76 79 77
80 89 83 70 83 65 75 90 76 86
Calculate the
a. mean and interpret the value.
b. median and interpret the value.
c. mode and interpret the value.
d. standard deviation.
4. The following data show the serving time (in minutes) for 40 customers in a post office:
2.0 4.5 2.5 2.9 4.2 2.9 3.5 3.2
2.9 4.0 3.0 3.8 2.5 2.3 2.1 3.1
3.6 4.3 4.7 2.6 4.1 4.6 2.8 5.1
2.7 2.6 4.4 3.5 2.7 3.9 2.9 2.9
2.5 3.7 3.3 2.8 3.5 3.1 3.0 2.4
a. Construct a frequency distribution table with a class width of 0.5.
b. Construct a histogram.
c. Calculate the mode and median of the data.
d. Find the mean of the serving time.
e. Determine the skewness of the data.
f. Find the first and third quartile values of the data.
g. Determine the value of the interquartile range.
5. In a survey of a class of final semester students, the following data were obtained for the number of text books owned.

Number of students   Number of text books owned
       12                       5
        9                       5
       11                       3
       15                       2
       10                       1
        8                       0

Find the average number of text books for the class. Use the weighted mean.
6. The following data represent the ages of 15 people buying lift tickets at a ski area.
15 30 25 53 26 28 17 40 38 20 16 35 60 31 21
Calculate the quartiles and the interquartile range.
7. A student scores 60 on a mathematics test that has a mean of 54 and a standard deviation of 3, and she scores 80 on a history test with a mean of 75 and a standard deviation of 2. On which test did she perform better?
8. The following table gives the distribution of the share prices for ABC Company, which was listed in BSKL in 2005.

Price (RM)   Frequency
12 - 14          5
15 - 17         14
18 - 20         25
21 - 23          7
24 - 26          6
27 - 29          3