
Data Management


STATISTICS

A Review

Libeeth B. Guevarra

July 24, 2019


Presentation Outline

Introduction

Measures of Central Tendency

Measures of Dispersion

Measures of Relative Position

Frequency Distribution

The Normal Distribution

Correlation and Regression


STATISTICS is the art of learning from data.
It is a branch of knowledge which deals with the
collection, presentation, analysis and
interpretation of data that are subject to
variability.

According to W.A. Wallis, it may be defined as a
body of methods for making wise decisions in the
face of uncertainty.
Areas of Statistics

Descriptive Statistics pertains to the methods


dealing with the collection, organization and
analysis of a set of data without making
conclusions, predictions or inferences about a
larger set.
Example
Presentation of the Trend of Mortality from
Suicide and Self-inflicted Injuries in the
Philippines from 1980-2000.
Inferential Statistics pertains to the methods
dealing with making inferences, estimation or
prediction about a larger set of data (population)
using the information gathered from a subset of
this larger set (sample).
Example
The fuel efficiency of a car make and model is
determined by sampling a few cars.
Basic Statistical Terms

Universe or physical population is the set of all


individuals or entities under consideration or
study.

Study: The manager would like to determine the


average age of customers purchasing whitening
lotion for the month of April.
U = all customers purchasing whitening lotion
Variable is a characteristic or attribute of persons
or objects which assumes different values or labels.
It is what we measure, control or manipulate in
research, and it may vary from unit to unit.
If it can assume only one value, then it is called a
constant.
Classification of data:
• Qualitative Data (categorical)
Example
Marital Status, Socio-Economic Status, Religious
Sector, zip code, and military rank
• Quantitative Data (either Discrete or
Continuous)
Example
number of students in a classroom, weight and
height of a respondent, and monthly income of
managers
Statistical Population is the collection of all cases
in which the researcher is interested in a
statistical study.
The numerical measures that describe it are
called parameters.
Sample is a portion or a subset of the population
from which the information is gathered.
The numerical measures that describe it are
called statistics.
Some of the statistical measures and symbols are
presented in the table.

Descriptive Measure               Parameter   Statistic
Mean                              µ           x̄
Standard Deviation                σ           s
Variance                          σ²          s²
Pearson Correlation Coefficient   ρ           r
Number of Cases                   N           n
Levels of Measurement
1 Nominal
Examples: gender, race, color, and savings
account number.
2 Ordinal
Examples: socioeconomic status of families,
Class Standing (A to D), and Teacher’s
Evaluation (Excellent to Poor).
3 Interval
Examples: temperature, score in an exam,
and IQ.
4 Ratio
Examples: time, height, weight, width, area, age,
and monthly income.
Methods of Data Collection
1 Observation method

2 Experimental method

3 Use of existing studies

4 Registration method

5 Survey method
Sampling Technique

In Probability Sampling every member of the
population has a known chance of being chosen
for the sample.
1 Simple Random Sampling

2 Systematic Sampling

3 Stratified Sampling

4 Cluster Sampling

5 Multi-stage Sampling
Non-probability sampling
1 Haphazard or Accidental Sampling

2 Purposive Sampling

3 Quota Sampling

4 Convenience Sampling
Organizing Data
1 Textual Method
2 Tabular Method
Parts of a Statistical Table
1 Table Heading includes the table number and
the title of the table
2 Body is the main part of the table that contains
the information or figures
3 Stubs or Classes are the classification or
categories describing the data and usually found
at the left most side of the table.
4 Caption is a designation or identification of the
information contained in a column, usually found
at the top most of the column.
3 Graphical Method
Categorical Distribution

Twenty-five inductees were given a blood test to
determine their blood type.

A B B AB O
B AB B B B
O A O O O
AB AB A O B
O O O B A
Table 1: Blood type of the 25 inductees

Class   Tally         Frequency   Percent
A       ||||          4           16
B       ||||| |||     8           32
O       ||||| ||||    9           36
AB      ||||          4           16

More people have type O blood than any other


type.
Graphical
Pie Chart is used to visually depict qualitative
data. It is a circle divided into sections according
to the percentage of frequencies in each category
of the distribution.
Bar Graph represents the data by using vertical
or horizontal bars whose heights or lengths
represent the frequencies of the data.
Time Series Graph shows data that have
been collected at different points in time.
Line Graph is used to show a trend (an increase or
decrease) in quantitative data.
Pareto Chart is a type of chart that contains both
a bar graph and a line graph, where individual
values are represented in descending order by bars
and the cumulative total is represented by the line.
Measures of Central Tendency

Measures of Central Tendency indicate the
center of a set of data arranged in increasing or
decreasing order of magnitude.
There are three common measures of central
tendency:
• Mean
• Median
• Mode
The mean is the most commonly used measure of
central location.
It is the sum of all the values of the observations
divided by the number of observations.
The sample mean, symbolized as x̄, is used
to estimate the population mean µ.

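\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \tag{1} \]
\[ \bar{x} = \frac{\sum_{i=1}^{n} w_i \cdot x_i}{\sum_{i=1}^{n} w_i} \tag{2} \]
\[ \bar{x} = \frac{\sum_{i=1}^{n} f_i \cdot x_i}{n} \tag{3} \]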
Example
The heights (in meters) of the sampled mountains in the
Philippines are provided as follows in the table below.
What is the mean height of these mountains?
(http://www.pinoymountaineer.com)

Mountain Height (meters)


Mt. Apo 2956
Mt. Dulang-Dulang 2938
Mt. Pulag 2922
Mt. Kalatungan 2860
Mt. Tabayoc 2842

Example
Out of 100 numbers, 20 were 5’s, 40 were 4’s, 35 were 7’s,
and 5 were 3’s. What is the mean of the data set?
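The two examples above can be worked with a few lines of Python. This is only a sketch: the helper names `mean` and `frequency_mean` are made up for illustration, and they follow equations (1) and (3) with each value paired with its frequency.

```python
# Heights (in meters) of the five sampled mountains from the table above.
heights = [2956, 2938, 2922, 2860, 2842]


def mean(values):
    """Equation (1): sum of the observations divided by their count."""
    return sum(values) / len(values)


def frequency_mean(freq):
    """Equation (3): freq maps each value x_i to its frequency f_i."""
    n = sum(freq.values())
    return sum(f * x for x, f in freq.items()) / n


mountain_mean = mean(heights)
# 20 fives, 40 fours, 35 sevens, and 5 threes (100 numbers in all).
numbers_mean = frequency_mean({5: 20, 4: 40, 7: 35, 3: 5})

print(mountain_mean)  # about 2903.6 meters
print(numbers_mean)   # about 5.2
```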
Median of the data set is the middle or center
observation when the data set is arranged in
either increasing or decreasing order.

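\[ \tilde{x} = x_{\frac{n+1}{2}} \quad \text{if } n \text{ is odd} \tag{4} \]
\[ \tilde{x} = \frac{x_{\frac{n}{2}} + x_{\frac{n+2}{2}}}{2} \quad \text{if } n \text{ is even} \tag{5} \]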

Example
Find the Median of : 9, 3, 44, 17, 15
Example
Find the Median of : 8, 3, 44, 17, 12, 6
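A minimal sketch of equations (4) and (5) applied to the two examples above. The function name `median` is chosen here for illustration; Python lists are 0-based, so the 1-based positions in the formulas shift by one.

```python
def median(values):
    """Middle observation of the sorted data (average of the two middle ones if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n + 1) / 2 is index mid.
        return ordered[mid]
    # Even n: average the observations at 1-based positions n/2 and (n + 2)/2.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_median = median([9, 3, 44, 17, 15])       # sorted: 3, 9, 15, 17, 44
even_median = median([8, 3, 44, 17, 12, 6])   # sorted: 3, 6, 8, 12, 17, 44

print(odd_median)   # 15
print(even_median)  # 10.0
```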
Mode of a set of data is the value that occurs
most frequently. The mode is a more helpful measure
for discrete and qualitative types of data, and it is
the only measure of central location helpful for
qualitative data. In some data sets the mode does
not exist, and if it does, it may not be
unique. The mode is not very useful for continuous
data, since the measurements are precise to a
significant digit and most values would occur only once.
Example
Find the Mode of the following set of data:
A : 9, 3, 4, 17, 15, 3
B : 9, 3, 4, 17, 15, 3, 9
C : A+, AB, A, O, B, B+, A
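One way to check the three sets above is `statistics.multimode` from the Python standard library (available from Python 3.8), which returns every most frequent value and therefore also handles data with more than one mode. The variable names are illustrative only.

```python
from statistics import multimode

set_a = [9, 3, 4, 17, 15, 3]
set_b = [9, 3, 4, 17, 15, 3, 9]
set_c = ["A+", "AB", "A", "O", "B", "B+", "A"]

# multimode returns a list of all values that share the highest frequency.
modes_a = multimode(set_a)
modes_b = multimode(set_b)
modes_c = multimode(set_c)

print(modes_a)  # one mode: 3
print(modes_b)  # two modes: 9 and 3 (bimodal)
print(modes_c)  # one mode: "A"
```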
Give what is being asked
1 The grades of a student on seven examinations were
85, 96, 72, 89, 95, 82, and 85. Find the student’s mean
grade.
2 Find the median of the set of numbers: 15, 18, 50, 12,
16, and 20.
3 The numbers of incorrect answers on a true-false test
for 15 students were recorded as follows: 2, 1, 3, 0, 1,
3, 6, 0, 3, 3, 5, 2, 1, 5, 3. Find the median and mode.
4 Marcelo B. Fernan’s bridge is designed to carry a
maximum load of 150,000 tons. Is the bridge
overloaded if it carries 18 vehicles having a mean
weight of 5,000 tons?
5 The average IQ of 10 students in Stat 012 is 115. If
2 students have an IQ of 101, 3 have an IQ of 125, 1
has an IQ of 130, and 3 have an IQ of 98, what must
be the IQ of the remaining student?
Measures of Dispersion

Suppose that a hospital’s cardiology unit is
evaluating two types of pacemaker batteries.
The data below are the numbers of hours (in
thousands) each battery lasted.
A: 45; 46; 45.8; 44.5; 45.7; 47.3; 44.3; 41.4
B: 47; 50; 41.3; 35.1; 40.9; 36.9; 50.8; 66
Should the cardiologist use battery A or battery
B?
Measures of dispersion indicate the degree to
which numerical data tend to spread about the
mean. They are used to determine the extent of the
scatter so that steps may be taken to control the
existing variation. They also serve as a measure of
the reliability of the average value.
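To see why the spread matters for the pacemaker question, the two battery samples can be compared with the standard library's `statistics` module. This is a sketch that treats each list as a sample (so `stdev` divides by n − 1); it is one reasonable way to compare the batteries, not necessarily the approach intended in the original slides.

```python
from statistics import mean, stdev

# Battery lifetimes in thousands of hours, copied from the example above.
battery_a = [45, 46, 45.8, 44.5, 45.7, 47.3, 44.3, 41.4]
battery_b = [47, 50, 41.3, 35.1, 40.9, 36.9, 50.8, 66]

mean_a, mean_b = mean(battery_a), mean(battery_b)
sd_a, sd_b = stdev(battery_a), stdev(battery_b)  # sample standard deviations

print(f"A: mean = {mean_a:.2f}, sd = {sd_a:.2f}")
print(f"B: mean = {mean_b:.2f}, sd = {sd_b:.2f}")
```

The means are close (about 45 and 46 thousand hours), but battery B is far more variable, so its average is a less reliable guide to how long any one battery will last.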
General Classifications of Measures of
Dispersion
1 Measures of Absolute Dispersion

2 Measures of Relative Dispersion


Measures of Absolute Dispersion

The measures of absolute dispersion are


expressed in the units of the original observations.

Common Measures

Range is the difference between the highest score


and the lowest score.
Example
The IQ scores of 5 Accountancy students are 108,
112, 130, 115, and 105. Find the range.
Variance is the average squared deviation of the
observations from the mean.
\[ \sigma^2 = \frac{\sum_{i=1}^{N} f_i (x_i - \mu)^2}{N} \tag{6} \]
\[ s^2 = \frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{n-1} \tag{7} \]
Standard Deviation is the positive square root of
the variance.
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} f_i (x_i - \mu)^2}{N}} \tag{8} \]
\[ s = \sqrt{\frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{n-1}} \tag{9} \]
Example
Let A= 5, 5, 5, 5, 5, 5, 5, 5
B = 4, 4, 4, 5, 5, 5, 5, 6, 6, 6
C = 0, 0 , 0 , 0 , 10, 10, 10 , 10
D = 5, 7, 10, 11, 11, 15, 16, 20

Compute the range, standard deviation and


variance.
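The sketch below computes the range, variance, and standard deviation for sets A to D directly from equations (6) to (9), with every frequency f_i taken as 1 because the raw values are listed one by one. The function names are illustrative, and both the population and the sample versions are shown since the exercise does not say which is wanted.

```python
from math import sqrt


def data_range(values):
    """Highest value minus lowest value."""
    return max(values) - min(values)


def population_variance(values):
    """Equation (6) with f_i = 1: divide by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Equation (7) with f_i = 1: divide by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


data_sets = {
    "A": [5, 5, 5, 5, 5, 5, 5, 5],
    "B": [4, 4, 4, 5, 5, 5, 5, 6, 6, 6],
    "C": [0, 0, 0, 0, 10, 10, 10, 10],
    "D": [5, 7, 10, 11, 11, 15, 16, 20],
}

summary = {}
for name, values in data_sets.items():
    pop_var = population_variance(values)
    samp_var = sample_variance(values)
    summary[name] = {
        "range": data_range(values),
        "population_variance": pop_var,
        "population_sd": sqrt(pop_var),   # equation (8)
        "sample_variance": samp_var,
        "sample_sd": sqrt(samp_var),      # equation (9)
    }
    print(name, summary[name])
```

Set A has no spread at all, while set C has the same mean as A and B but the largest spread of the three.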
The monthly water consumption of a household in a certain
subdivision (in thousands of liters) was recorded for the year
1993. Compute the range, sample variance, and
population standard deviation.
Month Consumption
1 14.22
2 12.41
3 14.55
4 13.88
5 16.34
6 15.05
7 11.95
8 12.98
9 14.25
10 14.52
11 14.87
12 10.89
Measures of Relative Position

Quantiles or Fractiles are a natural extension of
the median concept in that they are values which
divide a set of data into equal parts.
These are used to describe the standing or place
occupied by a data value relative to the rest of
the data.
Common Quantiles
1 Quartiles Qm divide the set of data into 4
equal parts.
2 Deciles Dm divide the set of data into 10
equal parts.
3 Percentiles Pm divide the set of data into
100 equal parts.


Percentile Ranking

The pth Percentile


A value x is called the pth percentile of a data
set, provided that p% of the data values are less
than or equal to x.
\[ \text{Percentile of } x = \frac{(\text{number of data values less than } x) + 0.5}{\text{total number of data values}} \cdot 100 \]

A teacher gives a 20-point test to 10 students. The scores
are as follows: 10, 20, 3, 5, 6, 8, 18, 12, 15 and 2.
Find the percentile rank of a score of 12.
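The percentile formula above translates directly into code. In this sketch the function name `percentile_rank` is an illustrative choice; it counts the values strictly below x, adds 0.5, and scales by the number of data values.

```python
scores = [10, 20, 3, 5, 6, 8, 18, 12, 15, 2]


def percentile_rank(values, x):
    """(number of values less than x + 0.5) / (total number of values) * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) / len(values) * 100


rank_of_12 = percentile_rank(scores, 12)
print(rank_of_12)  # six scores are below 12, so (6 + 0.5) / 10 * 100 = 65
```

A score of 12 is therefore at about the 65th percentile of this class.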
Quartile Ranking

Quartiles are values that divide a set of data into
4 equal parts, denoted by Q1, Q2, and Q3.
Example
A teacher gives a 20-point test to 10 students.
The scores are as follows: 10, 20, 3, 5, 6, 8, 18,
12, 15 and 2.
Find the quartiles of the given scores.
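The slides do not state which quartile convention they use, and textbooks differ. The sketch below assumes the common "median of each half" method: Q2 is the median, Q1 is the median of the lower half, and Q3 is the median of the upper half. Other conventions (for example those in `statistics.quantiles`) can give slightly different values.

```python
from statistics import median

scores = [10, 20, 3, 5, 6, 8, 18, 12, 15, 2]


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method (needs at least 2 values)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the median position
    upper = ordered[-half:]   # values above it (middle value excluded if n is odd)
    return median(lower), median(ordered), median(upper)


q1, q2, q3 = quartiles(scores)
print(q1, q2, q3)  # sorted data: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20
```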
THE STANDARD NORMAL RANDOM VARIABLE
A normal random variable x is standardized by
expressing its value as the number of standard
deviations σ it lies to the left or right of its mean
µ. The standardized normal random variable z is
defined as
\[ z = \frac{x - \mu}{\sigma} \]
Example
A basketball player Carl is 78 inches tall and a
volleyball player Jane is 76 inches tall. Carl is
obviously taller by 2 inches, but which player is
relatively taller? Does Carl’s height among men
exceed Jane’s height among women? Men have
mean height of 68 inches and a standard
deviation of 2.8 inches while women have mean
height of 63.6 inches and a standard deviation of
2.5 inches.
Example
The average teacher’s salary in a particular city
is P54,166. If the standard deviation is P10,200,
find the salaries corresponding to the following z
scores.
• 2
• -1.6
• 2.5

Example
The mean time to download a pdf file is 12 min with
a standard deviation of 4 min. Belle’s download
time is 20 min. John’s download time is 6 min.
How does Belle’s download time compare with
John’s?
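The three examples above all rest on z = (x − µ)/σ and its inverse x = µ + zσ. A short sketch, with illustrative function names `z_score` and `value_from_z`:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


def value_from_z(z, mean, sd):
    """Inverse of the z-score: the raw value for a given z."""
    return mean + z * sd


# Carl versus Jane, each measured against their own group.
z_carl = z_score(78, 68, 2.8)
z_jane = z_score(76, 63.6, 2.5)

# Teacher salaries for z = 2, -1.6, and 2.5.
salaries = [value_from_z(z, 54166, 10200) for z in (2, -1.6, 2.5)]

# Download times for Belle and John.
z_belle = z_score(20, 12, 4)
z_john = z_score(6, 12, 4)

print(z_carl, z_jane)   # Jane has the larger z-score
print(salaries)         # about 74566, 37846, and 79666
print(z_belle, z_john)  # 2.0 and -1.5
```

Jane's z-score is larger than Carl's, so she is relatively taller within her group even though Carl is taller in inches. Belle's time is 2 standard deviations above the mean, while John's is 1.5 standard deviations below it.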
Frequency Distribution

Example
Consider the following completion time (in
minutes) of the 50 students doing an activity in
the laboratory.

25 29 30 32 36 36 39 40 40 44
45 48 49 50 50 51 54 55 55 55
55 56 57 57 59 60 60 60 61 61
61 63 65 65 65 67 68 70 71 74
74 76 77 77 80 81 81 83 84 90

Ordered Array is a listing of values from the smallest to
the largest or conversely.
Stem and Leaf display of data is a device that is useful in
presenting relatively small quantitative data sets.
The Frequency Distribution Table
Frequency Distribution refers to the tabular
arrangement of data by non-overlapping classes
or categories together with their corresponding
class frequencies.
How to construct frequency distribution
1 Select the number of class intervals or groupings
(k). By Sturges' rule, k is the smallest integer greater
than or equal to 1 + log(n)/log(2) = 1 + 3.322 log(n),
where log is the base-10 logarithm and n is the number
of data values.
2 Compute the class width: the range divided by k,
rounded up.
3 Determine the lower and the upper limit of the
intervals.
4 Determine the frequency of values falling within each
class interval.
For the completion times, n = 50, so
1 + 3.322 log(50) ≈ 6.64 and k = 7.
class width = (90 − 25)/7 ≈ 9.29, rounded up to 10

Completion time (in minutes) of the 50 students


Class limits   Class Boundaries   Tally                Frequency
25 - 34        24.5 - 34.5        ||||                 4
35 - 44        34.5 - 44.5        ||||| |              6
45 - 54        44.5 - 54.5        ||||| ||             7
55 - 64        54.5 - 64.5        ||||| ||||| |||||    15
65 - 74        64.5 - 74.5        ||||| ||||           9
75 - 84        74.5 - 84.5        ||||| |||            8
85 - 94        84.5 - 94.5        |                    1
Total                                                  50
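The construction steps can be sketched in Python for the 50 completion times. The assumptions here are that the first class starts at the smallest observation and that the class width is rounded up to a whole number, which matches the table above; the variable names are illustrative.

```python
from math import ceil, log2

times = [
    25, 29, 30, 32, 36, 36, 39, 40, 40, 44,
    45, 48, 49, 50, 50, 51, 54, 55, 55, 55,
    55, 56, 57, 57, 59, 60, 60, 60, 61, 61,
    61, 63, 65, 65, 65, 67, 68, 70, 71, 74,
    74, 76, 77, 77, 80, 81, 81, 83, 84, 90,
]

n = len(times)
k = ceil(1 + log2(n))                          # Sturges' rule
width = ceil((max(times) - min(times)) / k)    # range / k, rounded up

# Class limits: each class covers `width` consecutive integer values.
start = min(times)
classes = [(start + i * width, start + (i + 1) * width - 1) for i in range(k)]

# Count the observations falling within each pair of class limits.
frequencies = [sum(lo <= t <= hi for t in times) for lo, hi in classes]

for (lo, hi), f in zip(classes, frequencies):
    print(f"{lo} - {hi}: {f}")
```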
Graphical
A. Histogram
Histogram is a bar graph in which the horizontal
scale represents classes of data values and the
vertical scale represents frequencies. The heights
of the bars correspond to the frequency values
and the bars are drawn adjacent to each other
(without gaps).
B. Frequency Polygon
Frequency polygon uses line segments connected
to points located directly above class midpoint
values.

Completion time (in minutes) of the 50 students


Class limits Class Marks Frequency
25 - 34 29.5 4
35 - 44 39.5 6
45 - 54 49.5 7
55 - 64 59.5 15
65 - 74 69.5 9
75 - 84 79.5 8
85 - 94 89.5 1
Total 50
Cumulative Frequency Polygon (Ogive)
Ogive is a line graph that depicts cumulative
frequencies, just as the cumulative frequency
distribution lists them.
Less than cumulative frequency tells the number
of observations which are less than the upper
class boundary of the interval.
Greater than cumulative frequency tells the
number of observations which are greater than the
lower class boundary of the interval.
Completion time (in minutes) of the 50 students
Class limits Class Boundaries Frequency <cf >cf
25 - 34 24.5 - 34.5 4 4 50
35 - 44 34.5 - 44.5 6 10 46
45 - 54 44.5 - 54.5 7 17 40
55 - 64 54.5 - 64.5 15 32 33
65 - 74 64.5 - 74.5 9 41 18
75 - 84 74.5 - 84.5 8 49 9
85 - 94 84.5 - 94.5 1 50 1
Total 50
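The two cumulative columns in the table above are running totals of the class frequencies, taken from the top for <cf and from the bottom for >cf. A brief sketch using `itertools.accumulate` from the standard library:

```python
from itertools import accumulate

# Class frequencies from the completion-time table, lowest class first.
frequencies = [4, 6, 7, 15, 9, 8, 1]

# <cf: running total from the lowest class upward.
less_than_cf = list(accumulate(frequencies))

# >cf: running total from the highest class downward, listed lowest class first.
greater_than_cf = list(accumulate(reversed(frequencies)))[::-1]

print(less_than_cf)     # [4, 10, 17, 32, 41, 49, 50]
print(greater_than_cf)  # [50, 46, 40, 33, 18, 9, 1]
```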
Boxplot

A boxplot is also called a box-and-whisker
plot. It is a graphical representation of a summary
of five important values:
• minimum
• first quartile
• median
• third quartile
• maximum value

The five important values are also called the
five-number summary of a data set. A boxplot can
also be used to detect outliers.
Steps in constructing a boxplot

1 Determine the five-number summary and the
interquartile range (IQR = Q3 − Q1). Then compute
the values of the fences.
The values for the fences are given below:
Inner Fence: Q1 - 1.5IQR and Q3 + 1.5IQR
Outer Fence: Q1 - 3IQR and Q3 + 3IQR
2 Draw a box with the ends of the box at the first and
third quartiles
3 Draw a vertical line inside the box at the location of
the median
4 Draw horizontal dashed lines (called whiskers) from
the ends of the box to the minimum and maximum
values in the data set
5 Construct fences
Example
Construct a boxplot for the given data set:
Number of rooms Occupied in a resort during a
10-day period

12 12 13 14 14
16 17 19 19 25
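Before drawing the boxplot, the numbers it needs can be computed as sketched below. The quartiles are found with the median-of-halves convention, which is an assumption since the slides do not name a method; other conventions may shift Q1 and Q3 slightly.

```python
from statistics import median

rooms = [12, 12, 13, 14, 14, 16, 17, 19, 19, 25]

ordered = sorted(rooms)
half = len(ordered) // 2
q1 = median(ordered[:half])    # median of the lower half
q2 = median(ordered)           # overall median
q3 = median(ordered[-half:])   # median of the upper half
iqr = q3 - q1

five_number_summary = (ordered[0], q1, q2, q3, ordered[-1])
inner_fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
outer_fences = (q1 - 3 * iqr, q3 + 3 * iqr)

# Values beyond the inner fences are flagged as possible outliers.
outliers = [x for x in ordered if x < inner_fences[0] or x > inner_fences[1]]

print(five_number_summary)
print(inner_fences, outer_fences)
print(outliers)
```

Under this convention the largest value, 25, still lies inside the upper inner fence, so no observation is flagged as an outlier.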
Chebyshev’s Inequality

Chebyshev’s inequality makes it possible to
make assertions about the proportion of data
values that must be within a certain interval. It
states that, for any k > 1, the probability that an
observation will be within k standard deviations of
the mean is at least 1 − 1/k². This also means that at
least 1 − 1/k² of the data values must be within k
standard deviations of the mean.
Implications of Chebyshev’s inequality
1 For k = 2, at least 75 percent of the data
values must be within two standard
deviations of the mean.
2 For k = 3, at least 88.9 percent (about 89
percent) of the data values must be within
three standard deviations of the mean.
3 For k = 4, at least 93.75 percent (about 94
percent) of the data values must be within
four standard deviations of the mean.
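The bound is a one-line computation. In this sketch, `chebyshev_bound` is an illustrative name and the guard for k ≤ 1 reflects the fact that the inequality says nothing useful there.

```python
def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k ** 2


bounds = {k: chebyshev_bound(k) for k in (2, 3, 4)}
for k, bound in bounds.items():
    print(f"k = {k}: at least {bound:.2%}")
```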
Measures of Skewness

Skewness measures the deviation from symmetry.

\[ SK = \frac{3(\mu - \text{median})}{\sigma} \tag{10} \]
\[ SK = \frac{3(\bar{x} - \text{median})}{s} \tag{11} \]

Example
The scores of the students in the Prelim Exam have
a median of 18 and a mean of 16. What does this
indicate about the shape of the distribution of the
scores?
The Normal Distribution

The normal (or Gaussian) distribution or curve is
defined as follows:
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}} \]
where µ is any real number and σ > 0.
Denote the normal distribution with mean µ and
variance σ² by N(µ, σ²).
Properties of a normal curve:
1 It is symmetrical about the mean.
2 The mean is equal to the median, which is
also equal to the mode.
3 The tails or ends are asymptotic to
the horizontal axis.
4 The total area under the normal curve is
equal to 1 or 100%.
5 The normal curve area may be subdivided
into at least three standard scores each to
the left and to the right of the vertical axis.
In a normal distribution, approximately
1 68% of the data lie within 1 standard
deviation of the mean.
2 95% of the data lie within 2 standard
deviations of the mean.
3 99.7% of the data lie within 3 standard
deviations of the mean.
Example
A vegetable distributor knows that during the
month of August, the weights of its tomatoes are
normally distributed with a mean of 0.61 lb and a
standard deviation of 0.15 lb.
1 What percent of the tomatoes weigh less
than 0.76 lb?
2 In a shipment of 6000 tomatoes, how many
tomatoes can be expected to weigh more than
0.31 lb?
3 In a shipment of 4500 tomatoes, how many
tomatoes can be expected to weigh from 0.31
lb to 0.91 lb?
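The tomato questions can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This sketch uses the exact normal CDF, so the results differ slightly from answers obtained with the rounded 68%, 95%, 99.7% rule; the variable names are illustrative.

```python
from statistics import NormalDist

# Tomato weights: mean 0.61 lb, standard deviation 0.15 lb.
weights = NormalDist(mu=0.61, sigma=0.15)

# 1. Percent weighing less than 0.76 lb (one standard deviation above the mean).
pct_below_076 = weights.cdf(0.76) * 100

# 2. Expected count above 0.31 lb (two standard deviations below) out of 6000.
count_above_031 = 6000 * (1 - weights.cdf(0.31))

# 3. Expected count between 0.31 lb and 0.91 lb (within two sd) out of 4500.
count_between = 4500 * (weights.cdf(0.91) - weights.cdf(0.31))

print(f"{pct_below_076:.1f}% weigh less than 0.76 lb")
print(f"about {count_above_031:.0f} of 6000 weigh more than 0.31 lb")
print(f"about {count_between:.0f} of 4500 weigh from 0.31 lb to 0.91 lb")
```

With the empirical rule the same three answers come out as 84%, 5850 tomatoes, and 4275 tomatoes, which is the level of precision the slide's rule of thumb supports.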
Standard Normal Distribution

The standard normal distribution is the normal
distribution that has a mean of 0 and a standard
deviation of 1.
Letting z = (x − µ)/σ, we obtain the standard normal
distribution
\[ \phi(z) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2} z^2} \]
All normally distributed variables can be
transformed into the standard normally
distributed variable using the z-score.
\[ z_x = \frac{x - \mu}{\sigma} \]
\[ z_x = \frac{x - \bar{x}}{s} \]
The Standard Normal Distribution,
Areas, Percentages, and Probabilities
In the standard normal distribution, the area of
the distribution from z = a to z = b represents
1 the percentage of z-values that lie in the
interval from a to b.
2 the probability that z lies in the interval from
a to b.
Find the probabilities for each, using the standard
normal distribution.
1 P(0 ≤ z ≤ 1.96)
2 P(-1.23 ≤ z ≤ 0)
3 P(z ≤ -1.77)
4 P(0.20 ≤ z ≤ 1.56)
5 P(z ≥ -1.43)
6 P(z ≥ 0.82)
• Find a z-score such that 10 percent of the
area under the standard normal curve is
above that score.
• Find a z-score such that 24 percent of the
area under the standard normal curve is
below that score.
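Instead of a printed z-table, the exercises above can be checked with `statistics.NormalDist` (Python 3.8 or later). This is a sketch only: `prob_between` is an illustrative helper, `cdf` gives the area to the left of z, and `inv_cdf` goes from an area back to a z-score.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1


def prob_between(a, b):
    """Area under the standard normal curve from z = a to z = b."""
    return Z.cdf(b) - Z.cdf(a)


p1 = prob_between(0, 1.96)      # P(0 <= z <= 1.96)
p2 = prob_between(-1.23, 0)     # P(-1.23 <= z <= 0)
p3 = Z.cdf(-1.77)               # P(z <= -1.77)
p4 = prob_between(0.20, 1.56)   # P(0.20 <= z <= 1.56)
p5 = 1 - Z.cdf(-1.43)           # P(z >= -1.43)
p6 = 1 - Z.cdf(0.82)            # P(z >= 0.82)

# 10 percent of the area above the score means 90 percent below it.
z_top_10 = Z.inv_cdf(0.90)
# 24 percent of the area below the score.
z_bottom_24 = Z.inv_cdf(0.24)

for label, p in zip("123456", (p1, p2, p3, p4, p5, p6)):
    print(f"{label}: {p:.4f}")
print(f"z with 10% above: {z_top_10:.2f}")
print(f"z with 24% below: {z_bottom_24:.2f}")
```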
The diameters of steel bearings are normally
distributed with a mean of 12 cm and a standard
deviation of 0.9 cm.
1 What proportion of bearings will have
diameters exceeding 10.56 cm?
2 What is the probability that a bearing will
have a diameter between 10.29 and 14 cm?
3 If there are 1000 steel bearings, how many
will have a diameter between 10.29 and 14
cm?
Correlation and Regression

Correlation is a statistical method used to
determine whether a relationship between
variables exists.
Regression is a statistical method used to
describe the nature of the relationship between
variables, that is, positive or negative, linear or
nonlinear.
A scatter plot is a graph of the ordered pairs
(x, y) of numbers consisting of the independent
variable x and the dependent variable y.
Example
Construct a scatter plot for the data shown for car
rental companies in City A for a recent year.

Company   Cars (in ten thousands)   Revenue (in billions)
A 63.0 7.0
B 29.0 3.9
C 20.8 2.1
D 19.1 2.8
E 13.4 1.4
F 8.5 1.5
The Correlation coefficient measures the strength
and direction of a linear relationship between two
variables.
The range of the correlation coefficient is from −1
to +1.
Formula for the Correlation Coefficient r
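\[ r = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}} \]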

where n is the number of data pairs.


Example
Compute the correlation coefficient for the data:

Company   Cars (in ten thousands)   Revenue (in billions)
A 63.0 7.0
B 29.0 3.9
C 20.8 2.1
D 19.1 2.8
E 13.4 1.4
F 8.5 1.5
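The correlation coefficient formula above can be evaluated term by term for the car rental data. In this sketch the function name `correlation` and the list names are illustrative; the sums correspond to Σx, Σy, Σxy, Σx², and Σy² in the formula.

```python
from math import sqrt

# Car rental data: cars in ten thousands (x), revenue in billions (y).
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
revenue = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]


def correlation(xs, ys):
    """Pearson correlation coefficient r from the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = correlation(cars, revenue)
print(f"r = {r:.3f}")  # close to +1: a strong positive linear relationship
```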
If the value of the correlation coefficient is
significant, the next step is to determine the
equation of the regression line, which is the
data’s line of best fit.
This enables the researcher to see the trend and
make predictions on the basis of the data.
The equation of the least-squares line for the
ordered pairs (x1, y1), (x2, y2), ..., (xn, yn) is the
line

\[ y - \bar{y} = m(x - \bar{x}) \]
where:
x̄ = mean of variable x
ȳ = mean of variable y
m = slope of the line
\[ m = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n(\bar{x})^2} \]
Example
Find the equation of the regression line for the
data

Company   Cars (in ten thousands)   Revenue (in billions)
A 63.0 7.0
B 29.0 3.9
C 20.8 2.1
D 19.1 2.8
E 13.4 1.4
F 8.5 1.5
Another formula for the Regression line
y = a + bx.
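\[ a = \frac{\left(\sum y\right)\left(\sum x^2\right) - \left(\sum x\right)\left(\sum xy\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2} \]
\[ b = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{n\left(\sum x^2\right) - \left(\sum x\right)^2} \]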
where a is the y intercept and b is the slope of the line.
The Coefficient of Determination is the proportion
of the variation of the dependent variable that is
explained by the regression line and the
independent variable. The symbol for the
coefficient of determination is r². If r = 0.90, then
r² = 0.81, which is equivalent to 81%. This result
means that 81% of the variation in the dependent
variable is accounted for by the variations in the
independent variable. The rest of the variation,
0.19, or 19%, is unexplained.
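For the car rental data, the intercept a, the slope b, and the coefficient of determination can be computed from the same sums. This is a sketch under the y = a + bx form given above; the names `fit_line` and `predict` are illustrative, and the prediction at x = 40 is only a usage example.

```python
# Car rental data: cars in ten thousands (x), revenue in billions (y).
cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]
revenue = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]


def fit_line(xs, ys):
    """Return (a, b, r_squared) for the least-squares line y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sxx = n * sum_x2 - sum_x ** 2   # shared denominator of a and b
    syy = n * sum_y2 - sum_y ** 2
    sxy = n * sum_xy - sum_x * sum_y
    a = (sum_y * sum_x2 - sum_x * sum_xy) / sxx
    b = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy)   # square of the correlation coefficient
    return a, b, r_squared


a, b, r_squared = fit_line(cars, revenue)


def predict(x):
    """Revenue predicted by the fitted line for a given number of cars."""
    return a + b * x


print(f"y = {a:.3f} + {b:.3f}x")
print(f"r^2 = {r_squared:.3f}")
print(f"predicted revenue for x = 40: {predict(40):.2f}")
```

Here r² is roughly 0.96, so about 96% of the variation in revenue is accounted for by the number of cars, and the remaining 4% or so is unexplained.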
